JuliaDynamics / ChaosTools.jl

Tools for the exploration of chaos and nonlinear dynamics
https://juliadynamics.github.io/DynamicalSystemsDocs.jl/chaostools/stable/
MIT License

Add clustering method for basins fractions continuation #279

Closed Datseris closed 1 year ago

Datseris commented 1 year ago

WIP, will write more here later.

Datseris commented 1 year ago

cc @awage @KalelR @maximilian-gelbrecht. I'm currently thinking about how to test this. I guess we'll use the systems Max used in his paper for a start.

Datseris commented 1 year ago

This is an example of how the method is called: test/basins/clustering_continuation_tests.jl

Datseris commented 1 year ago

@maximilian-gelbrecht would you please be so kind as to help us here with how to add the different clustering metrics that are aware of the parameter values? At the moment the code clusters across parameter space, but the clustering is unaware of the different parameter values.

maximilian-gelbrecht commented 1 year ago

Excuse my silence in the last weeks, I was on vacation.

For the MCBB method, I used as the clustering metric between integrations/samples $i$ and $j$

$$D_{ij} = \sum_k w_k \, \| \mathcal{S}_k(\rho_i) - \mathcal{S}_k(\rho_j) \| + w_p \, \| p_i - p_j \|,$$

where $\mathcal{S}_k(\rho_i)$ is the $k$-th statistic of some invariant measure of the attractor of sample $i$ (like the mean or variance in the simplest case), $\| \cdot \|$ is some metric of your choice (in my case either a Euclidean distance or a Wasserstein metric), and $w_k$, $w_p$ are additional weights that I often just kept equal to 1. So the parameter just comes into play as an additional term in the distance matrix.

This distance matrix is then used as an input to DBSCAN (or some other density based clustering).
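
For concreteness, a minimal sketch of this in Julia (the feature vectors, parameter values, and weights below are illustrative placeholders, and a single feature weight `w` stands in for the per-statistic $w_k$; the dbscan call mirrors the one used later in this thread):

    using Clustering

    # Hypothetical inputs: features[i] holds the statistics vector S(ρᵢ) of
    # sample i, and params[i] the corresponding parameter value pᵢ.
    features = [rand(3) for _ in 1:100]
    params = collect(range(0.0, 1.0; length = 100))

    # D_ij = w * ||S_i - S_j|| + wp * |p_i - p_j|
    function mcbb_distances(features, params; w = 1.0, wp = 1.0)
        n = length(features)
        D = zeros(n, n)
        for i in 1:n, j in (i + 1):n
            d = w * sqrt(sum(abs2, features[i] .- features[j]))  # feature term
            d += wp * abs(params[i] - params[j])                 # parameter term
            D[i, j] = D[j, i] = d
        end
        return D
    end

    D = mcbb_distances(features, params)
    db = dbscan(D, 0.5, 5)  # distance matrix, radius, min number of neighbors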

I'll look deeper into the code (and the MCBB implementation) hopefully on Friday.

Datseris commented 1 year ago

Yeah, this I remember from your paper; what I'm searching for at the moment is how you did it in code :D We call Clustering.dbscan. I guess you created a new metric and passed it to dbscan.

maximilian-gelbrecht commented 1 year ago

Just pass a matrix to Clustering.dbscan. In my old implementation I used either regular full matrices, sparse matrices (for larger problems), or memory-mapped arrays (for huge matrices/runs with $\mathcal{O}(10^5)$ trials).
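
For the huge runs, a sketch with Julia's Mmap standard library (the file name and trial count are illustrative):

    using Mmap

    # For n = 10^5 trials a dense Float64 distance matrix needs ~80 GB,
    # so back it by a file on disk instead of keeping it in RAM.
    n = 10^5
    io = open("distance_matrix.bin", "w+")
    D = mmap(io, Matrix{Float64}, (n, n))
    # ... fill D pairwise as usual, then hand it to Clustering.dbscan ...
    Mmap.sync!(D)  # flush changes to disk
    close(io)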

maximilian-gelbrecht commented 1 year ago

After taking a look at the code, I see two possibilities.

Datseris commented 1 year ago

Yeah, @KalelR has opened a PR which passes a distance matrix directly to dbscan.

maximilian-gelbrecht commented 1 year ago

Ok, great. I've just looked at that. The way this PR looks, we could proceed by adding the parameters to the feature vector and then using e.g. a weighted Euclidean distance as the metric (so that one can tweak how much the parameters are weighted).

I usually rescaled the parameter values for the clustering / distance calculation (e.g. to a [0,1] interval).
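
A minimal sketch of that combination, assuming Distances.jl's WeightedEuclidean (the weights and data below are illustrative):

    using Distances

    # Hypothetical per-sample features and parameter values
    features = [rand(3) for _ in 1:100]
    params = 5.0 .+ 10.0 .* rand(100)      # parameters in some arbitrary range

    # Rescale the parameters to [0, 1] before appending them to the features
    pmin, pmax = extrema(params)
    p01 = (params .- pmin) ./ (pmax - pmin)
    X = reduce(hcat, [vcat(f, p) for (f, p) in zip(features, p01)])  # 4×100

    # The last weight tweaks how strongly the parameter value counts
    metric = WeightedEuclidean([1.0, 1.0, 1.0, 0.5])
    D = pairwise(metric, X; dims = 2)  # distance matrix, one column per sample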

For MCBB, I'd then need to add a new metric (with the histograms).

maximilian-gelbrecht commented 1 year ago

For that, extract_features would need to get the parameter p as an additional input argument, and the featurizer as well.
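
Something along these lines, purely as a signature sketch (trajectory_for is a hypothetical helper; the actual extract_features in the PR looks different):

    # Proposed: thread the parameter value p through to the featurizer.
    function extract_features(ds, ics, params; featurizer)
        map(zip(ics, params)) do (u0, p)
            A, t = trajectory_for(ds, u0, p)  # integrate one sample at parameter p
            featurizer(A, t, p)               # the featurizer now also receives p
        end
    end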

maximilian-gelbrecht commented 1 year ago

Shall we do this in this PR or in a separate one?

Datseris commented 1 year ago

In this PR. We're waiting for the PR that reworks clustering to use a distance matrix to be merged first.

awage commented 1 year ago

@KalelR I need some advice on the clustering of features. In the MCBB method, the clustering happens on a distance matrix, not directly on the features. How do you choose the optimal radius in this case?

The code is on line 59 of continuation_clustering.jl:

    # L1 (Manhattan) distance matrix between all pairs of feature vectors
    Dk = [sum(abs.(x .- y)) for x in features, y in features]

    # Cluster on the distance matrix: radius 5, minimum number of neighbors 5
    db_res = dbscan(Dk, 5, round(Int, 5))
    cluster_labels = db_res.assignments

KalelR commented 1 year ago

The core of the optimal radius method just runs dbscan a certain number of times, searching for the optimal clustering. So running dbscan is the same, i.e. db_res = dbscan(Dk, radius, round(Int, 5)). The only difference now, without the features, would be in how we guess the candidate radii.

What we currently do with the features is look at the difference between the extremes of the features along each dimension of the clustering/feature space (i.e. along each distinct type of feature), and then base the radii on the smallest such difference. In terms of distances, the analogue would be to find the biggest distance along each dimension of the clustering space (so, for each dimension, the biggest gap between points) and then select the smallest value across dimensions. This value would be minimum(feat_ranges). I'm not sure how well it works, but I would go for it first.

A quick alternative might be to just guess between the smallest distance and the maximum and hope that Optim.jl finds a good value quickly. Alternatively, I've been thinking that choosing the candidate radii based on the optimal radius given by the knee method could work well. I can work on this tomorrow if what I said didn't make sense or if it doesn't work well :)
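
To make this concrete, here is a sketch of such a search over candidate radii on a precomputed distance matrix, using the mean silhouette value from Clustering.jl as a stand-in quality measure (the radius grid and the measure are placeholders, not the actual implementation):

    using Clustering
    using Statistics: mean

    # Dk is a precomputed distance matrix, as in the snippet above.
    function optimal_radius(Dk; min_neighbors = 5, nradii = 20)
        radii = range(minimum(Dk[Dk .> 0]), maximum(Dk); length = nradii)
        best_r, best_score = first(radii), -Inf
        for r in radii
            a = dbscan(Dk, r, min_neighbors).assignments
            keep = findall(!=(0), a)           # drop unclustered (noise) points
            isempty(keep) && continue
            maximum(a[keep]) >= 2 || continue  # silhouettes need >= 2 clusters
            counts = [count(==(k), a[keep]) for k in 1:maximum(a[keep])]
            score = mean(silhouettes(a[keep], counts, Dk[keep, keep]))
            if score > best_score
                best_r, best_score = r, score
            end
        end
        return best_r
    end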

awage commented 1 year ago

Thanks! I made a version that follows your implementation, so you don't need to spend much time on this. I will think about the tests.

Datseris commented 1 year ago

@awage please take the changes this PR introduces and open a PR in Attractors.jl: https://github.com/JuliaDynamics/Attractors.jl . You and @KalelR have push rights there. I have removed all attractor-related stuff in this repo in a refactor branch. Notice that Attractors.jl requires DynamicalSystemsBase 2.9 and DelayEmbeddings 2.5, both of which will be tagged soon.