Generic method for clustering attractors and initial conditions via featurizing

KalelR commented 2 years ago

Idea is to implement a function that can be used (i) by the user to cluster attractors and (ii) by bSTAB. There's some code repetition because the initial conditions need to be integrated and have their features extracted in the same loop (so we don't need to keep all trajectories), but the attractors of course don't. So the path for both types is (i): initial conditions -> integrate into trajectories and extract into features in the same loop -> clusterize (ii): attractors (or trajectories in general that are already saved) -> extract into features -> clusterize

So the only part that is used identically by both is the clusterizing.

This is still not finished, but I decided to open the PR already to ask for your advice. What I did so far is:

Defined cluster_datasets and the accompanying extract_features. The latter works exactly as in the case for receiving initial conditions, except of course it does not integrate any trajectories.
Use cluster_datasets in basins_fractions
Added a comment for basins_fraction just to explain what N is (unrelated to the PR itself, but helpful in the code).
Added an additional check for classify_features_clustering for the case the number of features (ie of ics or of attractors) is smaller than the dimension of the feature space (ie how many features are calculated). This is because dbscan errors otherwise, and is not necessarily only for the case for here where we input attractors. It could also be a source of error for inputing ics. It's a very unlikely case, and I only noticed it could happen now while testing this new function.

The problem is that, currently, the function receives the mapper AttractorsViaFeaturizing, which requires a dynamical system. But requiring a dynamical system makes no sense when we're inputting just the attractors, as we just want to featurize then cluster them, not integrate anything. An exception would be if the user wants the supervised method, and would additionally to the attractors also input ics as templates for the clustering. Then the algorithm would integrate these initial conditions and compare them with the input attractors, and therefore the dynamical system would of course be required. But for the unsupervised case I'm not sure what to do. We could

Ask the user to just input some dummy dynamical system (eg logistic() or even dummy() or something)
Change AttractorsViaFeaturizing so that it can be created without inputing a dynamical system (doesn't seem intuitive)
Create a new struct ClustersViaFeaturizing or something (some more code repetition)

Or something better that I haven't though. What do you think?

I also haven't tested the current code extensively because attractor_mapping_tests are failing even in the master branch for me. Is this also happening on other machines?

Datseris commented 2 years ago

Hmm, I see why the design here isn't that good... Because we have AttractorsViaFeaturizing both be an attractor mapper subtype, but also contain the keywords for clustering. It is really weird to have to provide a dynamical system for clustering datasets, so we need to rethink the design.

In my view, cluster_datasets should be independent of AttractorMapper and its subtypes. This means, it should take in an iterator of Datasets only and should not require an AttractorsViaFeaturizing instance. (By the way: there is no reason to limit the collection of datasets to a dictionary. any collection should be allowed. I.e., do not specify type).

There are two ways to go about this:

Make the function cluster_datasets have the same keywords as AttractorsViaFeaturizing.
Make a new parameters struct that has the specification of the clustering process. ClusteringSpecs(...). Then this is given as a second argument to cluster_datasets but also as an argument to AttractorsViaFeaturizing (which now will not have any keywords for the clustering).

I prefer the second method much more. First, it allows us to expand the clustering process with other clustering types if need be in the future, and second it removes the duplication of the clustering keyword arguments, that would otherwise have ot be given to both AttractorsViaFeaturizing and cluster_datasets.

What do you think? If you agree, do you think you can rework the code in this manner? This means that all clustering functions will no longer take in a AttractorsViaFeaturizing instance. Rather, they would take the ClusteringSpecs instance (which itself will be a field of AttractorsViaFeaturizing).

Datseris commented 2 years ago

hm, i'm not sure what to do with the supervised method for cluster_datasets...

KalelR commented 2 years ago

What do you think? If you agree, do you think you can rework the code in this manner? This means that all clustering functions will no longer take in a AttractorsViaFeaturizing instance. Rather, they would take the ClusteringSpecs instance (which itself will be a field of AttractorsViaFeaturizing).

Yeah, good idea. This is what I did now.

KalelR commented 2 years ago

hm, i'm not sure what to do with the supervised method for cluster_datasets...

Yeah, the supervised method is currently broken because we need to decide what to do. What I thought was:

For when passing trajectories (when the cluster_datasets receives only ClusteringSpecs), the user can pass other attractors as templates. These would be a field attractors_template in ClusteringSpecs. The features of these are then extracted and used as templates by kNN.

For when passing ics (in basin_fractions which receives AttractorsViaFeaturizer), the user could either (i) pass trajectories as templates (using the field attractors_template in ClusteringSpecs), exactly as in the above case, and using the same code; or (ii) pass initial conditions via attractors_ic in AttractorsViaFeaturizing. For the latter, the ics would be integrated and then the resulting trajectories would be added to ClusteringSpecs to be used for the clustering. This would then be compatible with the previous versions of the code.

But I'm having trouble with this method (ii) now because it needs to modify the already-existing cluster ClusteringSpecs and change attractors_template from Nothing to the integrated trajectories. Just making the struct mutable does not work, because that does not allow for changing the type of the element. Not sure what the best solution is.

(I will update the documentation as soon as we solve this)

Datseris commented 2 years ago

Please move the docstring of the function

function ClusteringSpecs(featurizer::Function;
        attractors_template::Union{AbstractDataset, Dict, Nothing, Vector}=nothing,
        clust_method_norm=Euclidean(), clustering_threshold = 0.0, min_neighbors = 10,
        clust_method = clustering_threshold > 0 ? "kNN_thresholded" : "kNN", 
        rescale_features=true, optimal_radius_method="silhouettes",
    )

to be the docstring the type definition, and then move this function directly after the type definition. It makes for much more clear source code. At the moment you need to scroll several 100s of loc to find the docstring of the type ClusteringSpecs.

Let's rename ClusteringSpecs to ClusteringConfig. Most Julia packages use the word "Config" when defining configuration options, so this name is a bit more harmonious.

Datseris commented 2 years ago

At the moment the usage of cluster_datasets is confusing... What does this function do?

function cluster_datasets(dataset::Union{AbstractDataset, Function},
     mapper::AttractorsViaFeaturizing; show_progress=true, N=1000)
    feature_array = extract_features(mapper, dataset; show_progress, N)
    if !isnothing(mapper.attractors_ic) 
        attractors_template = integrate_ics(mapper, mapper.attractors_ic; show_progress=false)
        mapper.cluster_specs.attractors_template = attractors_template #errors: MethodError: Cannot `convert` an object of type Vector{Dataset} to an object of type Nothing
    end

    class_labels, class_errors = classify_features(feature_array, mapper.cluster_specs)
    return class_labels, class_errors
end

It both creates the datasets, and then clusters them? if so, then this is not good design. A function should have a single purpose, and this is the purpose conveyed by the name. We need to split the code so that the code that creates the features is separated by the code that clusters them.

Notice that it's okay that cluster_dataset maps datasets to features and then clusters them. but generating the datasets should happen in a separate function.

Datseris commented 2 years ago

Okay, seems like we need a rework here. The AttractorsViaFeaturizing does NOT need to store all datasets/trajectories produced. Yet, that's exactly what happens in this function:

function integrate_ics(mapper::AttractorsViaFeaturizing, ics::AbstractDataset;
    show_progress = true)
    N = size(ics, 1) # number of actual ICs
    trajectories = Vector{Dataset}(undef, N)
    Threads.@threads for i ∈ 1:N
        ic = ChaosTools._get_ic(ics,i)
        trajectories[i] = trajectory(mapper.ds, mapper.total, ic;
        Ttr = mapper.Ttr, Δt = mapper.Δt, diffeq = mapper.diffeq)
        show_progress && ProgressMeter.next!(progress)
    end
    return trajectories
end

We need to separate code into three parts:

Clustering features.
Translating existing datasets into features.
Producing, and on the spot, translating trajectories into features, which is what AttractorsViaFeaturizing wants to do.

There is no reason for (1, 2, 3) to depend on each other. High level functions can call 2 or 3, and then pass the result into 1.

EDIT: This also means that ClusteringConfig should not store any featurizer of sorts.

Datseris commented 2 years ago

aaaaarg the reason we have this "featurizer" as part of the ClusterConfig is because of these stupid templates right? Because the templates are initial conditions each leading to a given attractor that you need to evolve and then map into features and then do a knn of the distances of the features to the templates.

Grrr this templating business is soooo weird.

Datseris commented 2 years ago

Well, here is an idea: we can completely drop this template version. If you think about it, the AttractorsViaProximity is a superior version to the template version anyways, no? If you have initial conditions you "know" go to a specific attractor, then you have the attractors. Then you can just take these attractors and use the proximity version?

I am trying to think of scenarios where the proximity version wouldn't work. I guess one case would be the cases of identical coupled oscillators we discussed, where you may want to group different attractors into one because they are "practically the same". But in this case, would it really help you to have the templates? I mean, wouldn't the featurizing and clustering by itself lead to the same result...?

KalelR commented 2 years ago

aaaaarg the reason we have this "featurizer" as part of the ClusterConfig is because of these stupid templates right? Because the templates are initial conditions each leading to a given attractor that you need to evolve and then map into features and then do a knn of the distances of the features to the templates.

featurizer is part of ClusterConfig because it is needed to extract the features before clustering. We do this either when we integrate the trajectories (ics case) or when we receive the trajectories/attractors (new case we're implementing). For the latter, we had decided to use only ClusterConfig and not AttractorsViaFeaturizing, so the featurizer needed to be inside ClusterConfig. So for the case of attractors, the user passes ClusterConfig to cluster_datasets, which extracts the features and clusters them (basically calling items 2 and 1, which you mentioned). For the case of ics, it integrates and extracts (item 3) and then clusterizes (1). Do you think this should be changed?

But indeed, the templates are the issue complicating the design. Because then they need the parameters for the integration (which are in AttractorsViaFeaturizing).

Well, here is an idea: we can completely drop this template version. If you think about it, the AttractorsViaProximity is a superior version to the template version anyways, no? If you have initial conditions you "know" go to a specific attractor, then you have the attractors. Then you can just take these attractors and use the proximity version?

I agree that using the supervised (template) version is just worse than the AttractorsViaProximity, in the case we had implemented so far, where the user is passing the initial conditions for the attractor. I agree we could just remove it, and this would simplify the design a lot.

However, maybe passing templates as features directly, instead of initial conditions, could be useful. For instance in the networks of oscillators, the user may not know the attractors themselves, but only their values of phase synchronization (eg R = 0.2 and R = 0.9). By passing the quantifier for phase synchronization in the features, bSTAB would already cluster the attractors based on this, but would find by itself the values of phase synchronization. So an alternative is to pass the user-known values, then we clusterize based on the distance to these values using kNN. This is very easy to implement with the code we have already implemented. The user would pass attractors_template to ClusterConfig, which would be directly used as templates by the supervised method (which would no longer receive ics to integrate).

Datseris commented 2 years ago

However, maybe passing templates as features directly, instead of initial conditions, could be useful

I agree with this idea and I like it a lot!. Let's allow a template version where the user gives features already as templates. This means we drop the current templating verison that complicates the design so much because of its integration needs.

For the case of ics, it integrates and extracts (item 3) and then clusterizes (1). Do you think this should be changed?

To clarify: the function cluster_datasets has no reason to do any integration. It should take three arguments:

The datasets to be clustered
A featurizer mapping datasets to features
The ClusterConfig structure, which should only have information for the clustering.

AttractorsViaFeaturizing has inside it 2, 3 and also info on how to produce 1 via integration.

I think now we have to face the fact that we cannot use the function cluster_datasets for the AttractorsViaFeaturizing, because the latter doesn't produce all the datasets and then featurizes them. Instead, it produces features and then featurizes them. Hence, this is the function we need:

function cluster_datasets(datasets, featurizer, clusterconfig)
  features = featurizer.(datasets)
  return cluster_features(features, clusterconfig)
end

cluster_features(features, clusterconfig) is the function that is also called by AttractorsViaFeaturizing.

I hope that this clarifies things now. If not, it is probably more efficient for both of us to have a quick call :D

Datseris commented 2 years ago

This isn't ready for review, right? E.g., have you done the change where the templates are already features? (also now we should rename attractor_templates to just templates)

KalelR commented 2 years ago

This isn't ready for review, right? E.g., have you done the change where the templates are already features? (also now we should rename attractor_templates to just templates)

Should be. Yes, cluster_datasets now does two clusterings

Unsupervised: dbscan algorithm, when no templates are passed. Via cluster_features_clustering (maybe cluster_features_unsupervised is a better name?)
Supervised: kNN, templatesare passed as a vector of features. Via cluster_features_distances (maybe cluster_features_supervised is a better name?).

Some tests:

using DynamicalSystems
using OrdinaryDiffEq

# Lorenz

F = 6.886; G = 1.347; a = 0.255; b = 4.0
ds = Systems.lorenz84(; F, G, a, b)
u0s = [
    1 => [2.0, 1, 0], # periodic
    2 => [-2.0, 1, 0], # chaotic
    3 => [0, 1.5, 1.0], # fixed point
]
diffeq = (alg = Vern9(), reltol = 1e-9, abstol = 1e-9)
M = 50; z = 3
M = 10; z = 3
xg = yg = zg = range(-z, z; length = M)
grid = (xg, yg, zg)

function featurizer(A, t)
    g = exp(genentropy(A, 0.1; q = 0))
    return [g, estimate_period(A[:,1], :zc)]
end
clstrspcs = ClusteringConfig(; min_neighbors=1)
mapper = AttractorsViaFeaturizing(ds, featurizer, clstrspcs)
basins, attractors = basins_of_attraction(mapper, grid; show_progress = true)

a1 = attractors[1]
a2 = attractors[2]
a3 = attractors[3]
attractors = Dict(1 => a1, 2=> Dataset(Matrix(a1) .+ rand()), 3=>Dataset(Matrix(a1) .+ rand()), 4=>Dataset(Matrix(a1) .+ rand()), 5=>a2, 6=>Dataset(Matrix(a2) .+ rand()), 7=>Dataset(Matrix(a3) .+ rand()), 8=>Dataset(Matrix(a3) .+ rand())) #[1, 1, 1, 1, 2,2, 3, 3] with noise

## Unsupervised
clstrspcs = ClusteringConfig(; min_neighbors=1)
clust_labels, clust_errors = cluster_datasets(attractors, featurizer, clstrspcs)

## SUper
templates = [featurizer(a, []) for a ∈ [a1, a2, a3]]
clstrspcs = ClusteringConfig(; templates, min_neighbors=1)
clust_labels, clust_errors = cluster_datasets(attractors, featurizer, clstrspcs)

# Artificial
## Unsupervised version 
a1 = Dataset([1 1])
a2 = Dataset([2 2])
a3 = Dataset([3 3])
attractors = Dict(1 => a1, 2=> a1, 3=>a1, 4=>a1, 5=>a2, 6=>a2, 7=>a3, 8=>a3)  #[1,1,1,1,2,2,3,3]
function featurizer(A, t)
    return [minimum(A[:,2]), maximum(A[:,2])]
end
clstrspcs = ClusteringConfig(; min_neighbors=1)
clust_labels, clust_errors = cluster_datasets(attractors, featurizer, clstrspcs)

# attractors = Dict(1 => a1, 2=> a1) #[1,2]
# clust_labels, clust_errors = cluster_datasets(attractors, featurizer, clstrspcs)

## Supervised
templates = [featurizer(a, []) for a ∈ [a1, a2, a3]]
clstrspcs = ClusteringConfig(; templates, min_neighbors=1)
clust_labels, clust_errors = cluster_datasets(attractors, featurizer, clstrspcs)

Datseris commented 2 years ago

This is done as far as I can tell. Only the following things remain:

Re-write tests for the new interface that uses ClusterConfig.
Write one new test explicitly for the cluster_features function.
Add the silhuette information in the ClusterConfig docstring.

@KalelR can you please do these things to wrap things up?

@KalelR I have two requests/tips for you if you please

Please create a branch in the origin of chaos tools instead of using your own fork. That's exactly the reason you are part of the org ;)
Please do not push changes to the source code that are not reflected in the docstirng. Before I took over the docstring said that templates is a dataset with initial conditions but the source was using templates already directly as features.

I realized there is no reason to provide a function cluster_datasets. Simplyt because it is the two-liner:

julia> featurizer(A, t) = [minimum(A[:, 1]), maximum(A[:, 2])]
featurizer (generic function with 1 method)

julia> x = [Dataset(rand(100, 3)) for i in 1:3]
3-element Vector{Dataset{3, Float64}}:
 3-dimensional Dataset{Float64} with 100 points
 3-dimensional Dataset{Float64} with 100 points
 3-dimensional Dataset{Float64} with 100 points

julia> featurizer.(x, 2)
3-element Vector{Vector{Float64}}:
 [0.0032883180061905604, 0.9732324878658828]
 [0.011357202225453333, 0.9866541355882478]
 [0.01981547884948287, 0.9922961052874202]

julia> cluster_features(ans, ClusterConfing(...))

awage commented 2 years ago

Hi! I have been using the function cluster_datasets to cluster attractors. It worked fine. Can you please maintain this function? It seems useful. If not I will rewrite the code.

Datseris commented 2 years ago

cluster_datasets is just

features = [featurizer(d, t) for d in datasets]
cluster_features(features, ...)

so I would still argue that we shouldn't offer this function. All functions exported by the API need to be tested and maintained, and I don't see particular benefit in adding this one to this list given how easy it is to do on the user side.

awage commented 2 years ago

Understood, I just have to replace these two lines. in the code.

KalelR commented 2 years ago

@KalelR can you please do these things to wrap things up? Yes, sure! Sorry, I'd been super busy the last weeks but now I can come back.

@KalelR I have two requests/tips for you if you please Got it, thanks!

KalelR commented 2 years ago

Ok. I pushed the code to my fork and also to a branch in the origin of ChaosTools.

1. Re-write tests for the new interface that uses ClusterConfig: done, but there are still two issues, unfortunately:

First, the basin fractions fs are not matching well for the duffing oscillator for the proximity and recurrences methods. No clue why, I did not change anything related. Possible fix is just to update the expected_fs values.
Second, the test @test all(x -> x ∈ known_ids, bids) does not work for the Henon case in the supervised clustering method, and that makes sense. Basically, the clustering method finds correctly two clusters, one corresponding to the real attractor and another to infinity, and so assigns them labels 1and 2. But we assign a priori label -1 to infinity since we know it's not a real attractor. Then the comparison of course does not match, but nor should it: the clustering algorithm does not know that, it just sees the infinity attractor as a [200, 200] point. We can just label infinity as 1 or 2, or we could implement a function that re-assigns the label for infinity as -1 after the clustering algorithm. This is a bit of work, and I don't see much benefit for the user.

2. Write one new test explicitly for the cluster_features function Done, I wrote an artificial test, as I think it's clearer. But I can also switch to a test using attractors from some system.

3. Add the silhuette information in the ClusterConfig docstring. Done. I added the information to the docstring of optimal_radius_dbscan_silhouette and referred to it in the docstring of ClusteringConfig, as I think then the information becomes more organized.

About the function cluster_datasets, it is just a two-liner, but would it not be convenient for the users to have it available? I wrote it in the artificial test, we could just export it if you agree.

Datseris commented 2 years ago

I get much more errors when I run this on my machine :( :( :(

But something seems to be wrong with the packages I have. I am getting errors in the root finding of Roots.jl for example.

Datseris commented 2 years ago

The way we currently did threading was unsafe. We cannot @threads over the trajectory of an integrator, because we are threading over the same references to the same object. I've removed it for now, it can be fixed after DynamicalSystems.jl 3.0.

Datseris commented 2 years ago

@KalelR everything is fixed now. I have one more request: I'd like you to change the way templates are provided in ClusterConfig. I think they should be like attractors are in AttractorsViaProximity. A dictionary mapping attractor labels to templates.

KalelR commented 2 years ago

Ok, now ClusteringConfigreceives template as a dict of labels to templates. The templates are features, stored as vectors. Internally, we do need to convert the templates to a matrix for kNN. Since this is just for kNN, I left the transformation into a Matrix in the corresponding function, cluster_features_distances, but I can also do that when creating the struct.

The cluster labels are given according to the order of the labels in the templates dictionary.

Datseris commented 2 years ago

The cluster labels are given according to the order of the labels in the templates dictionary.

Wait, you mean, the cluster labels are exactly the same as the template labels, right? In the supervised version, each feature gets the label of the cluster it is nearest to. The label of the cluster is the label given by the templates dict, right?

KalelR commented 2 years ago

The label of the cluster is the label given by the templates dict, right?

Yes, but indirectly. The kNN algorithm receives just a matrix with the templates and a matrix with the features. It then labels the features according to the column (or row, not sure off the top of my head) of the nearest template. So labelling here depends on the ordering of the templates of the matrix.

This matrix is constructed from the templates dict via templates = float.(reduce(hcat, [cluster_specs.templates[i] for i ∈1:length(cluster_specs.templates)] ) ). This orders the columns according to the labels of the dict, from 1 to the number of templates`. So the matrix is ordered according to the labels of the template dict, which need to range from 1 to the number of templates.

This makes it so that the cluster labels are the same as the labels of the template dict. But the dict labels need to be numbers on that range, which is maybe not ideal. If we were to allow arbitrary labels, we would still have to create the matrix, and would need to keep track of which columns of the matrix correspond to each label. This is a bit of work, which might not be worth the benefit.

Datseris commented 2 years ago

the labels of the template dict, which need to range from 1 to the number of templates.

Yeah, this is the problem. We can't assume this. That's exactly why we are using a Dict, so that we can have arbitrary mapping of labels to attractors. We need this to implement the "montecarlo basin bifurcation continuation" or however is called, and also for the optimized method I can up with for getting basin bifurcations with the recurrences methods.

I guess you need to make one more dictionary mapping the indices 1 to lengtgh(templates) to the keys of the templates, and in the last step use that dictionary to remap all labels to the ones enforced by the templates dictionary.

KalelR commented 2 years ago

So that the cluster labels are the template labels given by the user, right?

Datseris commented 2 years ago

exactly

Datseris commented 2 years ago

Also, let's please ensure that the labels can be of arbitrary type. There is no reason to enforce the labels to be integers, they might as well be strings, or anything that can be viably used as a dictionary key type. Might as well put a test for this in the test suite.

KalelR commented 2 years ago

Also, let's please ensure that the labels can be of arbitrary type. There is no reason to enforce the labels to be integers, they might as well be strings, or anything that can be viably used as a dictionary key type. Might as well put a test for this in the test suite.

Yup, ok.

KalelR commented 2 years ago

Example now:

KalelR commented 2 years ago

Can you please put this REPL example in the tests :) Then I'll merge!

It's the example in clustering_tests.jl :)

Datseris commented 2 years ago

@KalelR notice that you have not chaged the documentation string to reflect the changes, templates is still stated as a vector of vectors. Also be sure you move the silhuete stuff!

Datseris commented 2 years ago

(and remember my advice: first change the docstring, then change the function. Since the docstring reflects the public API, it is that that should guide you on what the source code will do)

KalelR commented 2 years ago

@KalelR notice that you have not chaged the documentation string to reflect the changes, templates is still stated as a vector of vectors. Also be sure you move the silhuete stuff!

Yup, sorry. I'll start with the docstring from now on!

KalelR commented 2 years ago

Now I noticed one issue: currently basins_of_attraction fails for non-number labels given in templates, basins is initialized as an Ìnt` matrix. I think we can either:

change basins to an arbitrary-type matrix; in the case of string template labels, it would be a matrix of strings
keep basins as an Int matrix: then we change the labels to numbers and return an additional dictionary with the correspondence between the string and number labels

Or is there a better solution?

KalelR commented 2 years ago

currently basins_of_attraction fails for non-number labels given in templates

Please do notice we still need to fix this. Should be quick to implement, I'd just like to see what you think is the best way.

Datseris commented 2 years ago

Please do notice we still need to fix this. Should be quick to implement, I'd just like to see what you think is the best way.

Ah sorry, can you please open a new issue so we don't forget?

Datseris commented 2 years ago

The solution is to keep basins as an int matrix and do a final replace after the function has finished computing, using a dictionary mapping labels to integers that we create at the start of the funtion. The recurrences method relies heavily on integers I Think.

EDIT: Actually no, just make the basins initialize the Array with keys the type of the dictioanry. The recurrences has key type int so its all good.

JuliaDynamics / ChaosTools.jl

Generic method for clustering attractors and initial conditions via featurizing #259