Closed KalelR closed 2 years ago
Hmm, I see why the design here isn't that good... Because we have `AttractorsViaFeaturizing` both be an attractor mapper subtype, but also contain the keywords for clustering. It is really weird to have to provide a dynamical system for clustering datasets, so we need to rethink the design.
In my view, `cluster_datasets` should be independent of `AttractorMapper` and its subtypes. This means it should take in an iterator of `Dataset`s only and should not require an `AttractorsViaFeaturizing` instance. (By the way: there is no reason to limit the collection of datasets to a dictionary; any collection should be allowed, i.e., do not specify the type.)
There are two ways to go about this:

1. `cluster_datasets` has the same keywords as `AttractorsViaFeaturizing`.
2. We make a dedicated struct `ClusteringSpecs(...)`. This is then given as a second argument to `cluster_datasets`, but also as an argument to `AttractorsViaFeaturizing` (which now will not have any keywords for the clustering).

I prefer the second method much more. First, it allows us to expand the clustering process with other clustering types if need be in the future, and second it removes the duplication of the clustering keyword arguments, that would otherwise have to be given to both `AttractorsViaFeaturizing` and `cluster_datasets`.
What do you think? If you agree, do you think you can rework the code in this manner? This means that all clustering functions will no longer take in an `AttractorsViaFeaturizing` instance. Rather, they would take the `ClusteringSpecs` instance (which itself will be a field of `AttractorsViaFeaturizing`).
Hm, I'm not sure what to do with the supervised method for `cluster_datasets`...
> What do you think? If you agree, do you think you can rework the code in this manner? This means that all clustering functions will no longer take in an `AttractorsViaFeaturizing` instance. Rather, they would take the `ClusteringSpecs` instance (which itself will be a field of `AttractorsViaFeaturizing`).
Yeah, good idea. This is what I did now.
> Hm, I'm not sure what to do with the supervised method for `cluster_datasets`...
Yeah, the supervised method is currently broken because we need to decide what to do. What I thought was:

1. For when passing trajectories (when `cluster_datasets` receives only `ClusteringSpecs`), the user can pass other attractors as templates. These would be a field `attractors_template` in `ClusteringSpecs`. The features of these are then extracted and used as templates by kNN.
2. For when passing ics (in `basin_fractions`, which receives `AttractorsViaFeaturizer`), the user could either (i) pass trajectories as templates (using the field `attractors_template` in `ClusteringSpecs`), exactly as in the above case, and using the same code; or (ii) pass initial conditions via `attractors_ic` in `AttractorsViaFeaturizing`. For the latter, the ics would be integrated and then the resulting trajectories would be added to `ClusteringSpecs` to be used for the clustering. This would then be compatible with the previous versions of the code.

But I'm having trouble with method (ii) now because it needs to modify the already-existing `ClusteringSpecs` and change `attractors_template` from `Nothing` to the integrated trajectories. Just making the struct mutable does not work, because that does not allow for changing the type of the field. Not sure what the best solution is.
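To make the field-type issue concrete, here is a minimal sketch (struct and field names are hypothetical, not the actual package code): a mutable struct does let you reassign fields, but never to a value outside the field's declared type, so a field annotated `Nothing` can only ever hold `nothing`. Annotating the field with a `Union` that includes the target type is one possible fix.

```julia
# Hypothetical sketch of the problem and one possible fix.

# Problem: a field annotated `Nothing` can never hold anything else,
# even though the struct is mutable.
mutable struct SpecsBroken
    attractors_template::Nothing
end
# s = SpecsBroken(nothing)
# s.attractors_template = [[1.0, 2.0]]  # MethodError: cannot `convert` to Nothing

# Fix: annotate the field with a Union that includes the target type.
mutable struct SpecsFixed
    attractors_template::Union{Nothing, Vector{Vector{Float64}}}
end

s = SpecsFixed(nothing)
s.attractors_template = [[1.0, 2.0], [3.0, 4.0]]  # now allowed
```

The cost of the `Union` field is a small amount of type instability on access, which is usually negligible for a configuration struct.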
(I will update the documentation as soon as we solve this)
Please move the docstring of the function

```julia
function ClusteringSpecs(featurizer::Function;
    attractors_template::Union{AbstractDataset, Dict, Nothing, Vector}=nothing,
    clust_method_norm=Euclidean(), clustering_threshold = 0.0, min_neighbors = 10,
    clust_method = clustering_threshold > 0 ? "kNN_thresholded" : "kNN",
    rescale_features=true, optimal_radius_method="silhouettes",
)
```

to be the docstring of the type definition, and then move this function directly after the type definition. It makes for much clearer source code. At the moment you need to scroll several hundreds of lines of code to find the docstring of the type `ClusteringSpecs`.
Let's rename `ClusteringSpecs` to `ClusteringConfig`. Most Julia packages use the word "Config" when defining configuration options, so this name is a bit more harmonious.
At the moment the usage of `cluster_datasets` is confusing... What does this function do?

```julia
function cluster_datasets(dataset::Union{AbstractDataset, Function},
        mapper::AttractorsViaFeaturizing; show_progress=true, N=1000)
    feature_array = extract_features(mapper, dataset; show_progress, N)
    if !isnothing(mapper.attractors_ic)
        attractors_template = integrate_ics(mapper, mapper.attractors_ic; show_progress=false)
        # errors: MethodError: Cannot `convert` an object of type
        # Vector{Dataset} to an object of type Nothing
        mapper.cluster_specs.attractors_template = attractors_template
    end
    class_labels, class_errors = classify_features(feature_array, mapper.cluster_specs)
    return class_labels, class_errors
end
```

It both creates the datasets and then clusters them? If so, then this is not good design. A function should have a single purpose, and this is the purpose conveyed by the name. We need to split the code so that the code that creates the features is separated from the code that clusters them.
Notice that it's okay that `cluster_datasets` maps datasets to features and then clusters them, but generating the datasets should happen in a separate function.
Okay, seems like we need a rework here. The `AttractorsViaFeaturizing` does NOT need to store all datasets/trajectories produced. Yet, that's exactly what happens in this function:

```julia
function integrate_ics(mapper::AttractorsViaFeaturizing, ics::AbstractDataset;
        show_progress = true)
    N = size(ics, 1) # number of actual ICs
    progress = ProgressMeter.Progress(N) # (was missing: `progress` was undefined below)
    trajectories = Vector{Dataset}(undef, N)
    Threads.@threads for i ∈ 1:N
        ic = ChaosTools._get_ic(ics, i)
        trajectories[i] = trajectory(mapper.ds, mapper.total, ic;
            Ttr = mapper.Ttr, Δt = mapper.Δt, diffeq = mapper.diffeq)
        show_progress && ProgressMeter.next!(progress)
    end
    return trajectories
end
```
We need to separate code into three parts:

1. Code that clusters a given set of features.
2. Code that maps datasets to features.
3. Code that integrates initial conditions into the trajectories `AttractorsViaFeaturizing` wants to do.

There is no reason for (1, 2, 3) to depend on each other. High level functions can call 2 or 3, and then pass the result into 1.
EDIT: This also means that `ClusteringConfig` should not store any featurizer of sorts.
Aaaaarg, the reason we have this "featurizer" as part of the `ClusterConfig` is because of these stupid templates, right? Because the templates are initial conditions, each leading to a given attractor, that you need to evolve and then map into features and then do a kNN of the distances of the features to the templates.
Grrr this templating business is soooo weird.
Well, here is an idea: we can completely drop this template version. If you think about it, the `AttractorsViaProximity` is a superior version to the template version anyways, no? If you have initial conditions you "know" go to a specific attractor, then you have the attractors. Then you can just take these attractors and use the proximity version?
I am trying to think of scenarios where the proximity version wouldn't work. I guess one case would be the identical coupled oscillators we discussed, where you may want to group different attractors into one because they are "practically the same". But in this case, would it really help you to have the templates? I mean, wouldn't the featurizing and clustering by itself lead to the same result...?
> Aaaaarg, the reason we have this "featurizer" as part of the `ClusterConfig` is because of these stupid templates, right? Because the templates are initial conditions, each leading to a given attractor, that you need to evolve and then map into features and then do a kNN of the distances of the features to the templates.
The `featurizer` is part of `ClusterConfig` because it is needed to extract the features before clustering. We do this either when we integrate the trajectories (ics case) or when we receive the trajectories/attractors (the new case we're implementing). For the latter, we had decided to use only `ClusterConfig` and not `AttractorsViaFeaturizing`, so the `featurizer` needed to be inside `ClusterConfig`. So for the case of attractors, the user passes `ClusterConfig` to `cluster_datasets`, which extracts the features and clusters them (basically calling items 2 and 1, which you mentioned). For the case of ics, it integrates and extracts (item 3) and then clusters (1). Do you think this should be changed?
But indeed, the templates are the issue complicating the design. Because then they need the parameters for the integration (which are in `AttractorsViaFeaturizing`).
> Well, here is an idea: we can completely drop this template version. If you think about it, the `AttractorsViaProximity` is a superior version to the template version anyways, no? If you have initial conditions you "know" go to a specific attractor, then you have the attractors. Then you can just take these attractors and use the proximity version?
I agree that using the supervised (template) version is just worse than `AttractorsViaProximity` in the case we had implemented so far, where the user passes the initial conditions for the attractor. I agree we could just remove it, and this would simplify the design a lot.
However, maybe passing templates as features directly, instead of initial conditions, could be useful. For instance in the networks of oscillators, the user may not know the attractors themselves, but only their values of phase synchronization (e.g. R = 0.2 and R = 0.9).
By passing the quantifier for phase synchronization in the features, bSTAB would already cluster the attractors based on this, but would find the values of phase synchronization by itself. So an alternative is to pass the user-known values, and then we cluster based on the distance to these values using kNN. This is very easy to implement with the code we have already written. The user would pass `attractors_template` to `ClusterConfig`, which would be directly used as templates by the supervised method (which would no longer receive ics to integrate).
> However, maybe passing templates as features directly, instead of initial conditions, could be useful
I agree with this idea and I like it a lot! Let's allow a template version where the user gives features directly as templates. This means we drop the current templating version that complicates the design so much because of its integration needs.
> For the case of ics, it integrates and extracts (item 3) and then clusters (1). Do you think this should be changed?
To clarify: the function `cluster_datasets` has no reason to do any integration. It should take three arguments:

1. the datasets,
2. the featurizer function,
3. the `ClusterConfig` structure, which should only have information for the clustering.

`AttractorsViaFeaturizing` has inside it 2, 3 and also info on how to produce 1 via integration.
I think now we have to face the fact that we cannot use the function `cluster_datasets` for the `AttractorsViaFeaturizing`, because the latter doesn't produce all the datasets and then featurize them. Instead, it produces the features directly. Hence, this is the function we need:

```julia
function cluster_datasets(datasets, featurizer, clusterconfig)
    features = featurizer.(datasets)
    return cluster_features(features, clusterconfig)
end
```
`cluster_features(features, clusterconfig)` is the function that is also called by `AttractorsViaFeaturizing`.
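The data flow of this separation can be sketched with stand-in implementations (everything here is a hypothetical toy: `simple_cluster_features` is a nearest-template placeholder, not the real DBSCAN/kNN clustering of the package):

```julia
# Toy stand-in for the real clustering step: label each feature
# by the index of its nearest template (squared Euclidean distance).
simple_cluster_features(features, templates) =
    [argmin([sum(abs2, f .- t) for t in templates]) for f in features]

# The proposed separation: (2) datasets -> features, then (1) features -> labels.
# No integration is involved anywhere.
function toy_cluster_datasets(datasets, featurizer, templates)
    features = featurizer.(datasets)                      # part 2: featurize
    return simple_cluster_features(features, templates)   # part 1: cluster
end

datasets = [[1.0, 1.0], [1.1, 0.9], [5.0, 5.0]]
featurizer(d) = [minimum(d), maximum(d)]
templates = [[1.0, 1.0], [5.0, 5.0]]
labels = toy_cluster_datasets(datasets, featurizer, templates)  # [1, 1, 2]
```

The point is that `toy_cluster_datasets` is a trivial two-step composition, while the clustering step alone is what a mapper like `AttractorsViaFeaturizing` would reuse after producing features its own way.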
I hope that this clarifies things now. If not, it is probably more efficient for both of us to have a quick call :D
This isn't ready for review, right? E.g., have you done the change where the templates are already features? (Also, now we should rename `attractor_templates` to just `templates`.)
> This isn't ready for review, right? E.g., have you done the change where the templates are already features? (Also, now we should rename `attractor_templates` to just `templates`.)
Should be. Yes, `cluster_datasets` now does two clusterings:

1. Unsupervised, when no `templates` are passed. Via `cluster_features_clustering` (maybe `cluster_features_unsupervised` is a better name?).
2. Supervised, when `templates` are passed as a vector of features. Via `cluster_features_distances` (maybe `cluster_features_supervised` is a better name?).

Some tests:
```julia
using DynamicalSystems
using OrdinaryDiffEq

# Lorenz
F = 6.886; G = 1.347; a = 0.255; b = 4.0
ds = Systems.lorenz84(; F, G, a, b)
u0s = [
    1 => [2.0, 1, 0],   # periodic
    2 => [-2.0, 1, 0],  # chaotic
    3 => [0, 1.5, 1.0], # fixed point
]
diffeq = (alg = Vern9(), reltol = 1e-9, abstol = 1e-9)
M = 50; z = 3
M = 10; z = 3
xg = yg = zg = range(-z, z; length = M)
grid = (xg, yg, zg)
function featurizer(A, t)
    g = exp(genentropy(A, 0.1; q = 0))
    return [g, estimate_period(A[:, 1], :zc)]
end
clstrspcs = ClusteringConfig(; min_neighbors = 1)
mapper = AttractorsViaFeaturizing(ds, featurizer, clstrspcs)
basins, attractors = basins_of_attraction(mapper, grid; show_progress = true)
a1 = attractors[1]
a2 = attractors[2]
a3 = attractors[3]
# [1, 1, 1, 1, 2, 2, 3, 3] with noise
attractors = Dict(
    1 => a1, 2 => Dataset(Matrix(a1) .+ rand()), 3 => Dataset(Matrix(a1) .+ rand()),
    4 => Dataset(Matrix(a1) .+ rand()), 5 => a2, 6 => Dataset(Matrix(a2) .+ rand()),
    7 => Dataset(Matrix(a3) .+ rand()), 8 => Dataset(Matrix(a3) .+ rand()),
)

## Unsupervised
clstrspcs = ClusteringConfig(; min_neighbors = 1)
clust_labels, clust_errors = cluster_datasets(attractors, featurizer, clstrspcs)

## Supervised
templates = [featurizer(a, []) for a ∈ [a1, a2, a3]]
clstrspcs = ClusteringConfig(; templates, min_neighbors = 1)
clust_labels, clust_errors = cluster_datasets(attractors, featurizer, clstrspcs)

# Artificial
## Unsupervised version
a1 = Dataset([1 1])
a2 = Dataset([2 2])
a3 = Dataset([3 3])
attractors = Dict(1 => a1, 2 => a1, 3 => a1, 4 => a1, 5 => a2, 6 => a2, 7 => a3, 8 => a3) # [1,1,1,1,2,2,3,3]
function featurizer(A, t)
    return [minimum(A[:, 2]), maximum(A[:, 2])]
end
clstrspcs = ClusteringConfig(; min_neighbors = 1)
clust_labels, clust_errors = cluster_datasets(attractors, featurizer, clstrspcs)
# attractors = Dict(1 => a1, 2 => a1) # [1,2]
# clust_labels, clust_errors = cluster_datasets(attractors, featurizer, clstrspcs)

## Supervised
templates = [featurizer(a, []) for a ∈ [a1, a2, a3]]
clstrspcs = ClusteringConfig(; templates, min_neighbors = 1)
clust_labels, clust_errors = cluster_datasets(attractors, featurizer, clstrspcs)
```
This is done as far as I can tell. Only the following things remain:

1. Re-write tests for the new interface that uses `ClusterConfig`.
2. Write one new test explicitly for the `cluster_features` function.
3. Add the silhouette information in the `ClusterConfig` docstring.

@KalelR can you please do these things to wrap things up?
@KalelR I have two requests/tips for you if you please:

- Please work on a branch in the `origin` of ChaosTools instead of using your own fork. That's exactly the reason you are part of the org ;)

I realized there is no reason to provide a function `cluster_datasets`. Simply because it is the two-liner:
```julia
julia> featurizer(A, t) = [minimum(A[:, 1]), maximum(A[:, 2])]
featurizer (generic function with 1 method)

julia> x = [Dataset(rand(100, 3)) for i in 1:3]
3-element Vector{Dataset{3, Float64}}:
 3-dimensional Dataset{Float64} with 100 points
 3-dimensional Dataset{Float64} with 100 points
 3-dimensional Dataset{Float64} with 100 points

julia> featurizer.(x, 2)
3-element Vector{Vector{Float64}}:
 [0.0032883180061905604, 0.9732324878658828]
 [0.011357202225453333, 0.9866541355882478]
 [0.01981547884948287, 0.9922961052874202]

julia> cluster_features(ans, ClusterConfig(...))
```
Hi! I have been using the function `cluster_datasets` to cluster attractors. It worked fine. Can you please maintain this function? It seems useful. If not, I will rewrite the code.
`cluster_datasets` is just

```julia
features = [featurizer(d, t) for d in datasets]
cluster_features(features, ...)
```

so I would still argue that we shouldn't offer this function. All functions exported by the API need to be tested and maintained, and I don't see particular benefit in adding this one to the list given how easy it is to do on the user side.
Understood, I just have to replace these two lines in the code.
> @KalelR can you please do these things to wrap things up?

Yes, sure! Sorry, I'd been super busy the last weeks but now I can come back.
> @KalelR I have two requests/tips for you if you please

Got it, thanks!
Ok. I pushed the code to my fork and also to a branch in the origin of ChaosTools.
1. Re-write tests for the new interface that uses `ClusterConfig`: done, but there are still two issues, unfortunately:
   - The `fs` are not matching well for the duffing oscillator for the `proximity` and `recurrences` methods. No clue why, I did not change anything related. A possible fix is just to update the `expected_fs` values.
   - `@test all(x -> x ∈ known_ids, bids)` does not work for the Henon case in the supervised clustering method, and that makes sense. Basically, the clustering method correctly finds two clusters, one corresponding to the real attractor and another to infinity, and so assigns them labels `1` and `2`. But we assign a priori label `-1` to infinity since we know it's not a real attractor. Then the comparison of course does not match, but nor should it: the clustering algorithm does not know that, it just sees the infinity attractor as a `[200, 200]` point. We can just label infinity as `1` or `2`, or we could implement a function that re-assigns the label for infinity as `-1` after the clustering algorithm. This is a bit of work, and I don't see much benefit for the user.
2. Write one new test explicitly for the `cluster_features` function: done. I wrote an artificial test, as I think it's clearer, but I can also switch to a test using attractors from some system.
3. Add the silhouette information in the `ClusterConfig` docstring: done. I added the information to the docstring of `optimal_radius_dbscan_silhouette` and referred to it in the docstring of `ClusteringConfig`, as I think then the information becomes more organized.
About the function `cluster_datasets`: it is just a two-liner, but would it not be convenient for the users to have it available? I wrote it in the artificial test; we could just export it if you agree.
I get much more errors when I run this on my machine :( :( :(
But something seems to be wrong with the packages I have. I am getting errors in the root finding of Roots.jl for example.
The way we currently did threading was unsafe. We cannot `@threads` over the trajectory of an integrator, because we would be threading over the same references to the same object. I've removed it for now; it can be fixed after DynamicalSystems.jl 3.0.
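The general pattern for fixing this kind of race (a hedged sketch with a hypothetical toy "integrator", not the ChaosTools code) is to give each thread its own copy of any mutable worker object instead of sharing one reference:

```julia
using Base.Threads

# Hypothetical mutable "integrator": threads must not share one instance,
# because each call mutates its internal state.
mutable struct ToyIntegrator
    state::Float64
end
step!(integ, x) = (integ.state = x^2; integ.state)

function threaded_map(xs)
    results = Vector{Float64}(undef, length(xs))
    # one private integrator per thread instead of one shared reference
    integs = [ToyIntegrator(0.0) for _ in 1:nthreads()]
    @threads for i in eachindex(xs)
        results[i] = step!(integs[threadid()], xs[i])
    end
    return results
end

threaded_map([1.0, 2.0, 3.0])  # [1.0, 4.0, 9.0]
```

Caveat: on recent Julia versions tasks can migrate between threads, so per-task copies (e.g. `deepcopy`-ing the integrator per chunk of work) are safer than `threadid()` indexing; the sketch only shows the idea of not sharing mutable state.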
@KalelR everything is fixed now. I have one more request: I'd like you to change the way `templates` are provided in `ClusterConfig`. I think they should be like `attractors` are in `AttractorsViaProximity`: a dictionary mapping attractor labels to templates.
Ok, now `ClusteringConfig` receives templates as a dict of labels to templates. The templates are features, stored as vectors. Internally, we do need to convert the templates to a matrix for kNN. Since this is just for kNN, I left the transformation into a matrix in the corresponding function, `cluster_features_distances`, but I can also do that when creating the struct.
The cluster labels are given according to the order of the labels in the `templates` dictionary.
> The cluster labels are given according to the order of the labels in the templates dictionary.

Wait, you mean the cluster labels are exactly the same as the template labels, right? In the supervised version, each feature gets the label of the cluster it is nearest to. The label of the cluster is the label given by the `templates` dict, right?
> The label of the cluster is the label given by the templates dict, right?
Yes, but indirectly. The kNN algorithm receives just a matrix with the templates and a matrix with the features. It then labels the features according to the column (or row, not sure off the top of my head) of the nearest template. So the labelling here depends on the ordering of the templates in the matrix.
This matrix is constructed from the templates dict via `templates = float.(reduce(hcat, [cluster_specs.templates[i] for i ∈ 1:length(cluster_specs.templates)]))`. This orders the columns according to the labels of the dict, from 1 to the number of templates. So the matrix is ordered according to the labels of the template dict, which need to range from 1 to the number of templates.
This makes it so that the cluster labels are the same as the labels of the template dict. But the dict labels need to be numbers in that range, which is maybe not ideal. If we were to allow arbitrary labels, we would still have to create the matrix, and would need to keep track of which columns of the matrix correspond to each label. This is a bit of work, which might not be worth the benefit.
> the labels of the template dict, which need to range from 1 to the number of templates.

Yeah, this is the problem. We can't assume this. That's exactly why we are using a `Dict`: so that we can have an arbitrary mapping of labels to attractors. We need this to implement the "Monte Carlo basin bifurcation" continuation or however it is called, and also for the optimized method I came up with for getting basin bifurcations with the recurrences method.
I guess you need to make one more dictionary mapping the indices 1 to `length(templates)` to the keys of the templates, and in the last step use that dictionary to remap all labels to the ones enforced by the templates dictionary.
So that the cluster labels are the template labels given by the user, right?
exactly
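That remapping could look roughly like this (a sketch with hypothetical names, not the actual package code): fix an ordering of the template keys once, build the kNN matrix in that order, and translate the returned column indices back to the user's labels at the end.

```julia
# Templates with arbitrary (non-integer) labels.
templates = Dict("periodic" => [0.1, 0.9], "chaotic" => [0.8, 0.2])

# Fix an ordering once; matrix column i <-> ordered_keys[i].
ordered_keys = collect(keys(templates))
template_matrix = reduce(hcat, (float.(templates[k]) for k in ordered_keys))

# Suppose kNN labelled each feature by its nearest template column:
knn_column_labels = [1, 2, 2, 1]

# Last step: remap column indices to the user-provided labels.
index_to_label = Dict(i => k for (i, k) in enumerate(ordered_keys))
user_labels = [index_to_label[i] for i in knn_column_labels]
```

Since the ordering is fixed by `ordered_keys` rather than by the numerical value of the keys, the labels can be strings, symbols, or any valid dictionary key type.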
Also, let's please ensure that the labels can be of arbitrary type. There is no reason to enforce the labels to be integers, they might as well be strings, or anything that can be viably used as a dictionary key type. Might as well put a test for this in the test suite.
> Also, let's please ensure that the labels can be of arbitrary type. There is no reason to enforce the labels to be integers, they might as well be strings, or anything that can be viably used as a dictionary key type. Might as well put a test for this in the test suite.

Yup, ok.
Example now:
Can you please put this REPL example in the tests :) Then I'll merge!
It's the example in `clustering_tests.jl` :)
@KalelR notice that you have not changed the documentation string to reflect the changes; templates is still stated as a vector of vectors. Also be sure you move the silhouette stuff!
(And remember my advice: first change the docstring, then change the function. Since the docstring reflects the public API, it is that which should guide you on what the source code will do.)
> @KalelR notice that you have not changed the documentation string to reflect the changes; templates is still stated as a vector of vectors. Also be sure you move the silhouette stuff!

Yup, sorry. I'll start with the docstring from now on!
Now I noticed one issue: currently `basins_of_attraction` fails for non-number labels given in `templates`, because `basins` is initialized as an `Int` matrix. I think we can either:

1. change `basins` to an arbitrary-type matrix; in the case of string template labels, it would be a matrix of strings;
2. keep `basins` as an `Int` matrix: then we change the labels to numbers and return an additional dictionary with the correspondence between the string and number labels.

Or is there a better solution?
> currently basins_of_attraction fails for non-number labels given in templates

Please do notice we still need to fix this. Should be quick to implement; I'd just like to see what you think is the best way.
> Please do notice we still need to fix this. Should be quick to implement, I'd just like to see what you think is the best way.

Ah sorry, can you please open a new issue so we don't forget?
The solution is to keep `basins` as an `Int` matrix and do a final replace after the function has finished computing, using a dictionary mapping labels to integers that we create at the start of the function. The recurrences method relies heavily on integers, I think.
EDIT: Actually no, just initialize the `basins` array with the key type of the dictionary. The recurrences method has key type `Int`, so it's all good.
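The EDIT's suggestion in a minimal form (helper name `make_basins` is hypothetical): initialize the `basins` array with the key type of the templates dictionary, so integer-keyed methods keep working unchanged while string-keyed templates yield a string-valued basins matrix.

```julia
templates_str = Dict("periodic" => [0.1], "chaotic" => [0.9])
templates_int = Dict(1 => [0.1], -1 => [0.9])

# Initialize basins with the label (key) type of the dictionary.
make_basins(templates, dims...) = Array{keytype(templates)}(undef, dims...)

basins_s = make_basins(templates_str, 3, 3)  # element type String
basins_i = make_basins(templates_int, 3, 3)  # element type Int, as before
```

This keeps option 1 from the comment above without special-casing: the existing integer-based code path is just the `keytype == Int` instance of the generic one.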
Idea is to implement a function that can be used (i) by the user to cluster attractors and (ii) by bSTAB. There's some code repetition because the initial conditions need to be integrated and have their features extracted in the same loop (so we don't need to keep all trajectories), but the attractors of course don't. So the path for each type is:

(i) initial conditions -> integrate into trajectories and extract into features in the same loop -> cluster;
(ii) attractors (or trajectories in general that are already saved) -> extract into features -> cluster.

So the only part that is used identically by both is the clustering.
This is still not finished, but I decided to open the PR already to ask for your advice. What I did so far is:

- Implemented `cluster_datasets` and the accompanying `extract_features`. The latter works exactly as in the case for receiving initial conditions, except of course it does not integrate any trajectories.
- Used `cluster_datasets` in `basins_fractions`.
- Added a comment in `basins_fraction` just to explain what `N` is (unrelated to the PR itself, but helpful in the code).
- Added a check in `classify_features_clustering` for the case the number of features (i.e. of ics or of attractors) is smaller than the dimension of the feature space (i.e. how many features are calculated). This is because dbscan errors otherwise, and it is not necessarily only for the case here where we input attractors; it could also be a source of error for inputting ics. It's a very unlikely case, and I only noticed it could happen now while testing this new function.

The problem is that, currently, the function receives the mapper `AttractorsViaFeaturizing`, which requires a dynamical system. But requiring a dynamical system makes no sense when we're inputting just the attractors, as we just want to featurize and then cluster them, not integrate anything. An exception would be if the user wants the supervised method and would, additionally to the attractors, also input ics as templates for the clustering. Then the algorithm would integrate these initial conditions and compare them with the input attractors, and therefore the dynamical system would of course be required. But for the unsupervised case I'm not sure what to do. We could:

- change `AttractorsViaFeaturizing` so that it can be created without inputting a dynamical system (doesn't seem intuitive);
- create a new struct `ClustersViaFeaturizing` or something (some more code repetition);
- or something better that I haven't thought of.

What do you think?
I also haven't tested the current code extensively because the `attractor_mapping_tests` are failing even in the master branch for me. Is this also happening on other machines?