KalelR closed this issue 2 years ago.
I made some tests for the different test systems, comparing
The systems were
The figures are here. I can of course also share the code.
The problems seem to arise only with Lorenz84. They seem ultimately to be due to our estimation of `optimal_radius`. There, `optimal_radius` can be large in comparison to the interval spanned by the smaller feature. As a consequence, the differences in parameter space along this smaller-interval feature get ignored. If this is the feature that separates clusters, then the two clusters will be grouped together. In the Lorenz system, `optimal_radius` jumps from 2 to 4 when 5000 ics are used (the attractor entropy reaches higher values). This jump is big enough for the two clusters to become grouped, apparently. When we normalize, the two features contribute more equally, and this problem does not seem to occur.

Another option is to choose `optimal_radius` via the average silhouette value, instead of using the minimum value. Apparently, with the minimum value, `optimal_radius` is estimated too high, making clusters group together. With the average, the smaller values keep the clusters separated. There is quite a big interval of `optimal_radius` which leads to 3 clusters; the estimation via the minimum leads to quite a large value. For the other systems (2-5), all options work well. The only observation for the other systems is that increasing to 5000 ics changes the fraction `fs` values, which is expected. But options 1-2 lead to the same values.
So it seems that to solve the Lorenz84 problems we can use the average with normalization. Instead of normalizing features, it might be better to use another metric that solves the Euclidean distance issue. What do you think?
If this is ok, I can update the docstring and the tests tomorrow.
> For non-normalized features, the optimal radius increases when increasing the number of points.
This is very confusing for me. How can it be...? The optimal radius should depend on the distribution of distances in the feature space, so why would this change when increasing the amount of points? It should only be more precisely identified, but I must admit it is confusing to me that this optimal radius increases.
> This is very confusing for me. How can it be...? The optimal radius should depend on the distribution of distances in the feature space, so why would this change when increasing the amount of points? It should only be more precisely identified, but I must admit it is confusing to me that this optimal radius increases.
Yeah, I should have been more precise: it is not always the case that increasing the number of points will increase the optimal radius. In fact, for the other systems I tested, in which the clustering is easier, this did not occur. This seems to occur in this more complicated clustering scenario of Lorenz.
I think the reason is ultimately the difference between the scales of the features, and the use of Euclidean distance both for DBSCAN itself and for the silhouette values. It then becomes difficult and non-intuitive to understand what is going on for non-normalized features. I added some figures to understand this here, including the average silhouette for each tested radius. I can explain them in more detail if you want, but I'll give the short version. We can see the problem in this figure:
1000 points lead to a decent clustering, with 3 clusters; 5000 points group two of the clusters together. This occurs at only a slightly bigger value of `optimal_radius`, and the silhouette values are actually slightly bigger (indicating "better" clustering) in this case. Two changes occur when we go to 5000: the top cluster is grouped together, and more points, spread along the x-axis, are added to the bottom-right cluster. Since the x-axis spans a much larger interval, and the silhouette compares Euclidean distances, this addition of points spread along the x-axis outweighs the addition of all the top-cluster points, which are separated along the y-axis, whose distance is not that big. Adding a few points spread over the large-spanning axis can outweigh adding lots of points spread over the small-spanning axis.
This is not a problem if the features span a similar interval, e.g. if we normalize them. Notice that increasing to 5000 here even decreases `optimal_radius` (it's not very important why, but I think it's because for 1000 points there is a semi-isolated point in the bottom-right cluster, which needs a bigger radius to be reached; for 5000 there are points nearby, so the radius can be smaller).
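The distance imbalance described above can be seen with toy numbers (assumed values for illustration, not the actual Lorenz84 features):

```julia
using LinearAlgebra

# Feature 1 (x) spans ~100 units, feature 2 (y) spans ~1 unit.
a = [50.0, 0.0]; b = [50.0, 1.0]    # two points separated only along y
c = [0.0, 0.0];  d = [100.0, 0.0]   # same cluster, but spread along x

norm(a - b)  # 1.0   -> the separation that actually distinguishes clusters
norm(c - d)  # 100.0 -> intra-cluster spread along the wide axis dominates
```

A radius large enough to connect points spread along x easily swallows the small y-separation between the clusters.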
Alright, I understand. Okay, let's go ahead and normalize the features to [0, 1].
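For reference, a minimal sketch of such a [0, 1] normalization per feature (assuming features are stored as a d × N matrix, one column per initial condition; the helper name is hypothetical):

```julia
# Rescale each feature (row) to [0, 1], so that features with large ranges
# no longer dominate the Euclidean distances used by DBSCAN.
function normalize_features(features::AbstractMatrix)
    mins = minimum(features, dims = 2)
    maxs = maximum(features, dims = 2)
    ranges = maxs .- mins
    ranges[ranges .== 0] .= 1   # constant feature (e.g. a fixed point): avoid 0/0
    return (features .- mins) ./ ranges
end
```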
By the way, are you sure the reported problems are not coming from the incredibly large spread of the cluster in the bottom left? How do you produce this cluster? All four possibilities I used for featurizing did not have any spread for the cluster corresponding to the fixed point. These are the four functions I used:
```julia
function feat1(A, t)
    x = A[:, 1]
    return [minimum(x), std(x)]
end

function feat2(A, t)
    x, y, z = columns(A)
    return [std(x), std(y)*std(z)]
end

function feat3(A, t)
    g = exp(genentropy(A, 0.1; q = 0))
    x = minimum(A[:, 1])
    return [g, x]
end

function feat4(A, t)
    g = exp(genentropy(A, 0.1; q = 0))
    p = estimate_period(A[:, 1], :zc)
    p = isnan(p) ? 0 : p
    return [g, p]
end
```
Also, this discussion gave me an idea for an alternative algorithm for getting the optimal radius. Perhaps if you are available @KalelR , we should discuss this via a short video call?
> By the way, are you sure the reported problems are not coming from the incredibly large spread of the cluster in the bottom left? How do you produce this cluster? All four possibilities I used for featurizing did not have any spread for the cluster corresponding to the fixed point.
I used `feat4` actually. My code is here; the images are attached.
Do you get one point even if the number of initial conditions is increased?
> Also, this discussion gave me an idea for an alternative algorithm for getting the optimal radius. Perhaps if you are available @KalelR , we should discuss this via a short video call?
Nice, sure! I'm available today anytime now.
I am available today at 7pm-8pm CET or so. Will send you a zoom link.
Hey @Datseris I need to meet my advisor at 19:30 now. Can we shift the conversation to tomorrow afternoon?
no problem. unfortunately i am fully booked tomorrow and friday, but no stress, we can see for sunday otherwise next week! BTW, do you use the Julia Slack? We can chat like that there to not overburden the discussion here with off topic stuff. Otherwise we can switch to emails!
Oh, let's meet on sunday or next week then. Sure, I'll start to use Slack :)
> I used `feat4` actually. My code is here. The images are:
Hi, very nice job! If I remember well, the cluster on the left corresponds to a fixed-point attractor. This is why the estimation of the period fails in `feat4` and you have so much spread in the values. It is quite tough to find a feature function for everything.
So, normalizing the features seems to make the algorithm work fine, though not perfectly: it seems to always find 3 attractors in Lorenz84, but also sometimes mis-identifies some points as outliers (especially those due to the fixed point). Sometimes it perfectly identifies the clusters, but it's hard to tell when this occurs.
I tested two other possibilities for improving the algorithm: (i) estimating `optimal_radius` by an a priori method (the elbow method), without having to optimize it by the iterative method we are using; and (ii) a clustering method (HDBSCAN) that does not need `optimal_radius` at all. But none are as good as the current version. The results are here under "Several seeds for normalized and with average". I describe the methods below also.
Described clearly here: https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd, and in "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise" and "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN".
The idea is to define k = minpts, then find for each point the average distance to its k nearest neighbors. Plot the sorted distances, and find the point of highest derivative; the optimal epsilon is the distance at that point. This is the red dot in the figure below.
The method seems to work OK, but not as well as the iterative silhouette method. It is much faster, though. For some reason, the problem is that the optimal radius (the distance) comes out too small, and so there usually are quite a few outliers.
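A brute-force sketch of this elbow estimate (stdlib only; in the package one would reuse the existing k-d tree instead, and the function name here is hypothetical):

```julia
using Statistics, LinearAlgebra

# k-dist "elbow" estimate for the DBSCAN radius: average distance of each
# point to its k nearest neighbors, sorted over all points; the radius is
# the distance at the largest jump of that sorted curve.
function elbow_radius(features::AbstractMatrix, k::Int = 4)
    N = size(features, 2)
    avg_kdist = map(1:N) do i
        d = [norm(features[:, i] - features[:, j]) for j in 1:N if j != i]
        mean(partialsort!(d, 1:k))   # mean distance to the k nearest points
    end
    sort!(avg_kdist)
    return avg_kdist[argmax(diff(avg_kdist))]
end
```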
There is the HDBSCAN algorithm, which builds on DBSCAN. I didn't take the time yet to really understand it, but it is supposedly an implementation of DBSCAN that considers several radii. Ultimately, it doesn't need the radius as a parameter. Because of this, it is very good at identifying clusters of varying density (which may not necessarily be a good feature for our purposes).
The important parameter is then basically the minimum number of neighbors. For every value of the minimum number of neighbors that I tested, the results were terrible: the worst clustering so far, finding far too many groups. I think this is because the algorithm finds clusters in data of varying density. Since we don't have an enormous number of points, some of the features become isolated, either by themselves or in small groups, and HDBSCAN considers them as clusters.
Fortunately, there is a recent paper (https://arxiv.org/abs/1911.02282) that approaches this problem by doing a hybrid approach with DBSCAN: apply HDBSCAN and then group together any clusters that are within a certain distance. I therefore applied that, with the distance being the optimal radius found using the elbow method. This improves the method a lot! Downside of course is we get the epsilon parameter back, though it is not as relevant as before, in a sense.
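As a sketch of that post-processing step (a hypothetical brute-force helper, not the paper's implementation): merge any two clusters whose closest pair of points lies within `ϵ`.

```julia
using LinearAlgebra

# Merge clusters whose minimum inter-cluster distance is ≤ ϵ.
# points is d × N; labels[i] is the cluster label of column i.
function merge_close_clusters!(labels::Vector{Int}, points::AbstractMatrix, ϵ)
    merged = true
    while merged
        merged = false
        ls = unique(labels)
        for la in ls, lb in ls
            la < lb || continue
            dmin = minimum(norm(points[:, i] - points[:, j])
                           for i in eachindex(labels) if labels[i] == la
                           for j in eachindex(labels) if labels[j] == lb)
            if dmin ≤ ϵ
                labels[labels .== lb] .= la   # relabel lb as la
                merged = true
                break                         # label set changed: rescan
            end
        end
    end
    return labels
end
```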
Applying the same iterative optimization we used before, now for the minimum number of neighbors, makes the method quite good. But I'm still not sure if it is better than the original DBSCAN with epsilon optimization. Maybe for another system.
(quickly letting you know that I did eye surgery and will take some time until I can see code again... But I have saved this in my todo :) )
> (quickly letting you know that I did eye surgery and will take some time until I can see code again... But I have saved this in my todo :) )
Oh, hope everything is ok! Yes, we can resume work on this when you can, no hurry. Wish you a good recovery!
i'm back, i've sent a message on slack.
Now the code should be working. Changes were basically
Test summary is now on my machine:
There are some further improvements that could be made:

- Moving the `Clustering.jl` code into the clustering folder. I tried doing it, but dbscan was failing and at the time I didn't want to spend too long trying to fix it.
- Improving the guesses for `optimal_radius`. Currently, I think the algorithm would fail if the system has one fixed point (may be trivial, but the code should work) because of the way the radii are guessed:

```julia
feat_ranges = maximum(features, dims=2)[:,1] .- minimum(features, dims=2)[:,1];
ϵ_grid = range(minimum(feat_ranges)/200, minimum(feat_ranges), length=200)
```

Also, we are using 200 guesses. Maybe a smarter way could use fewer and speed up the code (which can take some time if there are lots (>1000) of initial conditions). I am guessing that an idea based on the elbow method, or on George's idea, could work.
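One possible guard for the degenerate fixed-point case (a hypothetical sketch, not what the code currently does): if the smallest feature range is zero, the whole ϵ grid collapses to zero, so fall back to a small positive radius.

```julia
# Guess a grid of DBSCAN radii from the feature ranges, guarding against
# all-identical features (e.g. all ics converging to a single fixed point).
function guess_ϵ_grid(features::AbstractMatrix; L = 200)
    feat_ranges = maximum(features, dims = 2)[:, 1] .- minimum(features, dims = 2)[:, 1]
    r = minimum(feat_ranges)
    r == 0 && (r = 1e-3)   # hypothetical fallback value for a zero range
    return range(r / L, r, length = L)
end
```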
wait the proximity and recurrences tests were passing before :D what happened now?
> wait the proximity and recurrences tests were passing before :D what happened now?
Hmm, they weren't for me. I pulled from master and ran the tests before modifying anything, and these errors on the Duffing oscillators were there already. I just ran the tests again on master to confirm. The modifications I made shouldn't affect the other methods anyway. Are there modifications outside master that I missed?
I confirm that the tests also fail on master in my machine. I will look into it.
It seems that the expected basin fractions values are slightly different on my machine: (0.509, 0.491) instead of (0.511, 0.489). It can be due to the RNG. I don't know why this happens but the temporary fix is this in the Duffing test:
expected_fs_raw = Dict(2 => 0.509, 1 => 0.491)
Now all the tests pass on my machine. I don't know why the supervised method is failing in the tests here. There is something different about the knn algorithm... maybe the version is different?
`BoundsError: attempt to access 3-element Vector{Int64} at index [-1]` seems to be the error. Perhaps we should modify the code so that it skips the key `-1` in case the clustering fails...?
> `BoundsError: attempt to access 3-element Vector{Int64} at index [-1]` seems to be the error. Perhaps we should modify the code so that it skips the key `-1` in case the clustering fails...?
But the error is occurring in `knn`, not in our tests:
```
BoundsError: attempt to access 3-element Vector{Int64} at index [-1]
Stacktrace:
 [1] getindex
   @ ./array.jl:861 [inlined]
 [2] knn_point!(tree::KDTree{SVector{2, Float64}, Euclidean, Float64}, point::SVector{2, Float64}, sortres::Bool, dist::Vector{Float64}, idx::Vector{Int64}, skip::typeof(NearestNeighbors.always_false))
   @ NearestNeighbors ~/.julia/packages/NearestNeighbors/YCcEC/src/knn.jl:36
```
Tests pass, so this is good to go, right?
> Tests pass, so this is good to go, right?
Should be, but I haven't touched the supervised/knn part, so there seems to be some variability in that error. It's strange because it never occurred on my machine.
Alright, but you still need to:
Note that the changelog is for the users, so you don't have to mention changes in the tests there :D (it's okay now)
Added changes improving `optimal_radius_dbscan`; there seem to be no more errors now, and the estimation of attractors (clusters) is better.
It worked well for Lorenz84 (though it seems to be better if we normalize the features), giving `basins_fractions` near the correct values (the exact value depends on the transient, how many ics, which features, whether they are normalized or not...). But it did not work well if we restrict the initial conditions to be around the FP only. It should just give one cluster with all features at the FP, but that only works if all features are identical. The period-estimator feature gives some dispersion in the features, which leads to several clusters being found instead of one. I think this is the fault of two parts: the current guess of `optimal_radius` is not good for this case, and the silhouette method is a bit biased towards finding more than one cluster.
Worked for Henon also.
The tests on `attractor_mapping_tests.jl` are not always passing, as the algorithm is not finding the correct values perfectly, and sometimes finds some -1 outliers.

I made 3 changes. To explain them, first let me remind you of the idea behind the algorithm. The way `optimal_radius_dbscan` works is by first defining a range of possible radii for DBSCAN, then iterating over that range to calculate the quality of the clustering for each value, and then choosing the radius with the best clustering. The clustering quality is assessed by the silhouette quantifier.
From Wikipedia: "The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters." It is undefined for only one cluster, but the default value they use is 0 (the midpoint).
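To make the definition concrete, here is a stdlib-only sketch of the silhouette of a single point (an illustrative helper, not the package code):

```julia
using Statistics, LinearAlgebra

# s(i) = (b - a) / max(a, b), where a is the mean distance of point i to its
# own cluster and b is the smallest mean distance to any other cluster.
function silhouette_of(i, points::AbstractMatrix, labels)
    di(j) = norm(points[:, i] - points[:, j])
    a = mean([di(j) for j in axes(points, 2) if labels[j] == labels[i] && j != i])
    b = minimum(mean([di(j) for j in axes(points, 2) if labels[j] == l])
                for l in unique(labels) if l != labels[i])
    return (b - a) / max(a, b)
end
```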
The original way the authors approached this was to identify the radius that maximized the minimum silhouette (raising the lower bound the most). This seems very reasonable, but I found that on Lorenz84 it led to a very high `optimal_radius`, which made it group two clusters together. I changed it to instead maximize the average silhouette value, which made the `optimal_radius` values go down and improved the clustering (now it finds 3 clusters instead of two, much more robustly).
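The difference between the two selection rules in miniature (`sils_per_radius` is a hypothetical vector holding the per-point silhouette values computed at each candidate radius):

```julia
using Statistics

# Index of the best candidate radius under each rule.
best_by_min(sils_per_radius)  = argmax([minimum(s) for s in sils_per_radius])
best_by_mean(sils_per_radius) = argmax([mean(s)    for s in sils_per_radius])

# Example where the rules disagree: radius 1 has a great average but one bad
# point; radius 2 is mediocre but uniform.
sils = [[0.99, 0.99, -0.1], [0.31, 0.31, 0.30]]
best_by_min(sils)   # 2
best_by_mean(sils)  # 1
```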
Also, in the previous version the algorithm was ignoring the 0-value silhouettes when calculating the minimum. I admit I can't remember why I wrote it that way. The authors of bSTAB did not implement it in their code either. So I removed it. This fixes the errors that were happening (because sometimes all silhouette values were 0, and the code broke).
I also changed the value assigned to the silhouette when only one cluster is found. Previously, it was -2 (ignored the one-cluster solution). Now it is 0 (as the default in Wikipedia), so it seems to me it is more fair. But we might need to discuss this.
So there is still more to do: the algorithm can still be improved, I'd still like to run more tests, and of course we need to get `attractor_mapping_tests.jl` to pass. But this is the current version so far, which is already better than before! I can return to this tomorrow.