KalelR closed this issue 2 years ago.
I made some tests for the different test systems, comparing
The systems were
The figures are here. I can of course also share the code.
The problems seem to arise only with Lorenz84. They seem ultimately to be due to our estimation of `optimal_radius`. There, `optimal_radius` can be large in comparison to the interval spanned by the smaller feature. As a consequence, the differences in parameter space along this smaller-interval feature get ignored. If this is the feature that separates clusters, then the two clusters will be grouped together. In the Lorenz system, `optimal_radius` jumps from 2 to 4 when 5000 ics are used (the attractor entropy reaches higher values). This jump is big enough for the two clusters to become grouped, apparently. When we normalize, the two features contribute more equally, and this problem does not seem to occur.

Another option is to choose `optimal_radius` via the average silhouette value, instead of using the minimum value. Apparently, with the minimum value, `optimal_radius` is estimated too high, making clusters group together. With the average, the smaller values keep the clusters separated. There is quite a big interval of `optimal_radius` which leads to 3 clusters; the estimation via the minimum leads to quite a large value. For the other systems (2-5), all options work well. The only observation for the other systems is that increasing to 5000 ics changes the fraction `fs` values, which is expected. But options 1-2 lead to the same values.
So it seems that to solve the Lorenz84 problems we can use the average with normalization. Instead of normalizing features, it might be better to use another metric that solves the Euclidean distance issue. What do you think?
If this is ok, I can update the docstring and the tests tomorrow.
> For non-normalized features, the optimal radius increases when increasing the number of points.
This is very confusing for me. How can it be...? The optimal radius should depend on the distribution of distances in the feature space, so why would this change when increasing the amount of points? It should only be more precisely identified, but I must admit it is confusing to me that this optimal radius increases.
> This is very confusing for me. How can it be...? The optimal radius should depend on the distribution of distances in the feature space, so why would this change when increasing the amount of points? It should only be more precisely identified, but I must admit it is confusing to me that this optimal radius increases.
Yeah, I should have been more precise: it is not always the case that increasing the number of points will increase the optimal radius. In fact, for the other systems I tested, in which the clustering is easier, this did not occur. This seems to occur in this more complicated clustering scenario of Lorenz.
I think the reason is ultimately the difference between the scales of the features, and the use of Euclidean distance both for DBSCAN itself and for the silhouette values. It then becomes difficult and non-intuitive to understand what is going on for non-normalized features. I added some figures to understand this here, including the average silhouette for each tested radius. I can explain them in more detail if you want, but I'll give the short version. We can see the problem in this figure:
1000 points lead to a decent clustering, with 3 clusters; 5000 points group two of the clusters together. This occurs at only a slightly bigger value of `optimal_radius`, and the silhouette values are actually slightly bigger (indicating "better" clustering) in this case. Two changes occur when we go to 5000: the top cluster is grouped together, and more points, spread along the x-axis, are added to the bottom-right cluster. Since the x-axis spans a much larger interval, and the silhouette compares Euclidean distances, this addition of points spread along the x-axis outweighs the addition of all the top-cluster points, which are separated along the y-axis, whose distance is not that big. Adding a few points spread over the large-spanning axis can outweigh adding lots of points spread over the small-spanning axis.
This is not a problem if the features span a similar interval, e.g. if we normalize them. Notice that increasing to 5000 here even decreases `optimal_radius` (it's not very important why, but I think it's because for 1000 points there is a semi-isolated point in the bottom-right cluster, which needs a bigger radius to be reached; for 5000 there are points nearby, so the radius can be smaller).
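The distance imbalance described above can be seen with toy numbers (assumed values for illustration, not the actual Lorenz84 features):

```julia
using LinearAlgebra

# Feature 1 (x) spans ~100 units, feature 2 (y) spans ~1 unit.
a = [50.0, 0.0]; b = [50.0, 1.0]    # two points separated only along y
c = [0.0, 0.0];  d = [100.0, 0.0]   # same cluster, but spread along x

norm(a - b)  # 1.0   -> the separation that actually distinguishes clusters
norm(c - d)  # 100.0 -> intra-cluster spread along the wide axis dominates
```

A radius large enough to connect points spread along x easily swallows the small y-separation between the clusters.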
Alright, I understand. Okay, let's go ahead and normalize the features to [0, 1].
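For reference, a minimal sketch of such a [0, 1] normalization per feature (assuming features are stored as a d × N matrix, one column per initial condition; the helper name is hypothetical):

```julia
# Rescale each feature (row) to [0, 1], so that features with large ranges
# no longer dominate the Euclidean distances used by DBSCAN.
function normalize_features(features::AbstractMatrix)
    mins = minimum(features, dims = 2)
    maxs = maximum(features, dims = 2)
    ranges = maxs .- mins
    ranges[ranges .== 0] .= 1   # constant feature (e.g. a fixed point): avoid 0/0
    return (features .- mins) ./ ranges
end
```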
By the way, are you sure the reported problems are not coming from the incredibly large spread of the cluster in the bottom left? How do you produce this cluster? All four possibilities I used for featurizing did not have any spread for the cluster corresponding to the fixed point. These are the four functions I used:
```julia
function feat1(A, t)
    x = A[:, 1]
    return [minimum(x), std(x)]
end

function feat2(A, t)
    x, y, z = columns(A)
    return [std(x), std(y)*std(z)]
end

function feat3(A, t)
    g = exp(genentropy(A, 0.1; q = 0))
    x = minimum(A[:, 1])
    return [g, x]
end

function feat4(A, t)
    g = exp(genentropy(A, 0.1; q = 0))
    p = estimate_period(A[:, 1], :zc)
    p = isnan(p) ? 0 : p
    return [g, p]
end
```
Also, this discussion gave me an idea for an alternative algorithm for getting the optimal radius. Perhaps if you are available @KalelR , we should discuss this via a short video call?
> By the way, are you sure the reported problems are not coming from the incredibly large spread of the cluster in the bottom left? How do you produce this cluster? All four possibilities I used for featurizing did not have any spread for the cluster corresponding to the fixed point.
I used `feat4` actually. My code is here; the images are attached.
Do you get one point even if the number of initial conditions is increased?
> Also, this discussion gave me an idea for an alternative algorithm for getting the optimal radius. Perhaps if you are available @KalelR , we should discuss this via a short video call?
Nice, sure! I'm available today anytime now.
I am available today at 7pm-8pm CET or so. Will send you a zoom link.
Hey @Datseris I need to meet my advisor at 19:30 now. Can we shift the conversation to tomorrow afternoon?
no problem. unfortunately i am fully booked tomorrow and friday, but no stress, we can see for sunday otherwise next week! BTW, do you use the Julia Slack? We can chat like that there to not overburden the discussion here with off topic stuff. Otherwise we can switch to emails!
Oh, let's meet on sunday or next week then. Sure, I'll start to use Slack :)
> I used `feat4` actually. My code is here. The images are:
Hi, very nice job! If I remember well, the cluster on the left corresponds to a fixed-point attractor. This is why the estimation of the period fails in `feat4` and you have so much spread in the values. It is quite tough to find a feature function for everything.
So, normalizing the features seems to make the algorithm work fine, though not perfectly: it seems to always find 3 attractors in Lorenz84, but also sometimes mis-identifies some points as outliers (especially those due to the fixed point). Sometimes it perfectly identifies the clusters, but it's hard to tell when this occurs.
I tested two other possibilities for improving the algorithm: (i) estimating `optimal_radius` by an a priori method (the elbow method), without having to optimize it by the iterative method we are using; and (ii) a clustering method (HDBSCAN) that does not need `optimal_radius` at all. But none are as good as the current version. The results are here under "Several seeds for normalized and with average". I describe the methods below also.
Described clearly here: https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd, and in "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise" and "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN".
The idea is to define k = minpts, then find for each point the average distance to its k nearest neighbors. Plot the sorted distances, and find the point of highest derivative; the optimal epsilon is the distance at that point. This is the red dot in the figure below.
The method seems to work OK, but not as well as the iterative silhouette method. It is much faster, though. For some reason, the problem is that the optimal radius (the distance) comes out too small, and so there usually are quite a few outliers.
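A brute-force sketch of this elbow estimate (stdlib only; in the package one would reuse the existing k-d tree instead, and the function name here is hypothetical):

```julia
using Statistics, LinearAlgebra

# k-dist "elbow" estimate for the DBSCAN radius: average distance of each
# point to its k nearest neighbors, sorted over all points; the radius is
# the distance at the largest jump of that sorted curve.
function elbow_radius(features::AbstractMatrix, k::Int = 4)
    N = size(features, 2)
    avg_kdist = map(1:N) do i
        d = [norm(features[:, i] - features[:, j]) for j in 1:N if j != i]
        mean(partialsort!(d, 1:k))   # mean distance to the k nearest points
    end
    sort!(avg_kdist)
    return avg_kdist[argmax(diff(avg_kdist))]
end
```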
There is the HDBSCAN algorithm, which builds on DBSCAN. I didn't take the time yet to really understand it, but it is supposedly an implementation of DBSCAN that considers several radii. Ultimately, it doesn't need the radius as a parameter. Because of this, it is very good at identifying clusters of varying density (which may not necessarily be a good feature for our purposes).
The important parameter is then basically the minimum number of neighbors. For every value of the minimum number of neighbors that I tested, the results were terrible: the worst clustering so far, finding far too many groups. I think this is because the algorithm finds clusters in data of varying density. Since we don't have an enormous number of points, some of the features become isolated, either by themselves or in small groups, and HDBSCAN considers them as clusters.
Fortunately, there is a recent paper (https://arxiv.org/abs/1911.02282) that approaches this problem by doing a hybrid approach with DBSCAN: apply HDBSCAN and then group together any clusters that are within a certain distance. I therefore applied that, with the distance being the optimal radius found using the elbow method. This improves the method a lot! Downside of course is we get the epsilon parameter back, though it is not as relevant as before, in a sense.
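As a sketch of that post-processing step (a hypothetical brute-force helper, not the paper's implementation): merge any two clusters whose closest pair of points lies within `ϵ`.

```julia
using LinearAlgebra

# Merge clusters whose minimum inter-cluster distance is ≤ ϵ.
# points is d × N; labels[i] is the cluster label of column i.
function merge_close_clusters!(labels::Vector{Int}, points::AbstractMatrix, ϵ)
    merged = true
    while merged
        merged = false
        ls = unique(labels)
        for la in ls, lb in ls
            la < lb || continue
            dmin = minimum(norm(points[:, i] - points[:, j])
                           for i in eachindex(labels) if labels[i] == la
                           for j in eachindex(labels) if labels[j] == lb)
            if dmin ≤ ϵ
                labels[labels .== lb] .= la   # relabel lb as la
                merged = true
                break                         # label set changed: rescan
            end
        end
    end
    return labels
end
```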
Applying the same iterative optimization we used before, now for the minimum number of neighbors, makes the method quite good. But I'm still not sure if it is better than the original DBSCAN with epsilon optimization. Maybe for another system.
(quickly letting you know that I did eye surgery and will take some time until I can see code again... But I have saved this in my todo :) )
> (quickly letting you know that I did eye surgery and will take some time until I can see code again... But I have saved this in my todo :) )
Oh, hope everything is ok! Yes, we can resume work on this when you can, no hurry. Wish you a good recovery!
i'm back, i've sent a message on slack.
Now the code should be working. Changes were basically
Test summary is now on my machine:
There are some further improvements that could be made:

- Moving the `Clustering.jl` code into the clustering folder. I tried doing it, but dbscan was failing and at the time I didn't want to spend too long trying to fix it.
- Improving the guesses for `optimal_radius`. Currently, I think the algorithm would fail if the system has one fixed point (may be trivial, but the code should work) because of the way the radii are guessed:

```julia
feat_ranges = maximum(features, dims=2)[:,1] .- minimum(features, dims=2)[:,1];
ϵ_grid = range(minimum(feat_ranges)/200, minimum(feat_ranges), length=200)
```

Also, we are using 200 guesses. Maybe a smarter way could use fewer and speed up the code (which can take some time if there are lots (>1000) of initial conditions). I am guessing that an idea based on the elbow method, or on George's idea, could work.
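One possible guard for the degenerate fixed-point case (a hypothetical sketch, not what the code currently does): if the smallest feature range is zero, the whole ϵ grid collapses to zero, so fall back to a small positive radius.

```julia
# Guess a grid of DBSCAN radii from the feature ranges, guarding against
# all-identical features (e.g. all ics converging to a single fixed point).
function guess_ϵ_grid(features::AbstractMatrix; L = 200)
    feat_ranges = maximum(features, dims = 2)[:, 1] .- minimum(features, dims = 2)[:, 1]
    r = minimum(feat_ranges)
    r == 0 && (r = 1e-3)   # hypothetical fallback value for a zero range
    return range(r / L, r, length = L)
end
```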
wait the proximity and recurrences tests were passing before :D what happened now?
> wait the proximity and recurrences tests were passing before :D what happened now?
Hmm, they weren't for me. I pulled from master and ran the tests before modifying anything, and these errors on the Duffing oscillators were there already. I just ran the tests again on master to confirm. The modifications I made shouldn't affect the other methods anyway. Are there modifications outside master that I missed?
I confirm that the tests also fail on master in my machine. I will look into it.
It seems that the expected basin fractions values are slightly different on my machine: (0.509, 0.491) instead of (0.511, 0.489). It can be due to the RNG. I don't know why this happens but the temporary fix is this in the Duffing test:
expected_fs_raw = Dict(2 => 0.509, 1 => 0.491)
Now all the tests pass on my machine. I don't know why the supervised method is failing in the tests here. There is something different about the knn algorithm... maybe the version is different?
`BoundsError: attempt to access 3-element Vector{Int64} at index [-1]` seems to be the error. Perhaps we should modify the code so that it skips the key `-1` in case the clustering fails...?
> `BoundsError: attempt to access 3-element Vector{Int64} at index [-1]` seems to be the error. Perhaps we should modify the code so that it skips the key `-1` in case the clustering fails...?
But the error is occurring in `knn`, not in our tests:
```
BoundsError: attempt to access 3-element Vector{Int64} at index [-1]
Stacktrace:
 [1] getindex
   @ ./array.jl:861 [inlined]
 [2] knn_point!(tree::KDTree{SVector{2, Float64}, Euclidean, Float64}, point::SVector{2, Float64}, sortres::Bool, dist::Vector{Float64}, idx::Vector{Int64}, skip::typeof(NearestNeighbors.always_false))
   @ NearestNeighbors ~/.julia/packages/NearestNeighbors/YCcEC/src/knn.jl:36
```
Tests pass, so this is good to go, right?
> Tests pass, so this is good to go, right?
Should be, but I haven't touched the supervised/knn part, so there seems to be some variability in that error. It's strange because it never occurred on my machine.
Alright, but you still need to:
Note that the changelog is for the users, so you don't have to mention changes in the tests there :D (it's okay now)
Added changes improving `optimal_radius_dbscan`; there seem to be no more errors now, and the estimation of attractors (clusters) is better.
It worked well for Lorenz84 (though it seems to be better if we normalize the features), giving `basins_fractions` near the correct values (the exact value depends on the transient, how many ics, which features, whether they are normalized or not...). But it did not work well if we restrict the initial conditions to be around the FP only. It should just give one cluster with all features at the FP, but that only works if all features are identical. The period-estimator feature gives some dispersion in the features, which leads to several clusters being found instead of one. I think this is the fault of two parts: the current guess of `optimal_radius` is not good for this case, and the silhouette method is a bit biased towards finding more than one cluster.
Worked for Henon also.
The tests on `attractor_mapping_tests.jl` are not always passing, as the algorithm is not finding the correct values perfectly, and sometimes finds some -1 outliers.

I made 3 changes. To explain them, first let me remind you of the idea behind the algorithm. The way `optimal_radius_dbscan` works is by first defining a range of possible radii for DBSCAN, then iterating over that range to calculate the quality of the clustering for each value, and then choosing the radius with the best clustering. The clustering quality is assessed by the silhouette quantifier.
From Wikipedia: "The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters." It is undefined for only one cluster, but the default value they use is 0 (the midpoint).
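To make the definition concrete, here is a stdlib-only sketch of the silhouette of a single point (an illustrative helper, not the package code):

```julia
using Statistics, LinearAlgebra

# s(i) = (b - a) / max(a, b), where a is the mean distance of point i to its
# own cluster and b is the smallest mean distance to any other cluster.
function silhouette_of(i, points::AbstractMatrix, labels)
    di(j) = norm(points[:, i] - points[:, j])
    a = mean([di(j) for j in axes(points, 2) if labels[j] == labels[i] && j != i])
    b = minimum(mean([di(j) for j in axes(points, 2) if labels[j] == l])
                for l in unique(labels) if l != labels[i])
    return (b - a) / max(a, b)
end
```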
The original way the authors approached this was to identify the radius that maximized the minimum silhouette (raising the lower bound the most). This seems very reasonable, but I found that on Lorenz84 it led to a very high `optimal_radius`, which made it group two clusters together. I changed it to instead maximize the average silhouette value, which made the `optimal_radius` values go down and improved the clustering (now it finds 3 clusters instead of two, much more robustly).
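The difference between the two selection rules in miniature (`sils_per_radius` is a hypothetical vector holding the per-point silhouette values computed at each candidate radius):

```julia
using Statistics

# Index of the best candidate radius under each rule.
best_by_min(sils_per_radius)  = argmax([minimum(s) for s in sils_per_radius])
best_by_mean(sils_per_radius) = argmax([mean(s)    for s in sils_per_radius])

# Example where the rules disagree: radius 1 has a great average but one bad
# point; radius 2 is mediocre but uniform.
sils = [[0.99, 0.99, -0.1], [0.31, 0.31, 0.30]]
best_by_min(sils)   # 2
best_by_mean(sils)  # 1
```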
Also, in the previous version the algorithm was ignoring the 0-value silhouettes when calculating the minimum. I admit I can't remember why I wrote it that way. The authors of bSTAB did not implement it in their code either. So I removed it. This fixes the errors that were happening (because sometimes all silhouette values were 0, and the code broke).
I also changed the value assigned to the silhouette when only one cluster is found. Previously, it was -2 (ignored the one-cluster solution). Now it is 0 (as the default in Wikipedia), so it seems to me it is more fair. But we might need to discuss this.
So there is still more to do: the algorithm can still be improved, I'd still like to run more tests, and of course we need to get `attractor_mapping_tests.jl` to pass. But this is the current version so far, which is already better than before! I can return to this tomorrow.