JuliaDynamics / ChaosTools.jl

Tools for the exploration of chaos and nonlinear dynamics
https://juliadynamics.github.io/DynamicalSystemsDocs.jl/chaostools/stable/
MIT License

Improve clustering #280

Closed KalelR closed 1 year ago

KalelR commented 1 year ago

I took a deeper look into the clustering version, and improved some things:

  1. Added num_attempts_radius so the user can choose how many radii the optimal-radius function tries. Tests (in the AdvancedBasins repo) reveal that even 10 can be enough, instead of the 200 we were using before. A great speedup, basically for free!
  2. New silhouettes function (used to evaluate the quality of the clustering). I'm pretty sure the previous one had bugs, which are fixed now. It's unfortunately slower, and I don't know why. Making it faster would improve the clustering a lot, since it's called every iteration. The only way I can think of involves re-writing it almost completely; that might be worth it if we want maximum possible speed.
  3. Switched to always using the version of dbscan that receives the distance matrix; now we can use any metric we want (just calculate dists = pairwise(metric, features)). Previously we weren't using it (my bad!). As a consequence, I needed to define a new function for getting the cluster labels (same labelling standard as before, so it doesn't affect the final results).
  4. No more calculation of cluster_errors; we weren't using them anyway.
  5. Return the optimal clustering directly from the optimal-radius function. For this, I needed to create the get_clusterlabels function, to handle each optimal-radius method separately. Previously, we calculated the clusters to calculate the silhouettes, found the radius with the optimal silhouette, and then calculated the clusters again with the optimal radius. That wastes one call; it's more efficient to just save the clusters as we are looking for the optimal radius.

So we can get a large speedup by reducing num_attempts_radius, and the silhouettes version is more correct (though slower). One to-do would be to improve the silhouettes function; I'm open to suggestions. Otherwise I can take another look later.
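For reference, the standard per-point silhouette definition that such a function evaluates can be sketched in a few lines. This is a minimal dependency-free illustration from a precomputed distance matrix, not the PR's silhouettes_new (the helper name is hypothetical):

```julia
# Silhouette of point i: a(i) = mean distance to its own cluster,
# b(i) = smallest mean distance to any other cluster, s(i) = (b-a)/max(a,b).
# `labels` are integer cluster labels, one per point.
function silhouette_scores(dists::AbstractMatrix, labels::AbstractVector{Int})
    n = length(labels)
    clusters = unique(labels)
    length(clusters) == 1 && return zeros(n)  # single cluster -> all silhouettes 0
    s = zeros(n)
    for i in 1:n
        own = [j for j in 1:n if labels[j] == labels[i] && j != i]
        isempty(own) && continue  # singleton cluster: s(i) = 0 by convention
        a = sum(dists[i, j] for j in own) / length(own)
        b = minimum(  # smallest mean distance to any other cluster
            sum(dists[i, j] for j in 1:n if labels[j] == c) / count(==(c), labels)
            for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    end
    return s
end
```

Two well-separated clusters should then score close to 1, and a single cluster scores 0 everywhere.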

Datseris commented 1 year ago

Have you tried using Optim.jl to find the minimum of the function you are trying to minimize with the silhouettes?

Datseris commented 1 year ago

(I'll look at the rest later)

KalelR commented 1 year ago

Have you tried using Optim.jl to find the minimum of the function you are trying to minimize with the silhouettes?

No, for two reasons

  1. Since we can decrease num_attempts_radius to even 10, optimizing the iterative procedure wouldn't improve the speed that much.
  2. It seemed to me that Optim.jl only works for functions with an explicit dependence on the variables. In our case, we want to optimize the silhouettes, which depend on the clustering, with all its details. Is there still a way to make it work?
Datseris commented 1 year ago

of course, define a function that computes

dists = pairwise(metric, features)  # does not depend on ε, so compute it once
f = function (ε)
        clusters = dbscan(dists, ε, min_neighbors)
        sils = silhouettes_new(clusters, dists)
        return -mean(sils)
end

given ε. Return minus the mean, because Optim finds minima, not maxima. Here is an example of how I use Optim to find e.g. the closest point of a trajectory to a point: https://github.com/JuliaDynamics/ChaosTools.jl/pull/278/files#diff-d0c1add27f2ef330d5bb91f4db3fc25eddf25e6972914ecb19bc0385b2c55874R181

(please make it a different method, like silhouettes_optim)
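For readers unfamiliar with Optim.jl: for a univariate bounded problem like this, `Optim.optimize(f, ε_low, ε_high)` uses Brent's method by default, and `Optim.minimizer` extracts the optimum. The same idea can be illustrated dependency-free with a golden-section search (a stand-in for Brent's method, for illustration only; not the code from this PR):

```julia
# Golden-section search for the minimum of a unimodal f on [a, b].
function golden_section_min(f, a, b; tol = 1e-6)
    φ = (sqrt(5) - 1) / 2  # inverse golden ratio ≈ 0.618
    c, d = b - φ * (b - a), a + φ * (b - a)
    while abs(b - a) > tol
        if f(c) < f(d)         # minimum lies in [a, d]
            b, d = d, c
            c = b - φ * (b - a)
        else                   # minimum lies in [c, b]
            a, c = c, d
            d = a + φ * (b - a)
        end
    end
    return (a + b) / 2
end
```

With Optim.jl one would instead pass the negative-mean-silhouette closure directly: `ε_optimal = Optim.minimizer(Optim.optimize(f, ε_low, ε_high))`.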

KalelR commented 1 year ago

This is cool, Optim is much more powerful than I had thought. It works now. Some tests I made:

Running basins_fractions with num_attempts = 200:
no optim: 5.4 s; optim: 1.4 s; original: 7.7 s

Running the optimal_radius function directly:
num_attempts = 20: no optim: 402.8 ms; optim: 305 ms
num_attempts = 200: no optim: 5.1 s; optim: 564 ms

A small point: for the optim function, the interval of radii values needs to start at quite a small value (I fixed it to be the minimum range of the features divided by 100). Bigger starting values lead to it finding worse radii (with smaller silhouettes) than the version without Optim.
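The lower bound described above can be sketched as follows, assuming the features are stored as a dims × samples matrix. The function name and the choice of the largest feature range as upper bound are illustrative assumptions, not the PR's code:

```julia
# Candidate DBSCAN radius interval: lower bound is the smallest feature
# range divided by 100 (as described above); the upper bound here is an
# illustrative assumption (the largest feature range).
function radius_interval(features::AbstractMatrix)
    ranges = [maximum(row) - minimum(row) for row in eachrow(features)]
    return minimum(ranges) / 100, maximum(ranges)
end
```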

Datseris commented 1 year ago

That is awesome. So Optim.jl seems to be always the best. One more improvement we can highlight in our paper ;)

Please finish the PR w.r.t. documentation and tests (in the clustering tests we have now, simply loop over all of the different clustering methods). Then I can review and, if no issues arise, merge directly!

KalelR commented 1 year ago

Done.

Currently, the "original" method is our original implementation, which was already an improvement on the bSTAB author's implementation (e.g. optimizing the mean silhouette instead of the minimum). I left it that way so we could compare our improvements, but I'd say we can just delete it after we are done. Alternatively, it could be useful to revert it to the author's actual original implementation, which is what the docs currently claim it is.

Datseris commented 1 year ago

Please revert it to the author's original implementation then ;) Please add a comment in the source code that using mean instead of minimum is better, but now we have even better methods so we don't bother with it at all!

Datseris commented 1 year ago

@maximilian-gelbrecht how do you guys find the optimal radius for the clustering?

KalelR commented 1 year ago

Done. It's indeed worse than using mean; it doesn't even pass the tests (I checked on Hénon and Lorenz, and it wasn't finding a good radius: it missed an entire attractor in Lorenz).

maximilian-gelbrecht commented 1 year ago

@maximilian-gelbrecht how do you guys find the optimal radius for the clustering?

Mostly based on the k-dist graph as suggested in the Ester et al. DBSCAN paper: either by finding the knee/elbow in the sorted graph, or by taking some other statistic/quantile of the cumulative k-dist.
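The k-dist graph mentioned above can be sketched from a distance matrix in a couple of lines. This is a minimal illustration of the Ester et al. construction, not the commenters' code:

```julia
# k-dist graph: for each point, the distance to its k-th nearest neighbour,
# sorted ascending. The knee/elbow of this curve is a candidate DBSCAN radius.
function kdist(dists::AbstractMatrix, k::Int)
    n = size(dists, 1)
    kd = [sort(dists[:, i])[k + 1] for i in 1:n]  # +1 skips the zero self-distance
    return sort(kd)
end
```

An outlier then shows up as a jump at the end of the sorted curve, which is what knee-detection exploits.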

We also experimented with continuing the integration from slightly perturbed initial conditions and deriving a radius from this. However, this led to very similar results as the k-dist graph methods. It did make me think about a more specialised density-based clustering, though, which I called BBClustering; see the attached file. I implemented it and tested it, but then ran out of time to do anything with it. Maybe you'll find it interesting. From the appendix of my PhD thesis: bbclustering.pdf

Datseris commented 1 year ago

@KalelR The optim method was clearly the best, right? Because right now it's not the default. Shouldn't we make it the default?

Datseris commented 1 year ago

I'm making some quick changes now, and will also do that.

Datseris commented 1 year ago

@KalelR would you mind cleaning up the source code a bit...? There are too many functions being called, too many functions in the source code, and too little harmony between the names. E.g., we have:

"""
Finds the cluster labels for each of the optimal radius methods. The labels are either
`-1` for unclustered points or 1...numberclusters for clustered points.
"""
function _get_clusterlabels(features, min_neighbors, metric, optimal_radius_method,
    num_attempts_radius)
    if optimal_radius_method == "silhouettes_mean"
        cluster_labels = findcluster_optimal_radius_dbscan_silhouette(
            features, min_neighbors, metric; num_attempts_radius
        )
    elseif optimal_radius_method == "silhouettes_original"
        ϵ_optimal = optimal_radius_dbscan_silhouette_original(
            features, min_neighbors, metric; num_attempts_radius
        )
        clusters = dbscan(features, ϵ_optimal; min_neighbors)
        clusters, sizes = sort_clusters_calc_size(clusters)
        cluster_labels = cluster_assignment(clusters, features; include_boundary=false)
    elseif optimal_radius_method == "silhouettes_optim"
        ϵ_optimal = optimal_radius_dbscan_silhouette_optim(
            features, min_neighbors, metric; num_attempts_radius
        )
        dists = pairwise(metric, features)
        dbscanresult = dbscan(dists, ϵ_optimal, min_neighbors)
        cluster_labels = cluster_assignment(dbscanresult)
    elseif optimal_radius_method == "knee"
        ϵ_optimal = optimal_radius_dbscan_elbow(features, min_neighbors, metric)
        dists = pairwise(metric, features)
        dbscanresult = dbscan(dists, ϵ_optimal, min_neighbors)
        cluster_labels = cluster_assignment(dbscanresult)
    else
        error("Unknown `optimal_radius_method`.")
    end
    return cluster_labels
end

What's this function doing? It finds both the optimal radius, but also clusters the features with the optimal radius. Why not just separate it into two functions, one find_optimal_radius and the other that does the clustering given the optimal radius?

Then, why do we have

    if optimal_radius_method == "silhouettes_mean"
        cluster_labels = findcluster_optimal_radius_dbscan_silhouette(
            features, min_neighbors, metric; num_attempts_radius
        )
    elseif optimal_radius_method == "silhouettes_original"
        ϵ_optimal = optimal_radius_dbscan_silhouette_original(
            features, min_neighbors, metric; num_attempts_radius
        )
        clusters = dbscan(features, ϵ_optimal; min_neighbors)
        clusters, sizes = sort_clusters_calc_size(clusters)
        cluster_labels = cluster_assignment(clusters, features; include_boundary=false)

As far as I can tell, you can create one function

findcluster_optimal_radius_dbscan_silhouette(
            features, min_neighbors, metric, mean; num_attempts_radius
        )

notice the fourth argument mean, which becomes minimum if a user chooses the "original" version instead. Also, please separate finding the optimal radius in a section, with a section header as is done in other files in ChaosTools.jl, and define in that section an explicit function find_optimal_dbscan_radius.
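The suggested separation could look roughly like this. All names and signatures here are illustrative sketches of the proposal, not the final API, and the first body is elided:

```julia
# Sketch only: one function finds the radius, another clusters with it.
function find_optimal_dbscan_radius(features, min_neighbors, metric,
        silhouette_statistic; num_attempts_radius = 10)
    # ... scan candidate radii, score each clustering with
    # `silhouette_statistic` (mean, or minimum for the "original" method),
    # and return the best ε ...
end

function cluster_features_with_radius(features, ε_optimal, min_neighbors, metric)
    dists = pairwise(metric, features)
    dbscanresult = dbscan(dists, ε_optimal, min_neighbors)
    return cluster_assignment(dbscanresult)
end
```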

Datseris commented 1 year ago

Here is how a function cluster_features_clustering should look like:

  1. It exits immediately if there aren't enough features for clustering. BTW, perhaps we should throw a warning there as well.
  2. It calls a normalization step.
  3. It calls the optimal-radius function.
  4. It gives the optimal radius to a clustering function that expects the optimal radius as an argument.
  5. It extracts the labels from the clustering result, if this isn't automatic.
  6. It returns the labels.

Each one of these numbers should be only a single level of abstraction. This means, it calls a function that does the thing.

See here https://youtu.be/x3swaMSCcYk?t=3512 , especially the slide on "one level of abstraction", that comes a couple of minutes later.
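As a sketch, the six steps could read like this, each step a single function call at one level of abstraction. Every helper name here is hypothetical:

```julia
function cluster_features_clustering(features, config)
    # 1. exit early (perhaps with a warning) if there are too few features
    enough_features(features, config) || return no_cluster_labels(features)
    features = normalize_features(features, config)            # 2. normalization
    ε_optimal = find_optimal_dbscan_radius(features, config)   # 3. optimal radius
    dbscanresult = cluster_with_radius(features, ε_optimal, config)  # 4. cluster
    return cluster_assignment(dbscanresult)                    # 5.-6. labels
end
```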

KalelR commented 1 year ago

Hey, ok

I wrote the code that way because the silhouettes method already calculates the clusters while it is looking for the optimal radius. So I wrote findclusters_optimal_radius_dbscan_silhouette to save the clusters as it searched for the optimal one; this saves one call to dbscan. I had to create the get_clusterlabels function to deal with each case. But since one call to dbscan is pretty fast, and symmetry in code is of course important, I'll just return the optimal radius and re-calculate the clusters in the mean case. This way, we can do as you suggest, which is indeed much clearer.

notice the fourth argument mean

That's a great way to do it. I think it's even better to let the user control the function more explicitly, so I added the parameter statistic_silhouette, which defaults to mean. I mention in the docs that the original uses minimum.

In the future, we might further improve the optim or the linear-search functions (e.g. by thinking of better guesses for the radii). So I think leaving the original function as it was originally written gives us more freedom to improve them without worrying about losing the original implementation. I left it in the source code, but I am not calling it anywhere. Feel free to delete it if you feel it's unnecessary.

Datseris commented 1 year ago

There are some functions that are not used in the source code, right? sort_clusters_calc_size and the second version of cluster_assignment aren't used for example.

KalelR commented 1 year ago

There are some functions that are not used in the source code, right? sort_clusters_calc_size and the second version of cluster_assignment aren't used for example.

Right, they can be deleted, sorry, I forgot. I'm not at home now; I can do that later.

Datseris commented 1 year ago

I'll do it, no worries!

Datseris commented 1 year ago

Source code is cleaned. But there is still some work here. The new methods are not tested. In the clustering file, clustering_tests.jl you need to make a loop that tests all methods that are implemented with these analytic tests.

The second thing not clear to me is how to utilize this new interface in the basin fractions continuation... There are in practice two things I need to do:

  1. Add the parameter to the feature vector. Easy, I can do this already.
  2. Define a custom metric, compute a custom distance matrix, and use that for clustering. How do I pass a custom distance matrix in cluster_features?
KalelR commented 1 year ago

Why do this

    silhouette_statistic = if optimal_radius_method == "silhouettes_mean"
        mean
    else
        minimum
    end

Instead of just letting silhouette_statistic be passed as a parameter, as before? This is much less intuitive and flexible in my view. What if I want to use minimum with Optim? Also, silhouettes_optim uses minimum with this, right? The Optim test on Lorenz is now erroring because of this...


In clustering_tests, I added a loop over the silhouettes methods. I noticed with this that the current choice of candidate radii is quite sensitive to extremes in features, and especially the Optim case is vulnerable because it is not guaranteed to find the global optimum. Maybe in the future we could design a better guess of candidate radii, based on other statistics of the features, instead of their extremes, or maybe adapting the knee method for a good initial guess.


The knee method by itself is quite bad, and it generally does not pass the tests that the others do. So I made a less strict test for it: just find the correct number of attractors. I don't know what happened, but the bulksearch method used in the knee method must have changed recently, because the method started erroring. As before, bulksearch receives features, which is a matrix. But it seems to iterate over queries as queries[j], as if queries were a vector of vectors, not a matrix, which led to the error. I don't know when this change occurred. One solution would be to define the function for matrices; another would be to define features as a vector of vectors in knee. I am very busy with other projects right now, and I was struggling with the first option, so I just implemented the second.


Define a custom metrix and compute a custom distance matrix and use that for clustering. How do I pass a custom distance matrix in cluster_features?

With the way distances are calculated, any metric compatible with pairwise should work fine, even custom ones such as the one in Gelbrecht's paper. If we can't use pairwise, then passing the distances as a parameter in ClusteringConfig is probably the solution.
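To make this concrete: any callable metric can produce a distance matrix of the same shape that pairwise(metric, features) yields for a proper Distances.jl metric. A dependency-free sketch with a hypothetical Chebyshev-like metric, features stored as columns:

```julia
# Hypothetical custom metric, for illustration only.
my_metric(x, y) = maximum(abs.(x .- y))

features = [0.0 1.0 5.0;   # each column is one feature vector
            0.0 2.0 5.0]
n = size(features, 2)
# Same shape of result that `pairwise` produces for a Distances.jl metric:
# a symmetric n × n matrix with zeros on the diagonal.
dists = [my_metric(features[:, i], features[:, j]) for i in 1:n, j in 1:n]
```

A matrix like this can then be fed directly to the distance-matrix method of dbscan discussed earlier in the thread.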

Datseris commented 1 year ago

In the "original" code you have the clause:

s_grid[i] = 0; # considers single-cluster solution on the midpoint (following Wikipedia)

if only one cluster exists. This doesn't exist in the "normal" method we use now, that can have an arbitrary statistic. Is this on purpose?

KalelR commented 1 year ago

In the "original" code you have the clause:

s_grid[i] = 0; # considers single-cluster solution on the midpoint (following Wikipedia)

if only one cluster exists. This doesn't exist in the "normal" method we use now, that can have an arbitrary statistic. Is this on purpose?

Yes, I added a similar clause to the silhouettes_new function:
if length(clusters) == 1 return zeros(length(clusters[1])) end # all points in the same cluster -> sil = 0

so that the function itself already deals with that case.

Datseris commented 1 year ago

Okay, I am merging this in. In #279 we can see how to allow different ways of clustering across parameters.