KalelR closed this 1 year ago
Have you tried using Optim.jl to find the minimum of the function you are trying to minimize with the silhouettes?
(I'll look at the rest later)
> Have you tried using Optim.jl to find the minimum of the function you are trying to minimize with the silhouettes?
No, for two reasons: since we can reduce `num_attempts_radius` to even 10, optimizing the iterative procedure wouldn't improve the speed that much.

Of course; define a function that computes
```julia
f = function (ε)
    dists = pairwise(metric, features)
    clusters = dbscan(dists, ε, min_neighbors)
    sils = silhouettes_new(clusters, dists)
    return -mean(sils)
end
```
given ε. Return minus the mean, because Optim finds minima, not maxima. Here is an example of how I use Optim to find e.g. the closest point of a trajectory to a point: https://github.com/JuliaDynamics/ChaosTools.jl/pull/278/files#diff-d0c1add27f2ef330d5bb91f4db3fc25eddf25e6972914ecb19bc0385b2c55874R181
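A minimal sketch of what this could look like with Optim.jl's bounded univariate optimizer (Brent's method). Here `silhouettes_new` stands in for the package-internal silhouette routine, and `εlow`/`εhigh` are assumed interval bounds; this is an illustration, not the final implementation:

```julia
using Optim, Distances, Clustering, Statistics

# Sketch only: `silhouettes_new` is assumed to be the package-internal silhouette
# routine; `εlow`/`εhigh` are assumed bounds of the radius search interval.
function optimal_radius_via_optim(features, min_neighbors, metric, εlow, εhigh)
    dists = pairwise(metric, features)  # independent of ε, so compute it only once
    f(ε) = -mean(silhouettes_new(dbscan(dists, ε, min_neighbors), dists))
    result = optimize(f, εlow, εhigh, Brent())  # Optim minimizes, hence the minus sign
    return Optim.minimizer(result)
end
```

Brent's method needs only the interval bounds, no derivative, which fits a noisy objective like the mean silhouette well.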
(Please make it a different method, like `silhouettes_optim`.)
This is cool, Optim is much more powerful than I had thought. It works now; some tests I made:

Running basins_fractions with num_attempts = 200:
- no optim: 5.4 s
- optim: 1.4 s
- og: 7.7 s

Running the optimal_radius function directly:
- num_attempts = 20: no optim: 402.8 ms; optim: 305 ms
- num_attempts = 200: no optim: 5.1 s; optim: 564 ms
A small point: for the optim function, the interval of radii values needs to start at a quite small value (I fixed it to be the minimum range of the features divided by 100). Bigger values lead to it finding worse radii (with smaller silhouettes) than the version without optim.
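In code, that interval choice might look like the following sketch (assuming the features are stored with one feature vector per column):

```julia
# Sketch of the radius search interval described above.
# Assumption: `features` holds one feature vector per column.
features = [0 1 2;
            0 2 4]
feat_ranges = vec(maximum(features, dims = 2) .- minimum(features, dims = 2))
εlow = minimum(feat_ranges) / 100  # small lower bound; bigger values led to worse radii
εhigh = maximum(feat_ranges)       # upper bound: the largest feature range
```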
That is awesome. So Optim.jl seems to always be the best. One more improvement we can highlight in our paper ;)
Please finish the PR w.r.t. documentation and tests (in the clustering tests we have now, simply loop over all of the different clustering ways). Then I can review and if no issues arise, merge directly!
Done.

Currently, the original method is our original implementation, which is already an improvement on the bSTAB author's implementation (e.g., it optimizes the mean silhouette instead of the minimum). I left it that way so we could compare our improvements, but I'd say we can just delete it after we are done. Alternatively, it could be useful to revert it to the author's original implementation, which is what the docs currently claim it is.
Please revert it to the author's original implementation then ;) Please add a comment in the source code that using mean instead of minimum is better, but that we now have even better methods, so we don't bother with it at all!
Done. It's indeed worse than using mean; it doesn't even pass the tests (I checked on Hénon and Lorenz and it wasn't finding a good radius; it missed an entire attractor in Lorenz).
@maximilian-gelbrecht how do you guys find the optimal radius for the clustering?
Mostly based on the k-dist graph, as suggested in the Ester et al. DBSCAN paper: either by finding the knee/elbow in the sorted graph or by taking some other statistic/quantile of the cumulative k-dist.

We also experimented with continuing the integration from slightly perturbed initial conditions and deriving a radius from this. However, this led to results very similar to those of the k-dist graph methods. It did make me think about a more specialised density-based clustering, though. I called it BBClustering, see the attached file. I implemented it and tested it, but then ran out of time to do something with it; maybe you find it interesting. From the appendix of my PhD thesis: bbclustering.pdf
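For reference, the k-dist heuristic mentioned above can be sketched in a few lines (a pure-Julia illustration, not code from any of the packages discussed here):

```julia
using LinearAlgebra

# Illustrative sketch of the k-dist heuristic from the Ester et al. DBSCAN paper:
# take each point's distance to its k-th nearest neighbor, sort these values,
# and read the radius off the "knee" of the resulting curve.
function kdist_radius(features::Vector{<:AbstractVector}, k::Int)
    n = length(features)
    kdists = map(1:n) do i
        d = sort!([norm(features[i] - features[j]) for j in 1:n if j != i])
        d[k]  # distance to the k-th nearest neighbor
    end
    sort!(kdists)
    # crude knee estimate: point of maximum discrete curvature (second difference)
    curv = [kdists[i+1] - 2kdists[i] + kdists[i-1] for i in 2:n-1]
    return kdists[argmax(curv) + 1]
end
```

A real implementation would use a neighbor-search structure instead of the brute-force loop, but the heuristic itself is this simple.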
@KalelR The optim method was clearly the best, right? But it's not the default. Shouldn't we make it the default?
I'm making some quick changes now, and will also do that.
@KalelR would you mind cleaning up the source code a bit...? There are too many functions being called, too many functions in the source code, and too little harmony between the names... E.g., we have:
"""
Finds the cluster labels for each of the optimal radius methods. The labels are either
`-1` for unclustered points or 1...numberclusters for clustered points.
"""
function _get_clusterlabels(features, min_neighbors, metric, optimal_radius_method,
num_attempts_radius)
if optimal_radius_method == "silhouettes_mean"
cluster_labels = findcluster_optimal_radius_dbscan_silhouette(
features, min_neighbors, metric; num_attempts_radius
)
elseif optimal_radius_method == "silhouettes_original"
ϵ_optimal = optimal_radius_dbscan_silhouette_original(
features, min_neighbors, metric; num_attempts_radius
)
clusters = dbscan(features, ϵ_optimal; min_neighbors)
clusters, sizes = sort_clusters_calc_size(clusters)
cluster_labels = cluster_assignment(clusters, features; include_boundary=false)
elseif optimal_radius_method == "silhouettes_optim"
ϵ_optimal = optimal_radius_dbscan_silhouette_optim(
features, min_neighbors, metric; num_attempts_radius
)
dists = pairwise(metric, features)
dbscanresult = dbscan(dists, ϵ_optimal, min_neighbors)
cluster_labels = cluster_assignment(dbscanresult)
elseif optimal_radius_method == "knee"
ϵ_optimal = optimal_radius_dbscan_elbow(features, min_neighbors, metric)
dists = pairwise(metric, features)
dbscanresult = dbscan(dists, ϵ_optimal, min_neighbors)
cluster_labels = cluster_assignment(dbscanresult)
else
error("Unkown `optimal_radius_method`.")
end
return cluster_labels
end
What's this function doing? It finds the optimal radius, but it also clusters the features with that radius. Why not separate it into two functions: one, `find_optimal_radius`, and another that does the clustering given the optimal radius?
Then, why do we have
```julia
if optimal_radius_method == "silhouettes_mean"
    cluster_labels = findcluster_optimal_radius_dbscan_silhouette(
        features, min_neighbors, metric; num_attempts_radius
    )
elseif optimal_radius_method == "silhouettes_original"
    ϵ_optimal = optimal_radius_dbscan_silhouette_original(
        features, min_neighbors, metric; num_attempts_radius
    )
    clusters = dbscan(features, ϵ_optimal; min_neighbors)
    clusters, sizes = sort_clusters_calc_size(clusters)
    cluster_labels = cluster_assignment(clusters, features; include_boundary=false)
```
As far as I can tell, you can create one function:

```julia
findcluster_optimal_radius_dbscan_silhouette(
    features, min_neighbors, metric, mean; num_attempts_radius
)
```

Notice the fourth argument, `mean`, which becomes `minimum` if a user chooses the "original" version instead.
Also, please separate finding the optimal radius into its own section, with a section header as is done in other files in ChaosTools.jl, and define in that section an explicit function `find_optimal_dbscan_radius`.
Here is how a function `cluster_features_clustering` should look:

Each one of these steps should be only a single level of abstraction. This means it calls a function that does the thing. See here: https://youtu.be/x3swaMSCcYk?t=3512 , especially the slide on "one level of abstraction", which comes a couple of minutes later.
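A sketch of what such a single-level-of-abstraction structure could look like (all names and stub bodies here are illustrative, not the package's actual code):

```julia
using Statistics

# Illustrative stubs: the real functions live in the package; only the *shape*
# of the top-level function matters here.
compute_distances(features, metric) = [metric(a, b) for a in features, b in features]
find_optimal_dbscan_radius(dists) = median(dists)  # stub for the silhouette optimization
cluster_with_radius(dists, ε, min_neighbors) =
    [count(<(ε), col) > min_neighbors ? 1 : -1 for col in eachcol(dists)]  # stub

# Each line is one call, one level of abstraction:
function cluster_features_clustering(features; metric = (a, b) -> abs(a - b), min_neighbors = 2)
    dists = compute_distances(features, metric)
    ε = find_optimal_dbscan_radius(dists)
    cluster_labels = cluster_with_radius(dists, ε, min_neighbors)
    return cluster_labels
end
```

The top-level function reads as a summary of the algorithm; every detail is pushed one level down.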
Hey, ok
I wrote the code that way because the silhouettes method already calculates the clusters while it is looking for the optimal radius. So I wrote `findclusters_optimal_radius_dbscan_silhouette` to save the clusters as it was looking for the optimal one; it saves one call to dbscan. I had to create the `get_clusterlabels` function to deal with each case. But since one call to dbscan is pretty fast, and symmetry in code is of course important, I'll just return the optimal radius and re-calculate the clusters in the `mean` case. This way, we can do as you suggest, which is indeed much clearer.
> Notice the fourth argument, `mean`
That's a great way to do it. I think it's even better to let the user control the function more explicitly. I added the parameter `silhouette_statistic`, which defaults to `mean`. I mention in the docs that the original uses `minimum`.
In the future, we might further improve the optim or the linear-search functions (e.g., maybe with better guesses of the radii). So I think leaving the `original` function as it was originally written gives us more freedom to improve them without worrying about losing the original implementation. I left it in the source code, but I am not calling it anywhere. Feel free to delete it if you feel it's unnecessary.
There are some functions that are not used in the source code, right? `sort_clusters_calc_size` and the second version of `cluster_assignment` aren't used, for example.
> There are some functions that are not used in the source code, right? `sort_clusters_calc_size` and the second version of `cluster_assignment` aren't used, for example.
Right, they can be deleted, sorry, I forgot. I'm not at home now; I can do that later.
I'll do it, no worries!
Source code is cleaned. But there is still some work left here. The new methods are not tested: in the clustering test file, clustering_tests.jl, you need to make a loop that tests all implemented methods with these analytic tests.

The second thing not clear to me is how to utilize this new interface in the basin fractions continuation... There are in practice two things I need to do with `cluster_features`.

Why do this:
```julia
silhouette_statistic = if optimal_radius_method == "silhouettes_mean"
    mean
else
    minimum
end
```
instead of just letting `silhouette_statistic` be passed as a parameter, as before? This is much less intuitive and flexible in my view. What if I want to use `minimum` with Optim? Also, `silhouettes_optim` uses `minimum` with this, right?
The test of Optim in Lorenz is now erroring because of this...
In clustering_tests, I added a loop over the silhouettes methods.

I noticed with this that the current choice of candidate radii is quite sensitive to extremes in the features, and the Optim case especially is vulnerable, because it is not guaranteed to find the global optimum. Maybe in the future we could design a better guess of candidate radii based on other statistics of the features instead of their extremes, or maybe adapt the `knee` method for a good initial guess.
The knee method by itself is quite bad, and it generally does not pass the tests that the others do. So I made a less strict test for it: just find the correct number of attractors.

I don't know what happened, but the `bulksearch` method used in the `knee` method must have changed recently, because the method started erroring. As before, `bulksearch` receives `features`, which is a matrix. But it seems to iterate over the queries as `queries[j]`, as if `queries` were a vector of vectors, not a matrix, which led to the error. I don't know when this change occurred. One solution would be to define the function for matrices; another would be to define the features as a vector of vectors in `knee`. I am very busy now with other projects, and I was struggling with the first option, so I just implemented the second.
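For the record, the second option is essentially a one-liner (assuming, as above, one feature vector per column of the matrix):

```julia
# Assumption: each column of `features_mat` is one feature vector.
features_mat = [0.0 1.0 2.0;
                0.0 1.0 2.0]                    # 2×3 matrix: three 2-D feature vectors
features_vec = collect.(eachcol(features_mat))  # Vector of Vectors, one per column
```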
> Define a custom metric, compute a custom distance matrix, and use that for clustering. How do I pass a custom distance matrix in cluster_features?
With the way distances are calculated, any metric compatible with `pairwise` should work fine, even custom ones such as the one in Gelbrecht's paper. If we can't use `pairwise`, then probably passing the distances as a parameter in ClusteringConfig is the solution.
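As a sketch of what that could look like with Distances.jl (the metric here is just an illustrative Chebyshev-like distance, not the one from Gelbrecht's paper):

```julia
using Distances

# Illustrative custom metric, following the Distances.jl convention of
# subtyping `Metric` and making the type callable.
struct MyCustomDistance <: Metric end
(d::MyCustomDistance)(x, y) = maximum(abs.(x .- y))  # stand-in for a real custom distance

features = rand(3, 10)                                    # 10 feature vectors, one per column
dists = pairwise(MyCustomDistance(), features, dims = 2)  # 10×10 distance matrix
```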
In the "original" code you have the clause:
s_grid[i] = 0; # considers single-cluster solution on the midpoint (following Wikipedia)
if only one cluster exists. This doesn't exist in the "normal" method we use now, that can have an arbitrary statistic. Is this on purpose?
In the "original" code you have the clause:
s_grid[i] = 0; # considers single-cluster solution on the midpoint (following Wikipedia)
if only one cluster exists. This doesn't exist in the "normal" method we use now, that can have an arbitrary statistic. Is this on purpose?
Yes, I added a similar clause to the `silhouettes_new` function:

```julia
if length(clusters) == 1
    return zeros(length(clusters[1])) # all points in the same cluster -> silhouette = 0
end
```

so that the function itself already deals with that case.
Okay, I am merging this in. In #279 we can see how to allow different ways of clustering across parameters.
I took a deeper look into the clustering version, and improved some things:

- Added `num_attempts_radius`, for the user to choose how many radii the optimal-radius function tries out. Tests (in the AdvancedBasins repo) reveal that even 10 can be enough, instead of the 200 we were using before. A great speedup, basically for free!
- Added a `get_clusterlabels` function, to handle each optimal radius method separately. Previously, we calculated the clusters to calculate the silhouettes, then found the radius with the optimal silhouette, and then calculated the clusters again with the optimal radius. This wastes one call; it's more efficient to just save the clusters as we are looking for the optimal radius.

So we can get a large speedup by reducing `num_attempts_radius`, and the silhouettes version is more correct (though slower). One to-do would be to improve the silhouettes function; I'm open to suggestions. Otherwise I can take another look later.