BSchilperoort opened this issue 1 year ago
Cool! I stumbled upon this method a long time ago thinking I should revisit but I completely forgot!
Okay guys, I have spent some time on this. HDBSCAN is in principle an improvement over DBSCAN, but I'm not really sure yet whether it is a real improvement for us. I'll give some explanation here. I can also give a presentation showing a notebook soon.
The main improvement of HDBSCAN over DBSCAN is that it does not use a single lambda parameter (the eps parameter) to determine the number of clusters. Instead, it maximizes the total sum of persistence of the clusters under the constraint that the chosen clusters are non-overlapping. A bit less formally: it checks whether splitting one cluster into two results in more 'mass' than before. If it does, it splits the cluster; if it doesn't, it keeps it as one. That way, it determines the lambda parameters itself.
Source: https://pberba.github.io/stats/2020/01/17/hdbscan/
As promised, the only parameter that needs tuning is the minimum cluster size. It is intuitive to use, because you can indicate that you only want clusters larger than, say, 5 cells. This is arguably better than the eps_km parameter, which requires the user to have some idea about the spatial scale of the data. However, although min_cluster_size is easy to use, it can also lead to 'cutoff' scenarios: if the only regions present are smaller than 5 cells, the (default) setting of 5 leads to no regions being found at all. So does it lead to more robust clusters? I don't know, to be honest.
I also tested the speed in the notebook, and HDBSCAN does not look much faster than DBSCAN; it was actually slower in my case.
We (@semvijverberg and @geek-yang) discussed this already a bit, and one way to proceed could be to use HDBSCAN with min_cluster_size set to 2 (the lowest setting) and then use @BSchilperoort's extra layer of removing areas with min_area_km2. Maybe we could also look at the correlation of the time series between regions, as @semvijverberg has suggested.
Thanks for exploring this, Jannes! I have some questions!
Just saw your post. We discussed the results last Wednesday.
- Did you test it on high resolution data? (instead of the very coarse data we have for testing).
Jannes tested it on a larger dataset with higher resolution. But the results are similar to those with coarse data.
- Those plots are nice, but are mostly for data with many more points than what we have. What does HDBSCAN's `clusterer.condensed_tree_.plot()` look like for the s2s data?
@jannesvaningen Can you comment on it?
- Do the clusters come out basically the same with HDBSCAN?
The clusters are similar in general, though some details differ. But as with DBSCAN, the results are not very robust, especially for the edge points.
These methods are designed to cluster data based on density, which in practice means differences in distance. However, since our data is on a structured grid, that is difficult in some cases. We might be able to get robust results with unevenly distributed data, I guess. In ocean modelling, the data is always on an unstructured grid, so maybe we can test our methods on some oceanic reanalysis data, e.g. ORAS5 or SODA3.
Anyway, I think HDBSCAN is a nice option to add; at the least it provides an alternative for the user.
I recently stumbled upon the alternative clustering method HDBSCAN.
Not only that, it also seems to be essentially a drop-in replacement for DBSCAN, which we currently use, so this could be quite interesting to explore to make RGDR more robust as well as perform better.