AI4S2S / s2spy

A high-level Python package integrating expert knowledge and artificial intelligence to boost (sub)seasonal forecasting
https://ai4s2s.readthedocs.io/
Apache License 2.0

Explore `HDBSCAN` as a replacement for `DBSCAN` in RGDR #136

Open BSchilperoort opened 1 year ago

BSchilperoort commented 1 year ago

I recently stumbled upon the alternative clustering method HDBSCAN. Its documentation promises the following:

Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning -- and the primary parameter, minimum cluster size, is intuitive and easy to select.

And also:

In particular performance on low dimensional data is better than sklearn's DBSCAN

Not only that, it seems to be basically a drop-in replacement for the DBSCAN we currently use, so it could be quite interesting to explore whether it makes RGDR more robust as well as faster.

semvijverberg commented 1 year ago

Cool! I stumbled upon this method a long time ago thinking I should revisit it, but I completely forgot!

jannesvaningen commented 1 year ago

Okay guys, I have spent some time on this. HDBSCAN is in principle an improvement over DBSCAN, but I'm not really sure yet whether it is a real improvement for us. I'll give some explanation here. I can also give a presentation showing a notebook soon.

The main improvement of HDBSCAN over DBSCAN is that it does not use a single lambda parameter (the eps parameter) to determine the number of clusters. Instead, it maximizes the total sum of persistence of the clusters under the constraint that the chosen clusters are non-overlapping. Put less formally: it checks whether splitting one cluster into two yields more 'mass' than keeping it whole. If it does, it splits the cluster; if it doesn't, it keeps it as one. That way, it determines the lambda thresholds itself. (See the condensed-tree figures at https://pberba.github.io/stats/2020/01/17/hdbscan/ for an illustration.)

As promised, the only parameter that needs tuning is the minimum cluster size. It is intuitive to use, because you can state directly that you only want clusters larger than, say, 5 cells. This is arguably better than the eps_km parameter, which requires the user to have a feel for distances in the data. However, although this parameter is easy to use, it can also lead to 'cutoff' scenarios: if the only regions present have fewer than 5 cells, the (default) min_cluster_size of 5 finds no regions at all. So does it lead to more robust clusters? I don't know, to be honest.

I also tested the speed in the notebook and it does not look like HDBSCAN is much faster than DBSCAN. It was actually slower in my case.

We (@semvijverberg and @geek-yang ) have already discussed this a bit, and one way to proceed could be to use HDBSCAN with min_cluster_size set to 2 (the lowest setting) and then apply @BSchilperoort's extra layer of removing areas below min_area_km2. Maybe we could also look at the correlation of the time series between regions, as @semvijverberg has suggested.

BSchilperoort commented 1 year ago

Thanks for exploring this, Jannes! I have some questions!

  1. Did you test it on high resolution data? (instead of the very coarse data we have for testing).
  2. Those plots are nice, but are mostly for data with many more points than what we have. What does HDBSCAN's clusterer.condensed_tree_.plot() look like for the s2s data?
  3. Do the clusters come out basically the same with HDBSCAN?

geek-yang commented 1 year ago

Just saw your post. We discussed the results last Wednesday.

  1. Did you test it on high resolution data? (instead of the very coarse data we have for testing).

Jannes tested it on a larger, higher-resolution dataset, but the results are similar to those with the coarse data.

  2. Those plots are nice, but are mostly for data with many more points than what we have. What does HDBSCAN's clusterer.condensed_tree_.plot() look like for the s2s data?

@jannesvaningen Can you comment on it?

  3. Do the clusters come out basically the same with HDBSCAN?

The clusters are similar in general, though some details differ. But as with DBSCAN, the results are not very robust, especially for points at the edges of clusters.

These methods cluster data based on density, which in practice comes down to differences in distance between points. Since our data is on a structured grid, that makes some cases difficult. We might get more robust results with unevenly distributed data, I guess. Ocean models, for instance, often run on unstructured grids, so maybe we could test our methods on oceanic reanalysis data, e.g. ORAS5 or SODA3.

Anyway, I think HDBSCAN is a nice option to add; at the least it gives the user an alternative.