How do we merge clusters over different training splits

jannesvaningen commented 2 years ago

One of the main reasons we couldn't use s2spy during the Lorenz workshop was that the labels of the precursor regions weren't aligned over the training splits. When using RGDR to identify precursor regions of interest, we use the following procedure:

Find correlating grid cells (rgdr.get_correlation). This gives the correlation value and the p-value for each cell.
Use scikit-learn's DBSCAN (rgdr.get_clusters) to identify clusters of significantly correlating grid cells that lie close to each other. We can set the alpha value and tune DBSCAN parameters such as distance_eps and min_area_km2.

Until now, we haven't thought about a way to align the areas over training splits. The example below shows that it is not trivial to match these areas. Here I have used 4 splits over the data in /tests to look at the clustered regions over the splits. I have adapted the plotting function in rgdr a bit to get the same colorbars for every figure. Note:

The resolution is quite low; a higher resolution might decrease the ambiguity.
The splitting is not very 'random': the first plot misses 1980-1989 because those years are test data, the second plot misses 1990-1999, and so on. With fewer and more randomly chosen test years, the clusters found over the splits can be expected to be more similar.

We could come up with some algorithm that mimics what we would identify as one cluster by eye. There are some things to consider:

What rule do we use to determine whether areas in different splits are the same area? We have discussed distance-based rules and a rule based on overlapping areas, but I doubt whether either of them would work in the example above.
The parameters of DBSCAN matter a lot for identifying the clusters.
How do we communicate the decisions we make to the user?
How much flexibility do we want to give the user? Do we go for one implementation that we know works most of the time, or do we let the user change clusters if the result is not satisfactory? I know Sem sometimes merges clusters of which he knows from expert knowledge that they should belong together.

geek-yang commented 2 years ago

Thanks for raising this issue. In my opinion, this task really needs expert knowledge, as in some cases the areas might not overlap (due to p-value masks) but can still belong to the same group (e.g. areas in the Pacific that form the same SST horseshoe pattern but are separated due to p-values). Fortunately, since you normally won't have hundreds of train splits, it is reasonable for the user to apply their expert knowledge and adjust some labels manually.

So I think what we should do here is provide a baseline that gives plausible results and doesn't require too much correction from the user if they dislike it. It cannot be perfect, because it depends heavily on the outcome of RGDR, but the results should make sense for as many cases as possible.

We can design an algorithm that at least labels the easy cases correctly for the user. We can use the area comparison method suggested by @Peter9192. This can be a utility function that takes the clustered maps (e.g. a list of maps) as input, for example rgdr.align_labels(cluster_maps, overlap_area = 0.5).

geek-yang commented 2 years ago

About the algorithm, here are my thoughts:

Initialize a pattern list with patterns from one cluster map within the given maps list (e.g. we can pick a cluster map covering the largest area)
Loop through the clustered maps and compare the labelled areas with the patterns we have in the list. If the overlapping area is larger than the threshold, give the same label
If not, add this pattern to the pattern list and give it a new label
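A minimal sketch of these steps, just to make the idea concrete. Everything in it is an assumption and not existing s2spy code: the name align_labels, the overlap_threshold parameter, cluster maps given as integer NumPy arrays with 0 meaning "no cluster", and overlap measured as the shared fraction of the smaller region in grid cells (not km2).

```python
import numpy as np


def align_labels(cluster_maps, overlap_threshold=0.5):
    """Relabel clusters so that overlapping clusters share a label across splits.

    cluster_maps: list of 2D integer arrays of equal shape; 0 means "no cluster".
    overlap_threshold: minimum shared fraction of the smaller region (in cells).
    """
    # Visit the map whose clusters cover the most grid cells first.
    order = sorted(range(len(cluster_maps)),
                   key=lambda i: np.count_nonzero(cluster_maps[i]),
                   reverse=True)
    patterns = []  # boolean masks; pattern k corresponds to aligned label k + 1
    aligned = [None] * len(cluster_maps)

    for i in order:
        cmap = cluster_maps[i]
        out = np.zeros_like(cmap)
        for lab in np.unique(cmap[cmap != 0]):
            mask = cmap == lab
            best_label, best_overlap = None, 0.0
            for k, pattern in enumerate(patterns):
                shared = np.count_nonzero(mask & pattern)
                overlap = shared / min(mask.sum(), pattern.sum())
                if overlap > best_overlap:
                    best_label, best_overlap = k + 1, overlap
            if best_label is None or best_overlap < overlap_threshold:
                # No sufficiently overlapping pattern yet: register a new one.
                patterns.append(mask.copy())
                best_label = len(patterns)
            out[mask] = best_label
        aligned[i] = out
    return aligned
```

In this sketch a pattern is fixed once it is added, so the result depends on which map is visited first, and two clusters in one map that both overlap the same pattern end up with the same label. Both choices would need discussion before anything like this goes into rgdr.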

Peter9192 commented 2 years ago

Thanks for opening the issue and describing it so clearly. Looking at this example, I agree that it is not trivial to "align" the clusters, as there doesn't seem to be an obvious alignment even by eye. As you say, it may be different for other use cases/examples, but if we want to come up with something general, perhaps we should take a step back first.

Instead of a function like align_clusters, could we perhaps create a function to score the robustness of the clusters? So given a series of cluster maps, calculate some diagnostics, such as:

What is the average number of clusters in a cluster map, and how big is the variance around that mean?
What is the average cluster size and how big is the variance around that size?
Are there any cells that never occur in a cluster? How many, and where are they?
Are there any cells that always occur in a cluster?
If a single cluster map consists of 1's and 0's, can we add all maps together to get a compound cluster map? Do we see hotspots on that map that we could treat as overall clusters? E.g. a number of connected grid cells with score >=3.
Similarly, if we calculate the cluster centers, and plot them all on a single map, do we see "clusters of cluster centers" appear?
...
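A rough sketch of how some of these diagnostics could be computed. The function name cluster_diagnostics, the min_count parameter and the input format (a list of integer NumPy arrays with 0 meaning "no cluster") are assumptions for illustration, not part of s2spy. Cluster sizes are counted in grid cells rather than km2, and the cluster-centre idea is left out.

```python
import numpy as np


def cluster_diagnostics(cluster_maps, min_count=3):
    """Summarise how stable clusters are across training splits.

    cluster_maps: list of 2D integer arrays of equal shape; 0 means "no cluster".
    min_count: cells clustered in at least this many splits count as hotspots.
    """
    stack = np.stack(cluster_maps)
    # Turn every map into 1's and 0's and add them up: the compound cluster map.
    compound = (stack != 0).sum(axis=0)

    # Number of distinct clusters per map, and size (in cells) of every cluster.
    n_clusters = np.array([len(np.unique(m[m != 0])) for m in cluster_maps])
    sizes = np.concatenate(
        [np.unique(m[m != 0], return_counts=True)[1] for m in cluster_maps]
    )

    return {
        "n_clusters_mean": n_clusters.mean(),
        "n_clusters_var": n_clusters.var(),
        "size_mean": sizes.mean() if sizes.size else float("nan"),
        "size_var": sizes.var() if sizes.size else float("nan"),
        "never_clustered": compound == 0,
        "always_clustered": compound == len(cluster_maps),
        "compound_map": compound,
        "hotspots": compound >= min_count,
    }
```

The hotspots mask only flags individual cells. Grouping connected hotspot cells into overall clusters would need an extra connected-component step that is not shown here.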

Only if we are able to judge whether the clusters are robust, can we start thinking of 'merging' or 'aligning' them.

jannesvaningen commented 2 years ago

Thanks for the comments on this issue @Peter9192 and @geek-yang.

I like the suggestion of Peter to run some diagnostics over the clusters. It would be very cool if in the end we can have a sort of final map showing clusters with shaded colours over gridcells (the darker the more robust) so you can see that in some splits you have found some significantly correlating gridcells but not in others.

This even sparks the idea that, in the end, you could use the time series of all the regions, weighted by how often each region is found across the splits.
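To make the weighting idea concrete, here is a small sketch. It assumes the labels have already been made consistent across splits (so label 1 means the same region in every map), and the name region_weights is hypothetical. The weight of a region is simply the fraction of splits in which it appears.

```python
import numpy as np


def region_weights(aligned_maps):
    """Return {label: fraction of splits in which that label occurs}.

    aligned_maps: list of 2D integer arrays; 0 means "no cluster", and equal
    labels are assumed to refer to the same region in every split.
    """
    n_splits = len(aligned_maps)
    counts = {}
    for cmap in aligned_maps:
        # Count each label at most once per split.
        for label in np.unique(cmap[cmap != 0]):
            counts[int(label)] = counts.get(int(label), 0) + 1
    return {label: n / n_splits for label, n in counts.items()}
```

These weights could then scale the regional time series before they go into a model. Whether that helps is an open question.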

I'll continue with a simple method that compares regions for now, like the align_cluster function proposed by Yang and see what we get.

AI4S2S / s2spy

How do we merge clusters over different training splits #101