AI4S2S / s2spy

A high-level python package integrating expert knowledge and artificial intelligence to boost (sub) seasonal forecasting
https://ai4s2s.readthedocs.io/
Apache License 2.0

How do we merge clusters over different training splits #101

Open jannesvaningen opened 2 years ago

jannesvaningen commented 2 years ago

One of the main reasons we couldn't use s2spy during the Lorenz workshop was that the labels of the precursor regions weren't aligned across the training splits. When using RGDR to identify precursor regions of interest, we use the following procedure:

  1. Find correlating grid cells (rgdr.get_correlation); this gives the correlation value and the p-value for each grid cell.
  2. Use scikit-learn's DBSCAN (rgdr.get_clusters) to identify clusters of significantly correlating grid cells that lie close to each other. We can set the alpha value and tune clustering parameters such as distance_eps and min_area_km2.

Until now, we haven't thought about a way to align the areas across training splits. The example below shows that it is not trivial to match these areas. Here I have used 4 splits over the data in /tests to look at the clustered regions in each split. I have adapted the plotting function in rgdr a bit to get the same colorbars for every figure. Note:

We could come up with some algorithm that mimics what we would identify as one cluster by eye. There are some things to consider:

geek-yang commented 2 years ago

Thanks for raising this issue. In my opinion, this task really needs expert knowledge, as in some cases the areas might not overlap (due to p-value masks) but can still belong to the same group (e.g. areas in the Pacific that form the same SST horseshoe pattern but are separated due to p-values). Fortunately, given that you normally won't have hundreds of training splits, it is reasonable for users to apply their expert knowledge and adjust some labels manually.

So I think what we are doing here is providing a reasonable baseline that gives plausible results and doesn't require much correction from users if they dislike it. It cannot be perfect, as we depend heavily on the output of RGDR, but the results should make sense for as many cases as possible.

We can design an algorithm to at least label the easy cases correctly for the user. We can use the area comparison method suggested by @Peter9192. This can be a utility function that takes the clustered maps (e.g. a list of maps) as input, for example rgdr.align_labels(cluster_maps, overlap_area=0.5).
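To make the idea concrete, here is a minimal sketch of what such a utility could look like. This is not existing s2spy code: the function body, the choice of the first split as reference, the use of grid-cell counts instead of areas in km2, and the assumption that labels are positive integers with 0 meaning "no cluster" are all illustrative assumptions.

```python
import numpy as np


def align_labels(cluster_maps, overlap_area=0.5):
    """Hypothetical sketch: relabel clusters to match the first map.

    cluster_maps: list of 2D integer arrays on the same grid, 0 = no cluster.
    overlap_area: minimum shared fraction (of the smaller cluster, counted in
        grid cells, not km2) needed to treat two clusters as the same region.
    """
    reference = np.asarray(cluster_maps[0])
    ref_labels = np.unique(reference[reference != 0])
    aligned = [reference.copy()]
    # Clusters without a match in the reference get a fresh label.
    next_label = int(reference.max()) + 1

    for cmap in cluster_maps[1:]:
        cmap = np.asarray(cmap)
        new_map = np.zeros_like(cmap)
        for label in np.unique(cmap[cmap != 0]):
            mask = cmap == label
            best_label, best_frac = None, 0.0
            for ref_label in ref_labels:
                ref_mask = reference == ref_label
                shared = np.logical_and(mask, ref_mask).sum()
                frac = shared / min(mask.sum(), ref_mask.sum())
                if frac > best_frac:
                    best_label, best_frac = ref_label, frac
            if best_label is not None and best_frac >= overlap_area:
                new_map[mask] = best_label
            else:
                new_map[mask] = next_label
                next_label += 1
        aligned.append(new_map)
    return aligned
```

Note that in this sketch two clusters in one split can end up with the same reference label (they are effectively merged), and unmatched clusters are never compared with each other across later splits. Those are exactly the cases where manual adjustment would still be needed.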

geek-yang commented 2 years ago

About the algorithm, here are my thoughts:

Peter9192 commented 2 years ago

Thanks for opening the issue and describing it so clearly. Looking at this example, I agree that it is not trivial to "align" the clusters, as there doesn't seem to be an obvious alignment even by eye. As you say, it may be different for other use cases or examples, but if we want to come up with something general, perhaps we should take a step back first.

Instead of a function like align_clusters, could we perhaps create a function to score the robustness of the clusters? So given a series of cluster maps, calculate some diagnostics, such as:

Only if we are able to judge whether the clusters are robust, can we start thinking of 'merging' or 'aligning' them.
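As a rough illustration of what such a scoring function might compute, the sketch below reports two simple diagnostics: the number of clusters per split and the mean pairwise Jaccard overlap of the clustered areas. Both diagnostics, the function name cluster_robustness, and the convention that 0 means "no cluster" are assumptions chosen for the example, not something that exists in s2spy.

```python
from itertools import combinations

import numpy as np


def cluster_robustness(cluster_maps):
    """Hypothetical sketch of robustness diagnostics over training splits.

    cluster_maps: list of 2D integer arrays on the same grid, 0 = no cluster.
    """
    maps = [np.asarray(m) for m in cluster_maps]
    masks = [m != 0 for m in maps]

    # Diagnostic 1: how many distinct clusters were found in each split.
    n_clusters = [len(np.unique(m[mask])) for m, mask in zip(maps, masks)]

    # Diagnostic 2: Jaccard index (intersection / union) of the clustered
    # area for every pair of splits, ignoring the labels themselves.
    jaccard = []
    for a, b in combinations(masks, 2):
        union = np.logical_or(a, b).sum()
        shared = np.logical_and(a, b).sum()
        # Two empty maps are treated as trivially identical.
        jaccard.append(shared / union if union else 1.0)

    mean_jaccard = float(np.mean(jaccard)) if jaccard else 1.0
    return {"n_clusters": n_clusters, "mean_jaccard": mean_jaccard}
```

A mean Jaccard index close to 1 would suggest that the splits agree on where the significant areas are, while a value close to 0 would suggest that aligning labels is unlikely to be meaningful.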

jannesvaningen commented 2 years ago

Thanks for the comments on this issue @Peter9192 and @geek-yang.

I like the suggestion of Peter to run some diagnostics over the clusters. It would be very cool if in the end we can have a sort of final map showing clusters with shaded colours over gridcells (the darker the more robust) so you can see that in some splits you have found some significantly correlating gridcells but not in others.
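The data behind such a map could be computed roughly as follows. This is only a sketch under the assumption that every split yields a 2D label array on the same grid with 0 meaning "no cluster"; the function name cluster_frequency is made up for the example.

```python
import numpy as np


def cluster_frequency(cluster_maps):
    """Hypothetical sketch: fraction of splits in which each grid cell
    belongs to any cluster (0 = never, 1 = in every split)."""
    in_cluster = np.stack([np.asarray(m) != 0 for m in cluster_maps])
    return in_cluster.mean(axis=0)
```

Plotting the returned array with a sequential colormap would give the shading described above: darker cells are found in more splits.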

This even sparks the idea that, in the end, you could use the time series of all the regions, weighted by how many of the splits each region is found in.
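A minimal sketch of such a weighting, assuming the region labels have already been aligned across splits (the function name region_weights and the input format are hypothetical):

```python
def region_weights(labels_per_split):
    """Hypothetical sketch: weight each region by the fraction of splits
    in which it was found.

    labels_per_split: one iterable of (already aligned) region labels per split.
    """
    n_splits = len(labels_per_split)
    counts = {}
    for labels in labels_per_split:
        # A region counts at most once per split.
        for label in set(labels):
            counts[label] = counts.get(label, 0) + 1
    return {label: count / n_splits for label, count in sorted(counts.items())}
```

For example, region_weights([[1, 2], [1], [1, 3], [1, 2]]) gives region 1 a weight of 1.0, region 2 a weight of 0.5 and region 3 a weight of 0.25.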

I'll continue with a simple method that compares regions for now, like the align_labels function proposed by Yang, and see what we get.