AI4S2S / s2spy

A high-level Python package integrating expert knowledge and artificial intelligence to boost (sub)seasonal forecasting
https://ai4s2s.readthedocs.io/
Apache License 2.0

Support grouping over splits in `rgdr` #58

Open BSchilperoort opened 2 years ago

BSchilperoort commented 2 years ago

Due to computational limits (applying DBSCAN to every individual train/test split might not be viable), we want to allow users to group splits in RGDR before calculating the DBSCAN clusters.

To do this we need to go through the following steps:

  1. Calculate the correlation coefficient and p-value for every fold (see #57)
  2. Determine the p-value mask for every individual split (training data only)
  3. Reduce this mask over the split dimension with `np.any`
  4. Apply DBSCAN to the reduced mask
  5. Recombine the DBSCAN clusters with each split's mask (e.g. for each split's cluster labels: `cluster_labels[~split_mask] = 0.0`)

This way we end up with clusters for each split, and the cluster labels are aligned across splits.
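
Steps 2 to 5 could be sketched roughly as follows. This is only an illustration, not the `rgdr` implementation: the function name `cluster_over_splits`, the `(n_splits, n_lat, n_lon)` array layout, the `alpha`, `eps` and `min_samples` defaults, and the use of Euclidean distance on grid indices are all assumptions made for the example. The p-values are taken as given (step 1).

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_over_splits(p_values, alpha=0.05, eps=1.5, min_samples=2):
    """Hypothetical helper: group splits before clustering.

    p_values: array of shape (n_splits, n_lat, n_lon), computed on each
    split's training data. Returns integer cluster labels of the same
    shape, where 0 means "not part of any cluster".
    """
    # Step 2: significance mask for every individual split.
    split_masks = p_values < alpha

    # Step 3: keep a grid cell if it is significant in any split.
    reduced_mask = np.any(split_masks, axis=0)

    # Step 4: run DBSCAN once, on the grid indices of the kept cells.
    # Assumption: Euclidean distance in index space, for brevity.
    points = np.argwhere(reduced_mask)
    labels_2d = np.zeros(reduced_mask.shape, dtype=int)
    if len(points) > 0:
        db_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
        # DBSCAN marks noise as -1 and clusters as 0, 1, ...; shifting by
        # one makes 0 mean "no cluster" and clusters start at 1.
        labels_2d[reduced_mask] = db_labels + 1

    # Step 5: give every split the shared labels, then blank out the
    # cells that are not significant in that particular split.
    cluster_labels = np.repeat(labels_2d[np.newaxis], len(split_masks), axis=0)
    cluster_labels[~split_masks] = 0
    return cluster_labels


# Toy data: two splits on a 4 x 6 grid.
p_values = np.ones((2, 4, 6))
p_values[0, 0:2, 0:2] = 0.01  # split 0: left region
p_values[1, 0:2, 1:3] = 0.01  # split 1: left region, shifted one column
p_values[:, 2:4, 4:6] = 0.01  # both splits: right region
labels = cluster_over_splits(p_values)
```

In this toy case the left region is found as a single cluster even though the two splits only partly overlap, so both splits carry the same label for it, and each split keeps only the cells that are significant in its own mask.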

geek-yang commented 2 years ago

Based on the discussion in issue #71, we will only provide an iterator that lets the user walk through all the splits. This gives them the flexibility to perform RGDR (or even a complete ML workflow) on each split. We can discuss further whether we need a function for "grouping over splits", but at the very least we can provide a notebook that shows this as a use case.