AI4S2S / s2spy

A high-level Python package integrating expert knowledge and artificial intelligence to boost (sub)seasonal forecasting
https://ai4s2s.readthedocs.io/
Apache License 2.0

Support for multiple lags in RGDR #85

Closed BSchilperoort closed 2 years ago

BSchilperoort commented 2 years ago

This PR adds support for providing multiple lags to RGDR.

Example:

>>> precursor_field = field_resampled.sst.isel(i_interval=slice(1,5)) # Multiple lags: 1 through 4
>>> rgdr = RGDR(min_area_km2=3000**2)
>>> clustered_data = rgdr.fit_transform(precursor_field)
>>> clustered_data.cluster_labels
<xarray.DataArray 'cluster_labels' (cluster_labels: 6)>
'lag:1_cluster:-2' 'lag:1_cluster:1' ... 'lag:3_cluster:-1' 'lag:4_cluster:-2'
Coordinates:
  * cluster_labels  (cluster_labels) <U20 'lag:1_cluster:-2' ... 'lag:4_clust...
    latitude        (cluster_labels) float64 36.05 29.44 37.33 29.58 38.14 39.78
    longitude       (cluster_labels) float64 223.9 185.4 221.8 190.2 217.8 219.3

Note: when plotting data, the user needs to provide the lag they want to see (unless there is only a single lag).

Additionally, I refactored the DBSCAN implementation into more manageable chunks.


BSchilperoort commented 2 years ago

> Awesome! I'm just wondering if there's an easy way to extract clusters for a certain lag after applying RGDR, e.g. such that you could do: clustered_data.sel(lag=1). I realize that not all lags have the same number of clusters, so it's not as easy as stacking them along "lag" dimension though. Unless we just fill them with NaNs... What do you think?

As you said, not all lags have the same number of clusters. Additionally, clusters sharing a label do not necessarily represent the same physical regions. I feel like making the cluster labels a dimension along with lag would kind of imply that.

If we want to support this kind of selection we could create a utility function, but I think that the current way of flattening is required to be able to continue with fitting a model, or to be able to put RGDR in a pipeline.
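For illustration, such a utility function could be sketched roughly as follows. This is only a minimal sketch, not part of this PR or of s2spy: the names `parse_label` and `labels_for_lag` are hypothetical, and the label format `lag:<int>_cluster:<int>` is assumed from the example output in the PR description. It works on the flattened labels, so the output of RGDR itself stays unchanged.

```python
import re

# Assumed label format, taken from the example output in the PR description:
# "lag:<int>_cluster:<int>", e.g. "lag:1_cluster:-2".
LABEL_PATTERN = re.compile(r"^lag:(-?\d+)_cluster:(-?\d+)$")


def parse_label(label):
    """Split a label such as 'lag:1_cluster:-2' into (lag, cluster)."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unexpected cluster label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def labels_for_lag(labels, lag):
    """Return the labels that belong to a single lag, preserving their order."""
    return [label for label in labels if parse_label(str(label))[0] == lag]


# Labels copied from the example output (only four of the six are shown there).
example_labels = [
    "lag:1_cluster:-2",
    "lag:1_cluster:1",
    "lag:3_cluster:-1",
    "lag:4_cluster:-2",
]
print(labels_for_lag(example_labels, lag=1))
# ['lag:1_cluster:-2', 'lag:1_cluster:1']
```

With the xarray output, the selected labels could then presumably be passed on as `clustered_data.sel(cluster_labels=labels_for_lag(clustered_data.cluster_labels.values, lag=1))`. Because this only filters the flattened labels, it does not suggest that clusters with the same number at different lags cover the same region.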

Peter9192 commented 2 years ago

> clusters sharing a label do not necessarily represent the same physical regions. I feel like making the cluster labels a dimension along with lag would kind of imply that

That's a convincing point.

> If we want to support this kind of selection we could create a utility function

I agree. Let's see if there's demand for that.

sonarcloud[bot] commented 2 years ago

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: 94.3%
Duplication: 0.0%