When/how should RGDR fail if a lag has no significant clusters

BSchilperoort commented 2 years ago

In PR #85 support for analyzing multiple precursor lags w/ RGDR was added. When a lag does not contain any clusters, RGDR will raise an error. However, a discussion was started on when to raise this error: either immediately, as soon as a lag has been analyzed and found to contain no clusters, or once all lags have been analyzed and one or more of them contain no clusters.

There are pros and cons to either.

Why raise an error immediately:

If the clustering takes long and it fails at the first lag, users won't have to wait for it to complete to be notified of the error.

Why raise an error after processing all lags:

Users do not have to iteratively remove lags without clusters.
Users can find "predictability windows", e.g. there is only significant correlation for a 1 month window half a year ahead of the target time.

In the current implementation, RGDR will analyze all lags first, and only then raise an error.
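To make the two options concrete, here is a minimal sketch of both strategies. It is not the s2spy implementation: `analyze_lags`, `find_clusters`, `fail_fast` and `NoClustersError` are hypothetical names used only for illustration, and the clustering step is passed in as a plain callable.

```python
class NoClustersError(ValueError):
    """Hypothetical error type for this sketch; not taken from s2spy."""


def analyze_lags(lags, find_clusters, fail_fast=False):
    """Run a clustering callable for every lag.

    find_clusters is an assumed callable that takes a lag and returns a
    (possibly empty) list of clusters.
    """
    results = {}
    empty_lags = []
    for lag in lags:
        clusters = find_clusters(lag)
        if not clusters:
            if fail_fast:
                # Option 1: stop at the first lag without clusters.
                raise NoClustersError(f"No significant clusters found at lag {lag}.")
            empty_lags.append(lag)
        results[lag] = clusters
    if empty_lags:
        # Option 2: one summary error after all lags have been analyzed.
        raise NoClustersError(f"No significant clusters found at lags {empty_lags}.")
    return results
```

With `fail_fast=False` the user pays for the full run but gets the complete list of empty lags in a single message, which is the behaviour described above.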

semvijverberg commented 2 years ago

I would prefer to raise a single (summary) error after processing all lags.

Another important point is that, for practical use in a pipeline setting, only a warning should be printed.

For example, we have a pipeline (building upon proto), where we predict EU temperature across ~20 target clusters and 12 months (240 target timeseries), searching for precursors at 3 lags. Hence, RGDR is executed 720 times.

Soil moisture is a typical precursor that predominantly finds clusters in summer and not in other seasons. There is nothing wrong with not finding precursor regions; what matters is that the pipeline never breaks (otherwise constructing such a pipeline becomes very tedious and time consuming). In 99% of the cases, when looping over multiple variables (e.g., SST, SM, z500), RGDR will find some precursors for some variable(s) and thus a prediction can be made. But even for the remaining 1%, when absolutely nothing is found, the pipeline should not crash 🙅. Otherwise the idea of building scalable pipelines is lost.
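A rough sketch of the kind of loop meant here, assuming a hypothetical `find_precursors(target, variable)` callable that returns a (possibly empty) list of precursor regions. None of these names come from s2spy or from the actual pipeline. The point is only that an empty result is recorded and skipped rather than treated as a failure.

```python
import warnings


def run_pipeline(targets, variables, find_precursors):
    """Collect precursor regions per target without ever aborting the loop."""
    predictable = {}
    skipped = []
    for target in targets:
        found = {}
        for variable in variables:
            regions = find_precursors(target, variable)
            if regions:
                found[variable] = regions
        if found:
            predictable[target] = found
        else:
            # Nothing found for any variable: report it and move on.
            warnings.warn(
                f"No precursor regions found for target {target!r}; skipping."
            )
            skipped.append(target)
    return predictable, skipped
```

The returned `skipped` list gives a summary of the targets for which no prediction can be made, without interrupting the other runs.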

geek-yang commented 2 years ago

Then I think we can simply change errors to warnings, to ensure that a heavy job will not be interrupted by the error.
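As an illustration of that change (not the actual code in the PR), the sketch below emits a single summary warning through Python's standard `warnings` module. `NoClustersWarning` and `check_lags` are hypothetical names. Using a dedicated warning category means users who do want a hard failure can still opt in with a warning filter.

```python
import warnings


class NoClustersWarning(UserWarning):
    """Hypothetical warning category for this sketch; not taken from s2spy."""


def check_lags(cluster_counts):
    """Warn once, listing every lag whose cluster count is zero.

    cluster_counts is an assumed mapping of lag -> number of clusters found.
    """
    empty = [lag for lag, n in cluster_counts.items() if n == 0]
    if empty:
        warnings.warn(
            f"No significant clusters found at lags {empty}.",
            NoClustersWarning,
            stacklevel=2,
        )
    return empty


# Default behaviour: a warning is reported and execution continues.
check_lags({1: 3, 2: 0})

# Opt-in strict behaviour: escalate this category to an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error", NoClustersWarning)
    try:
        check_lags({1: 3, 2: 0})
    except NoClustersWarning as exc:
        print(f"strict mode: {exc}")
```

This keeps heavy jobs running by default while leaving the "fail on empty lags" behaviour available to anyone who needs it.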

Changes are made in PR #93.

AI4S2S / s2spy

When/how should RGDR fail if a lag has no significant clusters #86