LinkedEarth / Pyleoclim_util

Python Package for the Analysis of Paleoclimate Data. Documentation at
https://pyleoclim-util.readthedocs.io
GNU General Public License v3.0
88 stars 33 forks source link

`SsaRes.screeplot` fails when nans are present in the original series #312

Closed alexkjames closed 1 year ago

alexkjames commented 1 year ago

Describe the bug When ssa is applied to a series with missing values represented as nans, screeplot fails

To Reproduce See attached zip file for notebook recreating the problem Screeplot_Nan_Issue.zip

Expected behavior Screeplot should work in all cases

Additional context The problem seems to be due to the computation of negative eigenvalues during SSA, which itself is the product of computing the correlation matrix of a series with missing values. This seems to be a known issue, though the solution that would best fit our needs is currently unclear. Suggestions are welcome!

CommonClimate commented 1 year ago

Indeed, it smells like a an issue with a singular sample covariance matrix, which is very common once missing values arise. Suggestion: after the sample covariance matrix is assembled (C = toeplitz(c[0:M])), apply the Rao-Blackwellized Ledoit-Wolf estimator described by Chen et al.(2009) and implemented in the covar package . It would look like:

Creg, g  = covar.cov_shrink_rblw(C,n=dof)  
print('Optimal Covariance Shrinkage: {:3.2f}'.format(g)) 

where dof is the number of degrees of freedom (M-1 here, I think) It's important to keep the shrinkage factor g around as a diagnostic. I can imagine very ill-posed imputation problems necessitating a crazy high shrinkage, and you'd know right away from that diagnostic alone that it's not worth going much further. Then again, one could hope for the best...

The biggest issue I see with this is that the package is fairly old ,and may become deprecated soon. It works with Python 3.9, but i have not tried it with Python 3.10.

CommonClimate commented 1 year ago

As an added precaution, I would check for eigenvalue positivity and only export the columns of D, V that correspond to D>0.

alexkjames commented 1 year ago

The function covar.cov_shrink_rblw looks fairly lightweight. If its useful enough, we could consider just scavenging the code (with appropriate attribution) and including it in pyleoclim to avoid dependency conflicts in the future. The whole package covar is essentially just that function.

CommonClimate commented 1 year ago

Sounds like a plan, but let's first see if it helps on the data you submitted, and perhaps a wider suite (SISAL?). I won't get to this right now, so feel free to ping me next week.

Mattriks commented 1 year ago

Other options here are:

CommonClimate commented 1 year ago

Thanks for these @Mattriks, very useful suggestions.

CommonClimate commented 1 year ago

I have reached out to Adam Mieldzioc about this, but have not heard back. Since PR #321 addresses the problem for now, I will close the issue. Note for the future that we could also re-use this code for SSA: https://github.com/aj-cloete/pssa