MESMER-group / mesmer

spatially-resolved ESM-specific multi-scenario initial-condition ensemble emulator
https://mesmer-emulator.readthedocs.io/en/latest/
GNU General Public License v3.0

Reconsider try Cholesky approach #413

Open veni-vidi-vici-dormivi opened 6 months ago

veni-vidi-vici-dormivi commented 6 months ago

We now use the Cholesky decomposition for finding the localization radius and for drawing innovations (see #408). When the covariance matrix is not positive definite we fall back to eigh(). Depending on how often this is actually the case, we could switch to using Cholesky only, and if it fails, require the user to provide an alternative method.
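For illustration, a minimal sketch of what such a try-Cholesky-with-fallback could look like (the helper name is hypothetical, not existing mesmer API):

```python
import numpy as np
import scipy.stats


def decompose_covariance(covariance):
    # hypothetical helper: try the (fast) Cholesky factorization first and fall
    # back to an eigendecomposition if the matrix is not positive definite
    try:
        return scipy.stats.Covariance.from_cholesky(np.linalg.cholesky(covariance))
    except np.linalg.LinAlgError:
        # eigh also works for positive *semi*-definite (singular) matrices
        return scipy.stats.Covariance.from_eigendecomposition(np.linalg.eigh(covariance))
```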

Another add-on: localizing the covariance matrix ensures positive semi-definiteness (see #402 for an explanation) but not positive definiteness, which is needed for Cholesky. We could check already in _localized_covariance whether cov is positive definite and set the decomposition method there. We could implement several options here for the user:

1. give the user the option to enforce positive definiteness of the adjusted covariance matrix and throw an error if it is not. However, this is not overly useful I think, because what is the user going to do then?
2. warn if the adjusted covariance is not positive definite and set the decomposition method for drawing emulations to eigh() or some user-given option
3. in any case, give the option to choose the decomposition method when drawing emulations in _auto_regression

In combination with this, it might make sense to completely factor out all covariance wrangling into its own module... but then hardly anything would be left in _localized_covariance... Maybe the best option would be to make a covariance_utils module to be used in both _localized_covariance and _auto_regression. Some refactoring could also be done in _auto_regression, see #165.

veni-vidi-vici-dormivi commented 5 months ago

Note to myself: we also use Cholesky for finding the localization radius. Here we don't need to think about exposing a method choice to the user, because the covariance matrix has to be positive definite for the logpdf anyway, so Cholesky is the best choice since it is the fastest and stable. I wondered whether there might be something else about Cholesky that makes it less robust than eigh and svd, as discussed in this comment, but I don't think so. So we can leave this as is and just think about whether we want the user to be able to choose a method for drawing emulations.

mathause commented 5 months ago

For context, this happens in

https://github.com/MESMER-group/mesmer/blob/79c8ee01317528a54e05a330b25c11254105aac8/mesmer/stats/_auto_regression.py#L485-L496

veni-vidi-vici-dormivi commented 1 month ago

I went over this again and here is what I learned:

When fitting for an appropriate localized covariance matrix, we use the negative log-likelihood, i.e. the logpdf evaluated for the training data, to estimate which covariance matrix is suitable. Strictly mathematically speaking, the pdf of a multivariate normal only exists when the covariance matrix is positive definite (= only positive eigenvalues); this is called the non-degenerate case. However, for statistical purposes, the pdf can still be computed using the pseudo-determinant and pseudo-inverse, throwing away all zero eigenvalues. scipy.stats.multivariate_normal supports this by setting the argument allow_singular=True. We don't do this at the moment, but after reading up on it I think we should, see the side note below. Then we would also need to provide an option to not use Cholesky for the training / for the user to provide a method.
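As a small illustration of the degenerate case and of allow_singular (the covariance matrix below is a made-up toy example, not mesmer output):

```python
import numpy as np
import scipy.stats

# a positive semi-definite but singular covariance matrix: the third variable
# is the mean of the first two, so one eigenvalue is exactly zero
cov = np.array(
    [
        [2.0, 1.0, 1.5],
        [1.0, 2.0, 1.5],
        [1.5, 1.5, 1.5],
    ]
)
print(np.linalg.eigvalsh(cov))  # one eigenvalue is (numerically) zero

x = np.zeros(3)

# the non-degenerate pdf does not exist, so this raises an error:
# scipy.stats.multivariate_normal.logpdf(x, cov=cov)

# with allow_singular=True scipy uses the pseudo-determinant and pseudo-inverse
# and the logpdf can be evaluated
print(scipy.stats.multivariate_normal.logpdf(x, cov=cov, allow_singular=True))
```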

Side note: the eigenvalues of the covariance matrix basically represent a transformation of the multivariate case into several linearly independent univariate cases, because the covariance matrix is transformed into a form where all covariances between distinct variables (off the diagonal) become zero and only the variances of the variables themselves (on the diagonal) remain. It thus makes sense that the eigenvalues cannot be negative, since variances are never negative (unlike covariances). I am not sure whether an empirical covariance matrix is always guaranteed to have only non-negative eigenvalues, but in either case applying the localizer guarantees that the eigenvalues are all non-negative. However, eigenvalues can still be zero, meaning that some variables may depend linearly on each other, called the degenerate case. Most of this is from Wikipedia. I think in our case this means that some gridpoints may depend linearly on each other, which from a physical perspective I deem rather unlikely but possible - something like a point in the Alps might always be around 20% cooler than a nearby gridpoint in the lowlands, maybe? Thus, I would say a positive semi-definite covariance matrix is not necessarily unphysical. As far as I can tell this is also supported by the section on localizing covariances in this paper that either Shruti or Lea cited, because they don't mention otherwise.
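To make the decorrelation point concrete, a sketch with a made-up two-variable example (not mesmer data):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[2.0, 1.2], [1.2, 1.0]], size=100_000)

cov = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# projecting the data onto the eigenvectors decorrelates the variables: the
# covariance of the rotated data is (numerically) diagonal, with the
# eigenvalues as the remaining variances
rotated = data @ eigenvectors
print(np.cov(rotated, rowvar=False))  # off-diagonal entries ~ 0
print(eigenvalues)                    # match the diagonal above
```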

This happens here: https://github.com/MESMER-group/mesmer/blob/188881d30ba82765ad506d8bd97b405c44e6d120/mesmer/stats/_localized_covariance.py#L332-L333

At the moment, the only case where the chosen localized covariance matrix might be singular (= contain eigenvalues equal to zero) is when the very first localization radius already leads to a singular matrix, in which case we actually return the last localization radius, which we then don't check for singularity. This is a bug that needs fixing in any case. See here: https://github.com/MESMER-group/mesmer/blob/188881d30ba82765ad506d8bd97b405c44e6d120/mesmer/stats/_localized_covariance.py#L288-L297

And here: https://github.com/MESMER-group/mesmer/blob/188881d30ba82765ad506d8bd97b405c44e6d120/mesmer/core/utils.py#L58-L75

This part is also a bit messy because the warning "Singular matrix for localization_radius of {localization_radius}. Skipping this radius." is now triggered by scipy.stats.Covariance.from_cholesky(np.linalg.cholesky(covariance)) and no longer by scipy.stats.multivariate_normal.logpdf(data, cov=cov). While the statement is still true, we might want to change it to the same warning we use when Cholesky throws an error in the drawing method here. Also, my understanding is that we do not actually skip this radius when this happens, but stop the fitting and return the previous radius.

All in all, I think we should allow a singular matrix and use eigh() as a fallback. Moreover, we should take a close look at the minimize_local_discrete function again, streamline the warnings, and possibly already include #312 in this (and do #35 before?). I don't think exposing a method argument to the user is meaningful, though, if eigh() is quick and robust and the only other option would be svd, which is the slowest.

mathause commented 1 month ago

Thanks for looking into this again. I am a bit unsure about the fallback option

  1. Could it be a problem if we switch back and forth between cholesky and eigh (in _get_neg_loglikelihood for the different localisation radii)? Are the computed log-likelihoods consistent with each other? (I.e. could the cholesky loglikelihood be twice as big as the eigh for the same radius?) Or maybe this is just fine...
  2. Is it an issue if we use eigh for the log likelihood, but then are able to use cholesky for the multivariate normal (is that even an ~option~ possible?)

I think we use the localization radius before the LinAlgError because all subsequent radii also lead to an error. But maybe I misremember, or this is no longer correct for cholesky (or eigh).

So I currently have a small preference for erroring out and adding an argument - not because someone would want to choose svd, but to make it explicit, in the sense of "for this model we use eigh". But I may be overthinking this and can be convinced otherwise.

veni-vidi-vici-dormivi commented 1 month ago

> Could it be a problem if we switch back and forth between cholesky and eigh

Good point, I did not think about that. In theory, the value of the log-likelihood should be invariant to the decomposition method. In practice, it can lead to numerically different results, but the difference should be small. I would say small enough not to influence the choice of the radius.
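A quick standalone check (random data and an arbitrary positive definite covariance, not a mesmer case) suggests the two decompositions give essentially identical log-likelihoods:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
data = rng.standard_normal((100, 50))
# add the identity so the covariance is safely positive definite
cov = np.cov(data, rowvar=False) + np.eye(50)

cov_chol = scipy.stats.Covariance.from_cholesky(np.linalg.cholesky(cov))
cov_eigh = scipy.stats.Covariance.from_eigendecomposition(np.linalg.eigh(cov))

ll_chol = scipy.stats.multivariate_normal.logpdf(data, cov=cov_chol)
ll_eigh = scipy.stats.multivariate_normal.logpdf(data, cov=cov_eigh)

# in practice the difference is tiny compared to the log-likelihood values themselves
print(np.max(np.abs(ll_chol - ll_eigh)))
```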

> Is it an issue if we use eigh for the log likelihood, but then are able to use cholesky for the multivariate normal (is that even an ~option~ possible?)

This should be possible because the covariance matrix itself is not influenced by the decomposition method. The only thing is that it could be singular, and then we would need the fallback option.

> I think we use the localization radius before the LinAlgError because all subsequent radii also lead to an error. But maybe I misremember, or this is no longer correct for cholesky (or eigh).

Ah, that is good to know. I will check this, but it makes sense, because a larger radius means less damping of the correlations, which makes linear dependencies more likely. However, it is still a bug that if the first radius already leads to a singular matrix, we return the largest radius, no? Now that I think about it more, it is probably the best option, but we should probably be more verbose about it.

> make it explicit, in the sense of - for this model we use eigh

I see what you mean. I think using eigh for the whole model is maybe cleanest, but maybe displaying a warning "encountered singular covariance matrix - switching to eigh()" would suffice? Given what you said above, that every subsequent radius also leads to a singular matrix, we would only switch once. It would be nice to have an example and see whether the covariance matrix that is still positive definite is maybe the better one anyway...

mathause commented 1 month ago

> However, it is still a bug that if the first radius already leads to a singular matrix, we return the largest radius, no?

Yes - I did not consider this case yesterday. I think what we have to do is turn L77 into an error:

https://github.com/MESMER-group/mesmer/blob/188881d30ba82765ad506d8bd97b405c44e6d120/mesmer/core/utils.py#L77

Generally ignoring inf in a minimization routine is fine, but if all of them are inf it should probably raise an error. So, good catch!
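For illustration, a simplified sketch of what such a guard could look like - this is not the actual mesmer.core.utils implementation (which iterates over the radii and can stop early), and the function name and error message are made up:

```python
import numpy as np


def _minimize_local_discrete_sketch(func, sequence):
    # simplified illustration: evaluate `func` for each candidate value, ignore
    # non-finite results, and raise if *all* of them are inf
    results = {value: func(value) for value in sequence}

    finite = {value: res for value, res in results.items() if np.isfinite(res)}
    if not finite:
        raise ValueError("`func` returned inf for every value in `sequence`")

    # return the candidate with the smallest (finite) result
    return min(finite, key=finite.get)
```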


Indeed switching the decomposition does not seem to change the nll - running the code snippet below with and without the diff results in the same nll.

```python
from mesmer.stats._localized_covariance import _ecov_crossvalidation
import numpy as np

localizer = {250: np.diag(np.ones(50))}

np.random.seed(0)
arr = np.random.randn(100, 50)
arr[:3, :3]

weights = np.ones(100)

_ecov_crossvalidation(250, data=arr, weights=weights, localizer=localizer, k_folds=3)
```

```diff
diff --git a/mesmer/stats/_localized_covariance.py b/mesmer/stats/_localized_covariance.py
index 0a9ed17..b7fed1d 100644
--- a/mesmer/stats/_localized_covariance.py
+++ b/mesmer/stats/_localized_covariance.py
@@ -329,7 +329,7 @@ def _get_neg_loglikelihood(data, covariance, weights):
     # NOTE: 90 % of time is spent in multivariate_normal.logpdf - not much point
     # optimizing the rest
-    cov = scipy.stats.Covariance.from_cholesky(np.linalg.cholesky(covariance))
+    cov = scipy.stats.Covariance.from_eigendecomposition(np.linalg.eigh(covariance))
 
     log_likelihood = scipy.stats.multivariate_normal.logpdf(data, cov=cov)
 
     # logpdf can return a scalar, which np.average does not like
```

@veni-vidi-vici-dormivi can you try the same with a non-trivial localizer to confirm? See

https://github.com/MESMER-group/mesmer/blob/188881d30ba82765ad506d8bd97b405c44e6d120/tests/unit/test_localized_covariance.py#L251


> I see what you mean. I think using eigh for the whole model is maybe cleanest, but maybe displaying a warning "encountered singular covariance matrix - switching to eigh()" would suffice?

Given the above tests, I think it's ok to do. My last reservation is that it may lead to very ugly code (how do we avoid checking Cholesky first for every iteration?). But maybe this can be done elegantly... (And code elegance is not a good reason to not do something.)