Open veni-vidi-vici-dormivi opened 6 months ago
Note to myself: We also use Cholesky for finding the localization radius. Here we don't need to think about exposing a method choice to the user, because the covariance matrix has to be positive definite for the logpdf anyway, so Cholesky is the best choice: it is fastest and stable. I was wondering whether there might be something else about Cholesky that is less robust than eigh and svd, as discussed in this comment, but I don't think so. So we can leave this as is and only think about whether we want the user to be able to choose a method for drawing emulations.
I went over this again and here is what I learned:
When fitting for an appropriate localized covariance matrix we use the negative log-likelihood, i.e. the logpdf evaluated for the training data, to estimate which covariance matrix is suitable. Strictly mathematically speaking, the pdf of a multivariate normal only exists when the covariance matrix is positive definite (= only positive eigenvalues); this is called the non-degenerate case. However, for statistical purposes, the pdf can still be computed using the pseudo-determinant and pseudo-inverse, discarding all zero eigenvalues. scipy.stats.multivariate_normal supports this via the argument allow_singular=True. We don't do this at the moment, but after reading up on it I think we should, see the side note below. Then we would also need to provide an option to not use Cholesky for the training / for the user to provide a method.
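To make this concrete, here is a minimal sketch (toy matrix, not MESMER code) of how allow_singular behaves for a singular, i.e. positive semi-definite, covariance matrix:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Rank-1, hence singular, covariance matrix (eigenvalues 2 and 0).
cov = np.array([[1.0, 1.0], [1.0, 1.0]])
x = np.array([0.5, 0.5])  # a point inside the degenerate support

# The default (allow_singular=False) rejects the matrix ...
try:
    multivariate_normal.logpdf(x, mean=np.zeros(2), cov=cov)
except (ValueError, np.linalg.LinAlgError):
    print("singular covariance rejected")

# ... while allow_singular=True computes the logpdf from the
# pseudo-determinant and pseudo-inverse, discarding zero eigenvalues.
logp = multivariate_normal.logpdf(x, mean=np.zeros(2), cov=cov, allow_singular=True)
print(np.isfinite(logp))  # True
```

Note that for a singular covariance the density only exists on the affine subspace spanned by the non-zero eigenvectors; points outside it get logpdf of -inf.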
Side note: The eigenvalues of the covariance matrix basically represent a transformation of the multivariate case into several linearly independent univariate cases, because the covariance matrix is transformed into a form where all covariances between distinct variables (off the diagonal) become zero and only the variances of the variables themselves (on the diagonal) remain. It thus makes sense that these cannot be negative, since variances are always positive (unlike covariances). I am not sure whether an empirical covariance matrix is always guaranteed to have only positive eigenvalues, but in either case applying the localizer guarantees that all eigenvalues become non-negative. However, eigenvalues can still be zero, meaning that some variables may depend linearly on each other; this is called the degenerate case. Most of this is from Wikipedia. I think in our case this means that some gridpoints may depend linearly on each other, which from a physical perspective I deem rather unlikely but possible - something like a point in the Alps always being around 20% cooler than the nearby gridpoint in the lowlands, maybe? Thus, I would say a positive semi-definite covariance matrix is not necessarily unphysical. As far as I can tell this is also supported by the section on localizing covariances in this paper that either Shruti or Lea cited, because they don't mention that it would not be.
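A toy example (not MESMER code) of the degenerate case described above: if one "gridpoint" is an exact linear function of another, the empirical covariance matrix gets a zero eigenvalue and is only positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)   # temperature at one gridpoint
y = 0.8 * x                 # a second gridpoint, perfectly linearly dependent
z = rng.normal(size=1000)   # an independent third gridpoint

cov = np.cov(np.vstack([x, y, z]))
eigvals = np.linalg.eigvalsh(cov)  # ascending order
# The smallest eigenvalue is (numerically) zero: positive semi-definite,
# so Cholesky would fail but the pseudo-determinant/pseudo-inverse still work.
print(eigvals)
```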
This happens here: https://github.com/MESMER-group/mesmer/blob/188881d30ba82765ad506d8bd97b405c44e6d120/mesmer/stats/_localized_covariance.py#L332-L333
At the moment the only case where the chosen localized covariance matrix might be singular (= contain eigenvalues equal to zero) is when already the very first localization radius leads to a singular matrix, in which case we actually return the last localization radius, which we then don't check for singularity. This is a bug that needs fixing in any case. See here: https://github.com/MESMER-group/mesmer/blob/188881d30ba82765ad506d8bd97b405c44e6d120/mesmer/stats/_localized_covariance.py#L288-L297
This part is also a bit messy because the warning "Singular matrix for localization_radius of {localization_radius}. Skipping this radius." is now triggered by scipy.stats.Covariance.from_cholesky(np.linalg.cholesky(covariance)) and no longer by scipy.stats.multivariate_normal.logpdf(data, cov=cov). While the statement is still true, we might want to change it to the same warning we use when Cholesky throws an error in the drawing method here. Also, my understanding is that we do not actually skip this radius when this happens, but stop the fitting and return the previous radius.
All in all, I think we should allow a singular matrix overall and use eigh() as a fallback. Moreover, we should take a close look at the minimize_local_discrete function again, streamline the warnings, and possibly already include #312 in this (and do #35 before?). I don't think exposing a method argument to the user is meaningful though, if eigh() is quick and robust and the only other option would be svd, which is slowest.
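A minimal sketch of what such a fallback could look like (the function name and structure are hypothetical, not MESMER's actual implementation; scipy.stats.Covariance requires scipy >= 1.10):

```python
import numpy as np
from scipy import stats

def logpdf_with_fallback(data, cov):
    # Fast path: Cholesky only works for positive definite matrices.
    try:
        chol = np.linalg.cholesky(cov)
        return stats.multivariate_normal.logpdf(
            data, cov=stats.Covariance.from_cholesky(chol)
        )
    except np.linalg.LinAlgError:
        # Fallback: eigendecomposition-based logpdf via the pseudo-determinant
        # and pseudo-inverse, which also handles positive semi-definite matrices.
        return stats.multivariate_normal.logpdf(data, cov=cov, allow_singular=True)
```

This way the check "is it positive definite?" and the fast path are the same operation, so no separate test is needed.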
Thanks for looking into this again. I am a bit unsure about the fallback option (in _get_neg_loglikelihood for the different localisation radii). Are the computed log-likelihoods consistent with each other? (I.e. could the Cholesky log-likelihood be twice as big as the eigh one for the same radius?) Or maybe this is just fine... I think we use the localization radius before the LinAlgError because all subsequent radii also lead to an error. But maybe I misremember, or this is no longer correct for cholesky (or eigh).
So I currently have a small preference for erroring and adding an argument - not because someone would want to choose svd, but to make it explicit, in the sense of: for this model we use eigh. But I may be overthinking this and can be convinced otherwise.
Could it be a problem if we switch back and forth between cholesky and eigh
Good point, I did not think about that. In theory, the value of the log-likelihood should be invariant to the decomposition method. In practice, it can lead to numerically different results, but the difference should be small - small enough, I would say, to not influence the choice of the radius.
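A quick way to check this claim on a toy matrix (not actual MESMER data): evaluate the same logpdf through a Cholesky-based and an eigh-based scipy.stats.Covariance object and compare.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
A = rng.normal(size=(4, 4))
cov = A @ A.T + 4.0 * np.eye(4)  # a well-conditioned SPD matrix
x = rng.normal(size=4)

via_chol = stats.multivariate_normal.logpdf(
    x, cov=stats.Covariance.from_cholesky(np.linalg.cholesky(cov))
)
w, v = np.linalg.eigh(cov)
via_eigh = stats.multivariate_normal.logpdf(
    x, cov=stats.Covariance.from_eigendecomposition((w, v))
)
# The two values agree up to rounding error.
print(abs(via_chol - via_eigh))
```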
Is it an issue if we use eigh for the log-likelihood, but are then able to use Cholesky for the multivariate normal (is that even possible)?
This should be possible, because the covariance matrix itself is not influenced by the decomposition method. The only thing is that it could be singular, and then we would need the fallback option.
I think we use the localization radius before the LinAlgError because all subsequent radii also lead to an error. But maybe I misremember, or this is no longer correct for cholesky (or eigh).
Ah, that is good to know. I will check this, but it makes sense: a larger radius means less damping of correlations, which means it becomes more likely to have linear dependencies. However, it is still a bug that if the first radius already leads to a singular matrix, we return the largest radius, no? Now that I think about it more, it is probably the best option, but we should be more verbose about it.
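A toy illustration of both points (the exponential taper below is a stand-in for whatever localizer is actually used, e.g. Gaspari-Cohn): a short radius damps the offending correlation enough to restore positive definiteness, while a very large radius barely changes the singular sample covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
data = np.vstack([x, x, rng.normal(size=50)])  # two identical "gridpoints"
cov = np.cov(data)                             # singular sample covariance

# Pairwise "distances" between the three gridpoints (just their indices here).
dist = np.abs(np.subtract.outer(np.arange(3.0), np.arange(3.0)))

def localize(cov, radius):
    # Elementwise (Schur) product with a distance-based taper.
    return cov * np.exp(-dist / radius)

# Small radius: off-diagonal entries strongly damped, smallest eigenvalue
# clearly positive. Huge radius: taper ~= 1, matrix stays (numerically) singular.
print(np.linalg.eigvalsh(localize(cov, 0.5)).min())
print(np.linalg.eigvalsh(localize(cov, 1e6)).min())
```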
make it explicit, in the sense of - for this model we use eigh
I see what you mean. I think using eigh for the whole model is maybe cleanest, but maybe displaying a warning "encountered singular covariance matrix - switching to eigh()" would suffice? With what you said above, that every subsequent radius also leads to a singular matrix, we would only switch once. It would be nice to have an example and see whether the covariance matrix that is still positive definite is maybe the better one anyway...
However it is still a bug that if the first radius already leads to a singular matrix, we return the largest radius no?
Yes - I did not consider this case yesterday. I think what we have to do is turn L77 into an error:
Generally, ignoring inf in a minimization routine is fine, but if all of them are inf it should probably raise an error. So, good catch!
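A sketch of the proposed behaviour (hypothetical helper, not the actual minimize_local_discrete signature, which also stops early): individual inf values are ignored by the argmin, but an all-inf result raises.

```python
import numpy as np

def best_radius(radii, neg_loglikelihood):
    nll = np.array([neg_loglikelihood(r) for r in radii])
    if np.isinf(nll).all():
        # Every radius led to a singular covariance matrix: nothing to choose.
        raise ValueError("no localization radius yields a valid covariance matrix")
    return radii[int(np.argmin(nll))]  # inf entries simply never win
```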
Indeed switching the decomposition does not seem to change the nll - running the code snippet below with and without the diff results in the same nll.
@veni-vidi-vici-dormivi can you try the same with a non-trivial localizer to confirm? See
I see what you mean, I think using eigh for the whole model is maybe cleanest but maybe displaying a warning "encountered singular covariance matrix - switching to eigh()" would suffice?
Given the above tests I think it's ok to do. My last reservation is that it may lead to very ugly code (how do we avoid checking Cholesky first for every iteration?). But maybe this can be done elegantly... (And anyway, code elegance is not a good reason to not do something.)
We now use Cholesky decomposition for finding the localization radius and for drawing innovations (see #408). When the covariance matrix is not positive definite we use eigh(). Depending on how often this is actually the case, we could switch to using Cholesky and, if it fails, require the user to provide an alternative method.

Another add-on: localizing the covariance matrix ensures positive semi-definiteness (see #402 for an explanation) but not positive definiteness, which is needed for Cholesky. We could check already in _localized_covariance whether cov is positive definite and set the decomposition method there. We could implement several options here for the user: 1) give the user the option to enforce positive definiteness of the adjusted covariance matrix and throw an error if it is not. However, this is not overly useful I think, because what is the user gonna do then? 2) warn if the adjusted covariance is not positive definite and set the decomposition method for drawing emulations to eigh() or some user-given option; 3) in any case, give the option to choose the decomposition method when drawing emulations in _auto_regression.

In combination with this, it maybe would make sense to completely factor out all covariance wrangling into its own module... but then nothing is really happening in _localized_covariance... Maybe the best option would be to make a covariance_utils module to be used in _localized_covariance and _auto_regression. In _auto_regression also some refactoring could be done, see #165.
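One way to sketch option 2) (hypothetical helper name; the positive-definiteness check piggy-backs on Cholesky itself):

```python
import warnings
import numpy as np

def choose_decomposition_method(cov):
    """Return "cholesky" if cov is positive definite, else warn and
    return "eigh". Hypothetical sketch of the check discussed above."""
    try:
        np.linalg.cholesky(cov)
        return "cholesky"
    except np.linalg.LinAlgError:
        warnings.warn(
            "Adjusted covariance matrix is not positive definite - "
            "using eigh() for drawing emulations."
        )
        return "eigh"
```

The chosen string could then be passed on to the drawing routine, while still allowing the user to override it.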