Open fmendelssohn opened 3 years ago
Yes, this is because the root decomposition estimated from Lanczos is completely wrong (with the identity matrix all eigenvalues are one and thus repeated):
root_est = mvn.lazy_covariance_matrix.root_decomposition()
torch.norm(root_est.evaluate() - torch.eye(n)) / torch.eye(n).norm()
# tensor(0.9997)
root_est.root.evaluate().svd()[1]
# tensor([1.7319e+00, 1.9072e-02, 6.6341e-04])
# note that the root itself is 10k x 3
# the estimated eigenvalues are the squares of these singular values
Note that the relative error of the approximate root decomposition is 1 (so it is completely wrong).
In general, Lanczos is going to work better when the eigenvalues are well separated, not when they are all equal to one.
Thanks @wjmaddox for clarifying. This is indeed a toy example to demonstrate the issue; I originally ran into this with an RBF kernel matrix from the dataset I've been working on, and it took me a while to narrow the issue down to the problematic samples. I wonder:
1. Should the package at least raise a warning when Lanczos is used on such covariance matrices? (The phase transition from n = 1000 to larger sizes was also a bit confusing at first, before I found that it was caused by the switch from Cholesky to Lanczos.)
2. I'm not an expert on this, but are there other solutions for drawing random samples from large covariance matrices? Given the caveats you noted with Lanczos, and the fact that even the IID case is a pathology here, I wonder whether one should stick with Cholesky and simply give up for n too large, rather than use potentially problematic samples whose accuracy one has no knowledge of or control over. This seems rather dangerous for downstream applications (as I have learned first-hand).
3. Incidentally, are similar issues present in other GP learning/inference procedures in GPyTorch?
Thanks and curious to hear your thoughts.
Interesting, I'm somewhat surprised to see that you were running into issues with an RBF kernel, as there you should just be able to crank up the size of the Lanczos decomposition...
In your setting, it's pretty much impossible to tell without expensive checks that are equivalent to either matrix multiplications (like the norm check I did) or just running symeig on the matrix.
The other obvious solution would be iterative in nature if your matrix has no structure in general --- for example, the contour integral quadrature (CIQ) sampling procedure that can be turned on. Unless you wanted to use random Fourier features (RFFs) instead?
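That procedure is enabled with a context manager; a minimal sketch (mvn being the distribution from the example above):

import torch
import gpytorch

# draw samples via contour integral quadrature instead of Lanczos;
# slower, but more robust on ill-conditioned covariance matrices
with gpytorch.settings.ciq_samples(True):
    samples = mvn.rsample(torch.Size([5]))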
Yes, numerical issues play a pretty big role in GPyTorch performance. The defaults are pretty sensible, but they don't work for every situation --- see both my recent workshop paper for a practical perspective and Geoff's ICML paper, which dives into the statistical properties.
Should the package at least raise a warning when Lanczos is used on such covariance matrices? (The phase transition from n = 1000 to larger sizes was also a bit confusing at first, before I found that it was caused by the switch from Cholesky to Lanczos.)

You can use the context manager gpytorch.settings.verbose_linalg(True) to see which operations are being performed.
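For example (a small sketch, reusing the mvn from above):

import gpytorch

# logs each linear-algebra routine (Cholesky, Lanczos, ...) as it is dispatched
with gpytorch.settings.verbose_linalg(True):
    mvn.rsample()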
I'm curious about the particular RBF kernel - do you remember approximately what the lengthscale was and the dimensionality of the data, and was there any added observational noise? Numerical issues will arise with very small or very large lengthscales and/or little observational noise.
@jacobrgardner and I were also considering using pivoted Cholesky rather than Lanczos for sampling from these matrices - it has better theoretical guarantees and would behave a bit more like Cholesky-based sampling.
This is a pretty old issue, but it looks like it ended with some open questions about the impact of this problem when using an RBF kernel. I can confirm that the issue is quite severe and happens even with quite sane settings for the lengthscale and likelihood variance - as long as the input size is large, it looks like it's always wrong.
See this image, where I draw 5 samples from the posterior of a GP conditioned on the two black datapoints, with a lengthscale of 0.5 and likelihood variance of 0.1. As the input size $n$ grows, there's a huge drop in noise.
Note that the issue disappears when enabling gpytorch.settings.ciq_samples, at the expense of some slowdown. I've included the code to generate this plot below.
(PyTorch 2.2.0, GPyTorch 1.11, running on an A100 on Linux)
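The code block itself was not preserved in this copy of the thread; below is a sketch of the kind of script described, with the model class, training points, and grid sizes as assumptions:

import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.RBFKernel()

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

train_x = torch.tensor([0.25, 0.75])  # the two black datapoints
train_y = torch.tensor([1.0, -1.0])
likelihood = gpytorch.likelihoods.GaussianLikelihood()
likelihood.noise = 0.1                 # likelihood variance 0.1
model = ExactGPModel(train_x, train_y, likelihood)
model.covar_module.lengthscale = 0.5   # lengthscale 0.5
model.eval()
likelihood.eval()

for n in [100, 1000, 10000]:
    test_x = torch.linspace(0, 1, n)
    with torch.no_grad():
        posterior = likelihood(model(test_x))
        # 5 posterior draws; the sampled noise collapses for large n
        samples = posterior.rsample(torch.Size([5]))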
To add to my previous comment - I'm not sure whether it's a fair comparison, but sampling from the vanilla PyTorch torch.distributions.MultivariateNormal appears to substantially outperform gpytorch.distributions.MultivariateNormal in terms of both time and stability at high $n$. It seems that GPyTorch does some caching of intermediate results, but even after that it's slower than PyTorch. Is there anything else that means GPyTorch must have its own separate implementation?
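A minimal sketch of such a comparison (identity covariance assumed; exact timings will vary):

import time
import torch
import gpytorch

n = 10000
mean, cov = torch.zeros(n), torch.eye(n)
for name, mvn in [
    ("torch", torch.distributions.MultivariateNormal(mean, cov)),
    ("gpytorch", gpytorch.distributions.MultivariateNormal(mean, cov)),
]:
    start = time.time()
    samples = mvn.rsample(torch.Size([5]))
    # the sample std should be ~1.0 for an identity covariance
    print(name, time.time() - start, samples.std().item())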
I'm not sure about the noise issue. We have wanted to replace the sampling code with something akin to pathwise sampling, but such an effort will likely need to be contributed by the community.
Regarding the second issue: the GPyTorch multivariate normal is fastest when it is used in conjunction with LinearOperators (rather than dense covariance matrices).
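For instance, a sketch with a structured covariance (assuming a GPyTorch version that uses the linear_operator package):

import torch
import gpytorch
from linear_operator.operators import DiagLinearOperator

n = 10000
# a diagonal covariance as a LinearOperator: no dense n x n matrix is ever formed
mvn = gpytorch.distributions.MultivariateNormal(
    torch.zeros(n), DiagLinearOperator(torch.ones(n))
)
samples = mvn.rsample(torch.Size([5]))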
TLDR: the sampling code definitely needs revisiting, but it'll require a community PR.
We have wanted to replace the sampling code with something akin to pathwise sampling
FWIW, we have implemented support for pathwise sampling in BoTorch, but it doesn't support generic GPyTorch models at this point: https://github.com/pytorch/botorch/tree/main/botorch/sampling/pathwise
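A sketch of how that is used for a supported model (SingleTaskGP here is an assumption; see the linked module for what is supported):

import torch
from botorch.models import SingleTaskGP
from botorch.sampling.pathwise import draw_matheron_paths

train_x = torch.rand(20, 1, dtype=torch.double)
train_y = torch.sin(6 * train_x)
model = SingleTaskGP(train_x, train_y)

# each path is a deterministic function; evaluating it at many points
# avoids the large-n covariance root decomposition entirely
paths = draw_matheron_paths(model, sample_shape=torch.Size([5]))
test_x = torch.linspace(0, 1, 10000, dtype=torch.double).unsqueeze(-1)
samples = paths(test_x)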
Possible bug in sampling from MultivariateNormal with Lanczos

Sampling from MultivariateNormal appears to be off by (at least) a scaling factor when Lanczos is used (for larger n). (I'm relatively new to GPyTorch, so I could be mistaken somewhere.)

To reproduce
I've isolated the issue to a pretty minimal example:
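(The snippet itself was not preserved in this copy; the following is a minimal sketch reconstructed from the discussion above, where the covariance is the identity and n = 10000.)

import torch
import gpytorch

n = 10000  # large enough that GPyTorch switches from Cholesky to Lanczos
mvn = gpytorch.distributions.MultivariateNormal(torch.zeros(n), torch.eye(n))
print(mvn.rsample().std())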
Expected Behavior
The correct output should be around 1.0 (the marginal SD of independent Normal r.v.s), but with the n = 10000 setting above, the output is around tensor(0.02). The output is correct when Cholesky is used (for smaller n).

System information