cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch

Sampling from MultivariateNormal with Lanczos #1742

Open fmendelssohn opened 3 years ago

fmendelssohn commented 3 years ago

Possible bug in sampling from MultivariateNormal with Lanczos

Sampling from MultivariateNormal appears to be off by (at least) a scaling factor when Lanczos is used (i.e. for larger n). (I'm relatively new to GPyTorch, so I could be mistaken somewhere.)

To reproduce

I've isolated the issue to a pretty minimal example:

```python
import torch
from gpytorch.distributions import MultivariateNormal

# n = 100   # small n: Cholesky is called; works correctly
n = 10000   # large n: Lanczos is used
mvn = MultivariateNormal(mean=torch.zeros(n), covariance_matrix=torch.eye(n))
mvn.rsample().std()
```

Expected Behavior

The correct output should be around 1.0 (the marginal SD of independent Normal r.v.s), but with the n = 10000 setting above, the output is around tensor(0.02). The output is correct when Cholesky is used (for smaller n).
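One way to confirm that the Cholesky-to-Lanczos switch is the culprit, assuming the cutoff is governed by gpytorch.settings.max_cholesky_size (default 800 in recent releases), is to raise that threshold and re-run the snippet. A minimal sketch:

```python
import torch
import gpytorch
from gpytorch.distributions import MultivariateNormal

n = 10000

# Default settings: n is above the Cholesky threshold, so Lanczos is used.
mvn = MultivariateNormal(mean=torch.zeros(n), covariance_matrix=torch.eye(n))
print(mvn.rsample().std())  # ~0.02, as reported above

# Raising the threshold forces the Cholesky path and restores the expected scale
# (this factorizes the full 10000 x 10000 matrix, so it takes a moment).
mvn = MultivariateNormal(mean=torch.zeros(n), covariance_matrix=torch.eye(n))
with gpytorch.settings.max_cholesky_size(n + 1):
    print(mvn.rsample().std())  # ~1.0
```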

System information

wjmaddox commented 3 years ago

Yes, this is because the root decomposition estimated from Lanczos is completely wrong here (with the identity matrix, all eigenvalues are one and thus repeated):

```python
root_est = mvn.lazy_covariance_matrix.root_decomposition()

torch.norm(root_est.evaluate() - torch.eye(n)) / torch.eye(n).norm()
# tensor(0.9997)

root_est.root.evaluate().svd()[1]
# tensor([1.7319e+00, 1.9072e-02, 6.6341e-04])
# note that the root itself is 10k x 3
# the estimated eigenvalues are the squares of these singular values
```

Note that the relative error of the approximate root decomposition is 1 (so completely wrong).

In general, Lanczos is going to work better when the eigenvalues are well separated, not all equal to one as they are here.
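For reference, the same check can be wrapped into a small diagnostic to run before trusting samples from any covariance. This is only a sketch built from the calls above; the helper name is mine, not a GPyTorch API:

```python
import torch
from gpytorch.distributions import MultivariateNormal


def root_decomposition_rel_error(mvn: MultivariateNormal) -> float:
    """Relative Frobenius error of the root decomposition GPyTorch would sample with.

    This materializes dense matrices, so it is only practical as an offline
    diagnostic on moderately sized problems.
    """
    lazy_cov = mvn.lazy_covariance_matrix
    # .evaluate() on older GPyTorch; .to_dense() on versions built on linear_operator
    approx = lazy_cov.root_decomposition().evaluate()  # R @ R.T
    exact = lazy_cov.evaluate()
    return (torch.norm(approx - exact) / torch.norm(exact)).item()


n = 2000  # above the default Cholesky threshold, so Lanczos is used
mvn = MultivariateNormal(torch.zeros(n), torch.eye(n))
print(root_decomposition_rel_error(mvn))  # ~1.0 here, i.e. the root is unusable
```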

fmendelssohn commented 3 years ago

Thanks @wjmaddox for clarifying. This is indeed a toy example to demonstrate the issue; I originally ran into it with an RBF kernel matrix from the dataset I've been working on, and it took me a while to trace the problematic samples back to this. I wonder:

  1. Should the package at least raise a warning when Lanczos is used on such covariance matrices? (The phase transition from n = 1000 to larger sizes was also a bit confusing at first, before I found that it was caused by the switch from Cholesky to Lanczos.)

  2. I'm not an expert on this, but are there other options for drawing random samples from larger covariance matrices? Given the caveats you noted with Lanczos, and the fact that even the i.i.d. case is a pathology here, I wonder whether one should stick with Cholesky and simply give up when n is too large, rather than use potentially problematic samples whose accuracy one has no knowledge of or control over. That seems rather dangerous for downstream applications (as I have learned first-hand).

  3. Incidentally, I wonder if similar issues are present in other GP learning/inference procedures in GPyTorch?

Thanks and curious to hear your thoughts.

wjmaddox commented 3 years ago

Interesting. I'm somewhat surprised that you were running into issues with an RBF kernel, since there you should just be able to crank up the size (rank) of the Lanczos decomposition...

In your setting, it's pretty much impossible to tell that the samples are off without expensive checks, which are equivalent to either matrix multiplications like the norm check I did above or just running symeig on the full matrix.

If your matrix has no structure to exploit, the other obvious solutions are iterative in nature --- for example, the contour integral quadrature (CIQ) sampling procedure, which can be turned on via a settings flag. Alternatively, you could use random Fourier features (RFFs) instead.
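For concreteness, the CIQ path is exposed as a context-manager flag (the same one used in the reproduction script further down the thread). A minimal sketch on the toy example from the original report:

```python
import torch
import gpytorch
from gpytorch.distributions import MultivariateNormal

n = 10000
mvn = MultivariateNormal(torch.zeros(n), torch.eye(n))

# Draw the sample via contour integral quadrature instead of the Lanczos root.
with gpytorch.settings.ciq_samples(True):
    print(mvn.rsample().std())  # should come out close to 1.0, at some extra cost
```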

Yes, numerical issues play a pretty big role in GPyTorch performance. The defaults are pretty sensible, but they don't work for every situation --- see my recent workshop paper for a bit of a practical perspective, and Geoff's ICML paper, which dives into the statistical properties.

gpleiss commented 2 years ago

> Should the package at least raise a warning when Lanczos is used on such covariance matrices? (The phase transition from n = 1000 to larger sizes was also a bit confusing at first, before I found that it was caused by the switch from Cholesky to Lanczos.)

You can use the context manager with gpytorch.settings.verbose_linalg(True) to see what operations are being performed.
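For example, applied to the snippet from the original report (a minimal sketch):

```python
import torch
import gpytorch
from gpytorch.distributions import MultivariateNormal

n = 10000
mvn = MultivariateNormal(torch.zeros(n), torch.eye(n))

# With the flag on, GPyTorch logs each linear-algebra routine it dispatches
# (e.g. Cholesky vs. Lanczos), which makes the n-dependent switch visible.
with gpytorch.settings.verbose_linalg(True):
    mvn.rsample()
```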

I'm curious about the particular RBF kernel - do you remember approximately what the lengthscale was, the dimensionality of the data, and was there any added observational noise? Numerical issues will arise with very low/very large lengthscales and/or little observational noise.

@jacobrgardner and I were also considering using pivoted Cholesky rather than Lanczos for sampling from these matrices - it would have better theoretical guarantees and would behave a bit more like Cholesky-based sampling.
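Pivoted Cholesky is easy to sketch directly in plain torch, which also shows where the better guarantee comes from: the residual diagonal it maintains is an explicit certificate of the approximation error. This is only an illustrative sketch under those assumptions, not GPyTorch's implementation:

```python
import torch


def pivoted_cholesky(A: torch.Tensor, rank: int, tol: float = 1e-10) -> torch.Tensor:
    """Greedy low-rank pivoted Cholesky: returns L (n x r) with L @ L.T ~= A.

    The running vector `d` is the diagonal of the residual A - L @ L.T, so
    `d.sum()` (the residual trace) bounds the approximation error.
    """
    n = A.shape[-1]
    d = A.diagonal().clone()
    L = torch.zeros(n, rank, dtype=A.dtype, device=A.device)
    for m in range(rank):
        i = int(torch.argmax(d))
        if d[i] <= tol:  # residual already negligible; stop early
            return L[:, :m]
        L[:, m] = (A[:, i] - L[:, :m] @ L[i, :m]) / d[i].sqrt()
        d = (d - L[:, m] ** 2).clamp(min=0.0)
    return L


# Toy usage: an RBF kernel matrix, whose spectrum decays quickly.
x = torch.linspace(0, 1, 2000, dtype=torch.double).unsqueeze(-1)
K = torch.exp(-0.5 * torch.cdist(x, x) ** 2 / 0.5 ** 2)
L = pivoted_cholesky(K, rank=50)
print((torch.norm(L @ L.T - K) / K.norm()).item())  # small relative error
approx_sample = L @ torch.randn(L.shape[-1], dtype=torch.double)  # ~ N(0, K), up to the certified error
```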

slishak-PX commented 7 months ago

This is a pretty old issue, but it looks like it ended with some open questions about the impact of this problem when using an RBF kernel. I can confirm that the issue is quite severe and happens even with quite sane settings for the lengthscale and likelihood variance: as long as the number of test points n is large, the samples appear to always be wrong.

See the figure below, where I draw 5 samples from the posterior of a GP conditioned on the two black data points, with a lengthscale of 0.5 and likelihood variance of 0.1. As the input size $n$ grows, there is a huge drop in the noise level of the samples.

[figure: posterior samples for n = 500, 1000, 5000, 10000, 50000]

Note that the issue disappears when enabling gpytorch.settings.ciq_samples at the expense of some slowdown. I've included the code to generate this plot below.

(PyTorch 2.2.0, GPyTorch 1.11, running on an A100 on Linux)

Code to reproduce:

```python
import gpytorch
import torch
from plotly.subplots import make_subplots
import plotly.colors


class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, use_keops=False):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        if use_keops:
            self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.keops.RBFKernel())
        else:
            self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


# Options
likelihood_var = 0.1
rbf_lengthscale = 0.5
cuda = True
use_ciq = False
n_list = [500, 1000, 5000, 10000, 50000]

if cuda:
    device = torch.device('cuda:0')
else:
    device = torch.device('cpu')

torch.manual_seed(0)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
likelihood.noise = likelihood_var

x_train = torch.linspace(0, 1, 2, device=device)
y_train = x_train * 0

model = ExactGPModel(x_train, y_train, likelihood, use_keops=False).to(device)
model.covar_module.base_kernel.lengthscale = rbf_lengthscale
model.eval()

fig = make_subplots(cols=len(n_list), shared_yaxes=True, shared_xaxes=True, subplot_titles=[f"n={n}" for n in n_list])
for i, n in enumerate(n_list):
    x = torch.linspace(0, 1, n, device=device)
    with torch.no_grad(), gpytorch.settings.ciq_samples(use_ciq):
        dist = likelihood(model(x))
        y = dist.rsample(torch.Size([5]))

    x = x.cpu()
    for j, sample in enumerate(y.cpu()):
        fig.add_scatter(
            x=x,
            y=sample,
            name=f'n={n}',
            showlegend=False,
            legendgroup=n,
            line_color=plotly.colors.DEFAULT_PLOTLY_COLORS[j],
            opacity=0.9,
            row=1,
            col=i + 1,
        )
    fig.add_scatter(
        x=x_train.cpu(),
        y=y_train.cpu(),
        mode="markers",
        marker_size=10,
        marker_color="black",
        row=1,
        col=i + 1,
        showlegend=False,
    )
fig.update_yaxes(title_text="y", col=1).update_xaxes(title_text="x")
```
slishak-PX commented 7 months ago

To add to my previous comment: I'm not sure whether it's a fair comparison, but sampling from the vanilla PyTorch torch.distributions.MultivariateNormal appears to substantially outperform gpytorch.distributions.MultivariateNormal in terms of both time and stability at high $n$. GPyTorch does some caching of intermediate results, but even after that it is slower than PyTorch. Is there anything else that requires GPyTorch to have its own separate implementation?

[figure: mean time per sample and empirical standard deviation vs. number of samples, PyTorch vs. GPyTorch]

Code to reproduce:

```python
import time

import torch
from plotly.subplots import make_subplots

from gpytorch.distributions import MultivariateNormal
from torch.distributions import MultivariateNormal as PTMultivariateNormal

cuda = True
n_arr = torch.logspace(1, 4, 10).int()
n_repeat = 100

torch.manual_seed(0)
if cuda:
    device = torch.device('cuda:1')
else:
    device = torch.device('cpu')

pt_times = torch.empty(len(n_arr), n_repeat)
gpy_times = torch.empty(len(n_arr), n_repeat)
pt_stds = torch.empty(len(n_arr), n_repeat)
gpy_stds = torch.empty(len(n_arr), n_repeat)

fig = make_subplots(rows=2, shared_xaxes=True)
point_kwargs = {"marker_opacity": 0.5, "marker_size": 2, "showlegend": False}

for i_n, n in enumerate(n_arr):
    mean = torch.zeros(n, device=device)
    cov = torch.eye(n, device=device)
    mvn = MultivariateNormal(mean, cov)
    mvn_pt = PTMultivariateNormal(mean, cov)

    for i_rpt in range(n_repeat):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        gpy_std = mvn.rsample().std()
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        pt_std = mvn_pt.rsample().std()
        torch.cuda.synchronize()
        t2 = time.perf_counter()

        gpy_times[i_n, i_rpt] = t1 - t0
        pt_times[i_n, i_rpt] = t2 - t1
        gpy_stds[i_n, i_rpt] = gpy_std
        pt_stds[i_n, i_rpt] = pt_std

    x_jittered = n * (0.95 + 0.1 * torch.rand(n_repeat))
    fig.add_scatter(x=x_jittered, y=pt_times[i_n], mode="markers", row=1, col=1, marker_color="red", **point_kwargs)
    fig.add_scatter(x=x_jittered, y=gpy_times[i_n], mode="markers", row=1, col=1, marker_color="blue", **point_kwargs)
    fig.add_scatter(x=x_jittered, y=pt_stds[i_n], mode="markers", row=2, col=1, marker_color="red", **point_kwargs)
    fig.add_scatter(x=x_jittered, y=gpy_stds[i_n], mode="markers", row=2, col=1, marker_color="blue", **point_kwargs)

fig.add_scatter(x=n_arr, y=pt_times.mean(axis=1), name='PyTorch', row=1, col=1, line_color="red")
fig.add_scatter(x=n_arr, y=gpy_times.mean(axis=1), name='GPyTorch', row=1, col=1, line_color="blue")
fig.add_scatter(x=n_arr, y=pt_stds.mean(axis=1), name='PyTorch', row=2, col=1, line_color="red", showlegend=False)
fig.add_scatter(x=n_arr, y=gpy_stds.mean(axis=1), name='GPyTorch', row=2, col=1, line_color="blue", showlegend=False)

fig.update_layout(barmode='overlay', height=700)
fig.update_yaxes(title_text="Mean time per sample (s)", type="log", row=1)
fig.update_yaxes(title_text="Empirical standard deviation", row=2)
fig.update_xaxes(title_text="Number of samples", type="log")
```
gpleiss commented 7 months ago

I'm not sure about the noise issue. We have wanted to replace the sampling code with something akin to pathwise sampling, but such an effort will likely need to be contributed by the community.

Regarding the second issue: the gpytorch multivariate normal is fastest when it is used in conjunction with LinearOperators (rather than dense covariance matrices).
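As a concrete illustration of that point, here is a sketch using the linear_operator package that recent GPyTorch releases build on (older versions exposed the same idea as gpytorch.lazy.DiagLazyTensor):

```python
import torch
from linear_operator.operators import DiagLinearOperator
from gpytorch.distributions import MultivariateNormal

n = 10000

# The n x n identity represented as a structured operator (O(n) memory);
# its root decomposition is exact, so sampling is both fast and well scaled.
cov = DiagLinearOperator(torch.ones(n))
mvn = MultivariateNormal(torch.zeros(n), cov)

print(mvn.rsample().std())  # ~1.0
```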

TLDR: the sampling code definitely needs revisiting, but it'll require a community PR.

Balandat commented 7 months ago

> We have wanted to replace the sampling code with something akin to pathwise sampling

FWIW, we have implemented support for pathwise sampling in BoTorch, but it doesn't support generic GPyTorch models at this point: https://github.com/pytorch/botorch/tree/main/botorch/sampling/pathwise
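A rough usage sketch against that module (the helper name and signature are as I recall them from the BoTorch docs, so double-check before relying on this):

```python
import torch
from botorch.models import SingleTaskGP
from botorch.sampling.pathwise import draw_matheron_paths

train_x = torch.rand(20, 1, dtype=torch.double)
train_y = torch.sin(6.0 * train_x)

model = SingleTaskGP(train_x, train_y)

# Each "path" is a deterministic function approximately distributed like a
# posterior sample; it can be evaluated at arbitrarily many test points
# without ever forming (or factorizing) a large covariance matrix.
paths = draw_matheron_paths(model, sample_shape=torch.Size([5]))

test_x = torch.linspace(0, 1, 10_000, dtype=torch.double).unsqueeze(-1)
samples = paths(test_x)  # approximate posterior samples, one row per path
```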