cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch

[Bug] GP Predictions on GPU are extremely noisy for small likelihood noise #1960

Open jlko opened 2 years ago

jlko commented 2 years ago

šŸ› Bug

The predictions of the GP become extremely noisy (i.e. wrong) when using GPU acceleration. This behaviour becomes more pronounced as the likelihood noise of the GP is decreased.

This bug seems to be related to the hardware/Python versions I am running on. We did not observe this behaviour on other machines with different GPUs.

The GPU is an NVIDIA GeForce RTX 3080 and I'm using PyTorch 1.11. The 3080 used to be difficult to get working with CUDA support (it needed the PyTorch nightly), but that is no longer the case. I'm not sure why this is going wrong; the 3080 has never behaved strangely in the past (after getting CUDA to work).

Maybe my pip-installed gpytorch does not play nicely with the conda-installed PyTorch here?

To reproduce

The script below initialises a simple exact GP model. We then sample random data, condition the GP on this data, and plot the posterior predictive distribution. When GP inference happens on the GPU, the predictions are wildly noisy and differ from the CPU-based predictions, especially if likelihood.noise is small.

import matplotlib.pyplot as plt
import torch
import gpytorch

def dcn(tensor):
    # Detach, move to CPU, and convert to NumPy for plotting.
    return tensor.detach().cpu().numpy()

class ExactGPModel(gpytorch.models.ExactGP):
    # A minimal exact GP: constant mean, scaled RBF kernel.
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

def get_gp_model(x, y, noise):
    # Exact GP with the Gaussian likelihood noise initialised to `noise`.
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    likelihood.noise = noise
    model = ExactGPModel(x, y, likelihood)
    return model, likelihood

def main():
    train_features = torch.rand(5)
    train_targets = torch.rand(5)

    noise = 1e-4

    print(
        f'Running with noise {noise}. '
        f'For small noise <=1e-3, things get ugly on cuda.')

    plt.figure()
    plt.scatter(
        dcn(train_features), dcn(train_targets),
        label='train', color='C2', zorder=20)

    for i, device in enumerate(['cpu', 'cuda']):
        train_features = train_features.to(device)
        train_targets = train_targets.to(device)
        test_features = torch.linspace(0, 1, 100, device=device)

        gp_model, gp_likelihood = get_gp_model(
            train_features, train_targets, noise)
        gp_model = gp_model.to(device)
        gp_likelihood = gp_likelihood.to(device)
        gp_model.eval()
        gp_likelihood.eval()

        # Posterior predictive at the test points; no gradients needed.
        with torch.no_grad():
            gp_pred = gp_likelihood(gp_model(test_features))
            gp_mean = gp_pred.mean
            gp_lower, gp_upper = gp_pred.confidence_region()

        plt.plot(
            dcn(test_features), dcn(gp_mean),
            label=f'pred {device}', c=f'C{i}', zorder=10)
        plt.fill_between(
            dcn(test_features), dcn(gp_upper), dcn(gp_lower),
            alpha=0.3, color=f'C{i}')

    plt.legend()
    plt.savefig('tmp.png')  # Save before show(), which blocks and closes the figure.
    plt.show()

if __name__ == '__main__':
    main()

Stack trace/error message

The GP posterior predictives are extremely noisy when using the cuda device. Everything is fine on the CPU.

As the likelihood noise value increases, the differences become less noticeable.

Here are some of the plots this script creates for different noise values:

[Three plots attached, one per noise value, comparing CPU and CUDA posterior predictives.]

I do sometimes get the following warning when using the GPU:

miniconda3/envs/meta-npt/lib/python3.8/site-packages/gpytorch/distributions/multivariate_normal.py:259: NumericalWarning: Negative variance values detected. This is likely due to numerical instabilities. Rounding negative variances up to 1e-06.

Note that I did not get this warning for any of the above plots.

Expected Behavior

The predictions from the GPU and CPU should be identical. The above script produces the desired behaviour on other systems.

System information

Additional context

I've also observed errors with other gpytorch functions on this GPU when conditioning on observed data. In that case I got an actual error related to the Cholesky factorization. Again, it all worked well on the CPU.

Also: based on https://github.com/cornellius-gp/gpytorch/issues/728, I've tried predicting inside with gpytorch.settings.fast_computations(covar_root_decomposition=True, log_prob=True, solves=False), and with some of the other settings mentioned there, but nothing helped!
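
For reference, that attempt looked roughly like this, wrapping the prediction step of the repro script above (a sketch; solves=False forces Cholesky-based solves instead of conjugate gradients):

with torch.no_grad(), gpytorch.settings.fast_computations(
        covar_root_decomposition=True, log_prob=True, solves=False):
    # Same prediction step as in main(), under different solver settings.
    gp_pred = gp_likelihood(gp_model(test_features))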

Below are a pip freeze and my conda env setup: pip-freeze.txt environment.yaml

[Edit]: Updated because I've changed PyTorch versions from 1.12.dev... to the 1.11 release. The problem appears with both setups.

Balandat commented 2 years ago

I suspect this might be due to PyTorch using TF32 by default on the 3080. TF32 has lower precision than FP32, so the computations just aren't accurate enough. Try disabling TF32 in PyTorch: https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
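
Per those docs, the global switch looks like this (a minimal sketch; run it before any GP computation):

import torch

# Disable TF32 for CUDA matmuls (and cuDNN as well); on Ampere GPUs these
# reduced-precision paths are enabled by default in PyTorch 1.11.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False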

jlko commented 2 years ago

Hi @Balandat,

thanks for the quick reply. Turns out, you are spot on!

Setting torch.backends.cuda.matmul.allow_tf32 = False gives the desired behaviour: CPU and GPU results look the same.

It might be worth adding a warning to gpytorch if this is set to True on Ampere devices; it really does not work at all without disabling it. (Or maybe gpytorch should just set it to False itself?)
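
Such a check might look something like this (a hypothetical sketch, not existing gpytorch code; it assumes Ampere-or-newer GPUs report compute capability >= 8):

import warnings

import torch

def warn_if_tf32_enabled():
    # Hypothetical helper: TF32 matmuls exist on compute capability >= 8
    # (Ampere and newer) and are enabled by default as of PyTorch 1.11.
    if (torch.cuda.is_available()
            and torch.cuda.get_device_capability()[0] >= 8
            and torch.backends.cuda.matmul.allow_tf32):
        warnings.warn(
            'torch.backends.cuda.matmul.allow_tf32 is True; TF32 matmuls '
            'can make exact GP predictions numerically unstable. Consider '
            'setting it to False.')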

I'll let you close the issue if you think this is not needed.

Thanks a lot :) Jannik

Balandat commented 2 years ago

Yeah, this is a known gotcha in other contexts as well, and there is some discussion (championed by @mruberry) to potentially move the default to be FP32 rather than TF32 on Ampere.

Adding a warning to gpytorch is probably a good idea, though I faintly remember that @jacobrgardner looked into this and it wasn't super straightforward/elegant (though I might be misremembering).

mruberry commented 2 years ago

We'll have an update on the TF32->FP32 default soon, too.