cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch

[Bug] Different nan handling under GPU and CPU #1747

Open · npbaskerville opened this issue 3 years ago

npbaskerville commented 3 years ago

🐛 Bug

There are cases in which code run on CPU throws a NanError, while the same code run on GPU throws no error and silently produces NaNs, e.g. in the training loss.

To reproduce

import math 

import gpytorch 
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

# Training data is 100 points in [0,1] inclusive regularly spaced
train_x = torch.linspace(0, 1, 100).to(device)
# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()).to(device) * math.sqrt(0.04)

# Intentionally corrupt the train_y to give nans
train_y = train_y.log()
print(train_y)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood).to(device)

training_iter = 10

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # Includes GaussianLikelihood parameters

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()

On CPU:

Iter 1/10 - Loss: nan   lengthscale: 0.693   noise: 0.693
---------------------------------------------------------------------------
NanError                                  Traceback (most recent call last)
<ipython-input-17-ae7f611b728d> in <module>()
     17     output = model(train_x)
     18     # Calc loss and backprop gradients
---> 19     loss = -mll(output, train_y)
     20     loss.backward()
     21     print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (

... 8 frames omitted ...
/usr/local/lib/python3.7/dist-packages/gpytorch/utils/cholesky.py in _psd_safe_cholesky(A, out, jitter, max_tries)
     29         if isnan.any():
     30             raise NanError(
---> 31                 f"cholesky_cpu: {isnan.sum().item()} of {A.numel()} elements of the {A.shape} tensor are NaN."
     32             )
     33 

NanError: cholesky_cpu: 10000 of 10000 elements of the torch.Size([100, 100]) tensor are NaN.

but on GPU:

Iter 1/10 - Loss: nan   lengthscale: 0.693   noise: 0.693
Iter 2/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 3/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 4/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 5/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 6/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 7/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 8/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 9/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 10/10 - Loss: nan   lengthscale: nan   noise: nan

Expected Behavior

This is my question: is the behaviour above now expected, or is it a bug? I would have thought consistency between CPU and GPU is preferable. The difference appears to arise from torch itself: with version 1.8.1 one gets the NanError in both cases, whereas with 1.9.0 one gets the behaviour above. Related to https://github.com/pytorch/pytorch/issues/1810?
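In the meantime, a fail-fast guard in the training loop (just a sketch, not anything gpytorch provides) restores the CPU-like behaviour on GPU:

for i in range(training_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    # Sketch only: fail fast the way the CPU path does, instead of optimizing on NaNs.
    if torch.isnan(loss).any():
        raise RuntimeError(f"NaN loss at iteration {i + 1}; check the targets / kernel matrix")
    loss.backward()
    optimizer.step()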

System information

GPyTorch version: 1.5.1
PyTorch version: 1.9.0 (+cu102)
OS: Google Colab notebook

wjmaddox commented 3 years ago

This is on the pytorch side --- maybe open an issue for them?

from gpytorch.kernels import RBFKernel
import torch

y = torch.randn(100, 1).log() # nans
kernel = RBFKernel()
resp = kernel(y).evaluate().detach()  # 100x100 kernel matrix containing NaNs

Obviously it fails on the cpu:

resp.cholesky()
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-032bb74835fa> in <module>
----> 1 resp.cholesky()

RuntimeError: cholesky: U(1,1) is zero, singular U.

But runs fine on the gpu:

resp.to(torch.device("cuda:7")).cholesky()
# okay
tensor([[nan, 0., 0.,  ..., 0., 0., 0.],
        [nan, nan, 0.,  ..., 0., 0., 0.],
        [nan, nan, nan,  ..., 0., 0., 0.],
        ...,
        [nan, nan, nan,  ..., nan, 0., 0.],
        [nan, nan, nan,  ..., nan, nan, 0.],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:7')
Balandat commented 3 years ago

Hmm, so @wjmaddox the code path above actually goes through torch.linalg.cholesky_ex via psd_safe_cholesky. It looks like on recent torch versions cholesky_ex doesn't properly surface the info code on CUDA?

>> torch.linalg.cholesky_ex(resp)

torch.return_types.linalg_cholesky_ex(
L=tensor([[nan, 0., 0.,  ..., 0., 0., 0.],
        [nan, nan, 0.,  ..., 0., 0., 0.],
        [nan, nan, nan,  ..., 0., 0., 0.],
        ...,
        [nan, nan, nan,  ..., nan, 0., 0.],
        [nan, nan, nan,  ..., nan, nan, 0.],
        [nan, nan, nan,  ..., nan, nan, nan]]),
info=tensor(1, dtype=torch.int32))

>> torch.linalg.cholesky_ex(resp.to(torch.device("cuda")))

torch.return_types.linalg_cholesky_ex(
L=tensor([[nan, 0., 0.,  ..., 0., 0., 0.],
        [nan, nan, 0.,  ..., 0., 0., 0.],
        [nan, nan, nan,  ..., 0., 0., 0.],
        ...,
        [nan, nan, nan,  ..., nan, 0., 0.],
        [nan, nan, nan,  ..., nan, nan, 0.],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0'),
info=tensor(0, device='cuda:0', dtype=torch.int32))
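For reference, the kind of info-code check that psd_safe_cholesky relies on (a simplified sketch, not the actual gpytorch implementation) makes the discrepancy clear:

import torch

def checked_cholesky(A):
    # Simplified sketch (not gpytorch's psd_safe_cholesky): factorize, then
    # inspect the info code returned by torch.linalg.cholesky_ex.
    L, info = torch.linalg.cholesky_ex(A)
    if (info != 0).any():
        # Fires on CPU for the NaN kernel matrix above; on CUDA with torch 1.9.0
        # info comes back as 0, so the NaN factor slips through silently.
        raise RuntimeError(f"Cholesky failed with info={info}")
    return L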

@mruberry any thoughts why this may happen?

Balandat commented 3 years ago

cc @dme65, @sdaulton re some cholesky errors potentially not getting caught anymore...

sdaulton commented 3 years ago

https://github.com/pytorch/pytorch/pull/63864 looks related

Balandat commented 3 years ago

https://github.com/pytorch/pytorch/pull/63864 looks related

Does it? Seems like that commit is too new for this? @npbaskerville are you using the 1.9.0 release or the latest nightly / a custom build?

npbaskerville commented 3 years ago

Using 1.9.0 from PyPI.

sdaulton commented 3 years ago

Ah, never mind then. I was looking into a different issue (related to that PR). It can't be related if this is happening on 1.9.0, though.

Balandat commented 3 years ago

Let me open a pytorch issue then.