npbaskerville opened this issue 3 years ago
This is on the pytorch side --- maybe open an issue for them?
from gpytorch.kernels import RBFKernel
import torch

y = torch.randn(100, 1).log()  # log of negative draws produces NaNs
kernel = RBFKernel()
resp = kernel(y).evaluate().detach()  # 100 x 100 kernel matrix containing NaNs
Obviously it fails on the CPU:
resp.cholesky()
RuntimeError Traceback (most recent call last)
<ipython-input-7-032bb74835fa> in <module>
----> 1 resp.cholesky()
RuntimeError: cholesky: U(1,1) is zero, singular U.
But it runs without error on the GPU:
resp.to(torch.device("cuda:7")).cholesky()
# okay
tensor([[nan, 0., 0., ..., 0., 0., 0.],
[nan, nan, 0., ..., 0., 0., 0.],
[nan, nan, nan, ..., 0., 0., 0.],
...,
[nan, nan, nan, ..., nan, 0., 0.],
[nan, nan, nan, ..., nan, nan, 0.],
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:7')
Hmm, so @wjmaddox the code path above is actually going through torch.linalg.cholesky_ex via psd_safe_cholesky. It looks like on recent torch versions cholesky_ex doesn't properly surface the info code on CUDA?
>> torch.linalg.cholesky_ex(resp)
torch.return_types.linalg_cholesky_ex(
L=tensor([[nan, 0., 0., ..., 0., 0., 0.],
[nan, nan, 0., ..., 0., 0., 0.],
[nan, nan, nan, ..., 0., 0., 0.],
...,
[nan, nan, nan, ..., nan, 0., 0.],
[nan, nan, nan, ..., nan, nan, 0.],
[nan, nan, nan, ..., nan, nan, nan]]),
info=tensor(1, dtype=torch.int32))
>> torch.linalg.cholesky_ex(resp.to(torch.device("cuda")))
torch.return_types.linalg_cholesky_ex(
L=tensor([[nan, 0., 0., ..., 0., 0., 0.],
[nan, nan, 0., ..., 0., 0., 0.],
[nan, nan, nan, ..., 0., 0., 0.],
...,
[nan, nan, nan, ..., nan, 0., 0.],
[nan, nan, nan, ..., nan, nan, 0.],
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0'),
info=tensor(0, device='cuda:0', dtype=torch.int32))
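For reference, a minimal sketch of the kind of info-based check that psd_safe_cholesky relies on (not gpytorch's actual implementation; checked_cholesky is a made-up name), showing why the CPU failure is caught while the CUDA one slips through:

import torch

def checked_cholesky(A):
    # Factorize and inspect the returned info code instead of catching exceptions.
    L, info = torch.linalg.cholesky_ex(A)
    # For the NaN matrix above, CPU reports info=1 and this raises,
    # but CUDA reports info=0, so the all-NaN factor is returned silently.
    if torch.any(info != 0):
        raise RuntimeError(f"cholesky_ex failed with info={info}")
    return L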
@mruberry any thoughts why this may happen?
cc @dme65, @sdaulton re some cholesky errors potentially not getting caught anymore...
https://github.com/pytorch/pytorch/pull/63864 looks related
Does it? Seems like that commit is too new for this? @npbaskerville are you using the 1.9.0 release or the latest nightly / a custom build?
Using 1.9.0 from PyPI.
Ah, never mind then. I was looking into a different issue (related to this PR). This issue can't be related if this is happening on 1.9.0 though.
Let me open a pytorch issue then.
🐛 Bug
There are cases in which code run on CPU will throw a NanError while the same code run on GPU will throw no error but produce NaNs, e.g. in the training loss.

To reproduce
On CPU
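A sketch of the reproduction, reusing the snippet from the top of this thread (the exact code block from the original report is not reproduced here):

from gpytorch.kernels import RBFKernel
import torch

y = torch.randn(100, 1).log()             # log of negative draws produces NaNs
kernel = RBFKernel()
resp = kernel(y).evaluate().detach()      # kernel matrix containing NaNs
resp.cholesky()                           # raises on CPU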
but on GPU:
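And the corresponding GPU call (again a sketch of the same snippet), which raises nothing and returns an all-NaN factor:

resp.to(torch.device("cuda")).cholesky()  # no error; result is full of NaNs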
Expected Behavior
This is my query. Is the above now expected or is it a bug? I would have thought it's best to have consistency between the CPU and GPU. The difference appears to arise from torch itself. Using version 1.8.1, one gets the NanError in both cases, but using 1.9.0, one gets the behaviour above. Related to https://github.com/pytorch/pytorch/issues/1810?

System information
GPyTorch version: 1.5.1
PyTorch version: 1.9.0 (+cu102)
OS: Google Colab notebook