jlko opened this issue 2 years ago
I suspect this might be due to the 3080 in PyTorch using TF32 by default, which has lower precision than FP32, so the computations just aren't accurate enough. Try disabling TF32 in PyTorch: https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
Hi @Balandat,
thanks for the quick reply. Turns out, you are spot on!
Setting torch.backends.cuda.matmul.allow_tf32 = False
gives the desired behaviour: CPU and GPU results look the same.
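For completeness, here is roughly what the workaround looks like (a minimal sketch; the cuDNN flag is included only as a precaution and is probably irrelevant for GPyTorch's dense linear algebra):

```python
import torch

# Disable TF32 before any GP computations run on the GPU.
torch.backends.cuda.matmul.allow_tf32 = False  # the flag that mattered here
torch.backends.cudnn.allow_tf32 = False        # precautionary; GPyTorch does not use convolutions
```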
It might be worth adding a warning to gpytorch if this is set to True
on Ampere devices (something along the lines of the rough sketch below); it really does not work at all otherwise. (Or maybe gpytorch should just set this to False?)
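Purely as an illustration (this is not GPyTorch code, and the helper name `_warn_if_tf32` is made up), such a check could look something like this:

```python
import warnings

import torch


def _warn_if_tf32():
    # Hypothetical helper: warn if TF32 matmuls are enabled on an Ampere-or-newer GPU,
    # since TF32's reduced mantissa can badly degrade exact GP solves.
    if torch.cuda.is_available() and torch.backends.cuda.matmul.allow_tf32:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:  # Ampere and later support TF32
            warnings.warn(
                "torch.backends.cuda.matmul.allow_tf32 is True; GP predictions may be "
                "inaccurate. Consider setting it to False.",
                UserWarning,
            )
```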
I'll let you close the issue if you think this is not needed.
Thanks a lot :) Jannik
Yeah, this is a known gotcha in other contexts as well, and there is some discussion (championed by @mruberry) about potentially moving the default from TF32 to FP32 on Ampere.
Adding a warning to gpytorch is probably a good idea, though I faintly remember that @jacobrgardner looked into this and it wasn't super straightforward/elegant (though I might be misremembering this).
We'll have an update on the TF32->FP32 default soon, too.
🐛 Bug
The predictions of the GP become extremely noisy (wrong) when using the GPU acceleration. This behaviour becomes more pronounced as the likelihood noise of the GP is decreased.
This bug seems to be related to the hardware/Python versions I am running. We did not observe this behaviour on other machines with different GPUs.
The GPU is an NVIDIA GeForce RTX 3080 and I'm using PyTorch 1.11. The 3080 used to be difficult to get working with CUDA support (it needed the PyTorch nightly), but that is no longer the case. I'm not sure why this is going wrong; the 3080 has never behaved weirdly in the past (after getting CUDA to work).
Maybe my pip install gpytorch does not play nice with the conda pytorch here?
To reproduce
The script below initialises a simple exact GP model. We then sample random data, condition the GP on this data, and plot the posterior predictive distribution. When GP inference happens on the GPU, the predictions are wildly noisy and differ substantially from the CPU-based predictions, especially if the likelihood.noise is small.
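Since the original attachment is not reproduced here, this is a minimal sketch of that kind of script; the data, kernel choice, and noise value are placeholders, not necessarily the exact ones used:

```python
import torch
import gpytorch


class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


def predict(device, noise=1e-2):
    # Generate identical random data on the CPU for both runs, then move it to `device`.
    torch.manual_seed(0)
    train_x = torch.rand(100)
    train_y = torch.sin(6.0 * train_x) + 0.01 * torch.randn(100)
    test_x = torch.linspace(0, 1, 200)
    train_x, train_y, test_x = train_x.to(device), train_y.to(device), test_x.to(device)

    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    likelihood.noise = noise  # smaller values make the GPU/CPU discrepancy more pronounced
    model = ExactGPModel(train_x, train_y, likelihood).to(device)

    # No hyperparameter training: just condition on the data and predict.
    model.eval()
    likelihood.eval()
    with torch.no_grad():
        pred = likelihood(model(test_x))
    return pred.mean.cpu()


cpu_mean = predict("cpu")
gpu_mean = predict("cuda")  # visibly noisy on the 3080 unless TF32 is disabled
print((cpu_mean - gpu_mean).abs().max())
```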
Stack trace/error message
The GP posterior predictives are extremely noisy when using the CUDA device. Everything is fine on the CPU.
As the likelihood noise value increases, the differences become less noticeable.
Here are some of the plots this script creates for different noise values:
I do sometimes get the following warning when using the GPU:
Note that I did not get this warning for any of the above plots.
Expected Behavior
The predictions from the GPU and CPU should be identical. The above script is able to produce the desired behaviour on other systems.
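In terms of the sketch above, the expectation is roughly that a check like the following passes (the tolerance is a hypothetical choice for float32 arithmetic):

```python
# Using the predict() sketch from above; on an unaffected system, or with TF32
# disabled, the CPU and GPU posterior means should agree to within float32 noise.
assert torch.allclose(predict("cpu"), predict("cuda"), atol=1e-3)
```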
System information
gpytorch version: 1.6.0
PyTorch version: 1.11.0
OS: Ubuntu 20.04.4, 5.13.0-37-generic kernel
GPU: NVIDIA GeForce RTX 3080
GPU Driver Version: 510.47.03
CUDA Version: 11.6
Additional context
I've also observed errors from other gpytorch functions on this GPU when conditioning on observed data; that was an actual error (an exception) related to the Cholesky factorization rather than just noisy output. Again, it all worked fine on the CPU.
Also: Based on this post https://github.com/cornellius-gp/gpytorch/issues/728, I've tried predicting using
with gpytorch.settings.fast_computations(covar_root_decomposition=True, log_prob=True, solves=False)
or some of the other settings mentioned there, but nothing helped (a sketch of how this wraps the prediction follows below). A pip freeze and my conda env setup are attached: pip-freeze.txt environment.yaml
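Roughly, such a settings block wraps the prediction like this (a sketch, reusing `model`, `likelihood`, and `test_x` from the reproduction sketch above):

```python
import torch
import gpytorch

# Sketch only: apply the settings context from the linked issue at prediction time.
model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_computations(
    covar_root_decomposition=True, log_prob=True, solves=False
):
    pred = likelihood(model(test_x))
```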
[Edit]: Updated because I've changed PyTorch versions from 1.12.dev... to the 1.11 release. The problems appear with both setups.