cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License
3.58k stars 562 forks

[Bug] cublas RuntimeError in fantasy update test on CUDA #834

Open Balandat opened 5 years ago

Balandat commented 5 years ago

🐛 Bug

Running test.examples.test_simple_gp_regression.TestSimpleGPRegression.test_fantasy_updates routinely results in the following cublas error: RuntimeError: cublas runtime error : an invalid numeric value was used as an argument

This only happens for the cuda test; the cpu test runs fine. Also, anecdotally, I haven't seen this happen on all runs / all types of machines, but it happens pretty consistently.

To reproduce

Run the test on a cuda machine.

Stack trace/error message

> test_fantasy_updates_cuda (test.examples.test_simple_gp_regression.TestSimpleGPRegression) ... ERROR
>
> ======================================================================
> ERROR: test_fantasy_updates_cuda (test.examples.test_simple_gp_regression.TestSimpleGPRegression)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/test/examples/test_simple_gp_regression.py", line 265, in test_fantasy_updates_cuda
>     self.test_fantasy_updates(cuda=True)
>   File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/test/examples/test_simple_gp_regression.py", line 308, in test_fantasy_updates
>     test_function_predictions.mean.sum().backward()
>   File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/torch/tensor.py", line 118, in backward
>     torch.autograd.backward(self, gradient, retain_graph, create_graph)
>   File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/torch/autograd/__init__.py", line 93, in backward
>     allow_unreachable=True)  # allow_unreachable flag
> RuntimeError: cublas runtime error : an invalid numeric value was used as an argument at caffe2/aten/src/THC/THCBlas.cu:120
>
>
> ActivityProfiler - start thread
>  ** On entry to SGER   parameter number 7 had an illegal value

Expected Behavior

There shouldn't be a difference between cpu and cuda tests.

System information

gpytorch master, pytorch master, linux

jacobrgardner commented 5 years ago

I've seen this from time to time as well. I spent a while looking into it but was never able to find a root cause because it's such a strange and low-level error.

Balandat commented 5 years ago

Maybe @vishwakftw has some ideas about this since this is happening down in cublas SGER?

vishwakftw commented 5 years ago

Argument 7 seems to be the other vector with which the outer product is taken. I’m not sure why this fails.

Have you made use of ger anywhere in the codebase?

jacobrgardner commented 5 years ago

We certainly don't explicitly use ger anywhere in the code base -- are there high level torch functions (e.g. torch.matmul) that dispatch to ger in certain input shape cases?
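One way an outer product can appear without any explicit ger call: for y = A v, the gradient of a scalar loss with respect to A is the outer product of the upstream gradient and v, which is the operation GER computes. The sketch below only checks that identity on CPU with arbitrary shapes. Whether a given PyTorch build routes this backward step through cublas SGER on CUDA is an assumption, not something confirmed in this thread.

```python
import torch

torch.manual_seed(0)

# Arbitrary illustrative shapes: A is 4x3, v has length 3.
A = torch.randn(4, 3, requires_grad=True)
v = torch.randn(3)

# Forward pass: a plain matrix-vector multiply.
y = torch.mv(A, v)

# Backpropagate an arbitrary upstream gradient g = dL/dy.
g = torch.randn(4)
y.backward(g)

# dL/dA should equal the outer product g v^T.
expected = torch.outer(g, v)
grad_matches_outer = torch.allclose(A.grad, expected)
print(grad_matches_outer)
```

So a matrix-vector multiply in the forward pass is enough to produce an outer product in the backward pass.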

rhaps0dy commented 4 years ago

I found this when implementing a RobustmaxLikelihood (pull request coming soon). It's caused by taking the gradient of a matrix-vector multiply on a GPU. Until the PyTorch devs fix it, a valid workaround is to unsqueeze the vector, thus turning it into a matrix-matrix multiply.
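A minimal sketch of that workaround, with made-up shapes and variable names (none of this is gpytorch code). The reference gradient is computed on CPU, where the thread reports no problem, and the unsqueezed matrix-matrix version runs on CUDA if it is available, falling back to CPU otherwise.

```python
import torch

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Reference: ordinary matrix-vector multiply on CPU.
A_cpu = torch.randn(5, 3, requires_grad=True)
v_cpu = torch.randn(3)
(A_cpu @ v_cpu).sum().backward()

# Workaround: unsqueeze the vector into a 3x1 matrix so the product
# becomes a matrix-matrix multiply, then squeeze the result back.
A_dev = A_cpu.detach().to(device).requires_grad_(True)
v_dev = v_cpu.to(device)
out = (A_dev @ v_dev.unsqueeze(-1)).squeeze(-1)
out.sum().backward()

# Both routes should give the same gradient with respect to the matrix.
same_grad = torch.allclose(A_dev.grad.cpu(), A_cpu.grad, atol=1e-6)
print(same_grad)
```

The result has the same shape as the matrix-vector product, so the surrounding code should not need to change.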

I propose we close the issue since it is a PyTorch bug.

Balandat commented 4 years ago

Wow, this seems pretty serious. I guess we've been lucky to mostly do batched MVMs as matrix-matrix multiplies in the code?

rhaps0dy commented 4 years ago

Us and everyone else, since nobody found this bug until now, and it must have been there for a while 😅

Balandat commented 4 years ago

It's actually a known issue that exists for legacy reasons; they are moving things over to ATen now. I'll comment on the other issue.