Balandat opened this issue 5 years ago
I've seen this from time to time as well. I spent a while looking into it, but was never able to find a root cause because it's such a strange, low-level error.
Maybe @vishwakftw has some ideas about this, since it's happening down in cuBLAS SGER? Argument 7 seems to be the other vector with which the outer product is taken (in the cublasSger parameter list that would be the second input vector, y). I'm not sure why this fails.
Have you made use of `ger` anywhere in the codebase?
We certainly don't explicitly use `ger` anywhere in the codebase -- are there high-level torch functions (e.g. `torch.matmul`) that dispatch to `ger` for certain input shapes?
I found this when implementing a RobustmaxLikelihood (pull request coming soon). It's caused by taking the gradient of a matrix-vector multiply on a GPU. Until the PyTorch devs fix it, a valid workaround is to unsqueeze the vector, thus turning it into a matrix-matrix multiply; see the sketch below.
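As a concrete illustration of that workaround, here is a minimal sketch (not taken from the upcoming pull request; the tensor names and shapes are illustrative only):

```python
import torch

A = torch.randn(5, 3, device="cuda", requires_grad=True)
v = torch.randn(3, device="cuda")

# Failing pattern: the backward of a matrix-vector multiply forms the outer
# product of grad_out and v, which is the call that reaches cuBLAS SGER.
#   out = A @ v
#   out.sum().backward()

# Workaround: unsqueeze the vector so the forward pass is a matrix-matrix
# multiply, whose backward goes through GEMM instead of GER.
out = (A @ v.unsqueeze(-1)).squeeze(-1)
out.sum().backward()
print(A.grad.shape)  # torch.Size([5, 3])
```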
I propose we close this issue, since it is a PyTorch bug.
Wow, this seems pretty serious. I guess we've been lucky to mostly do batched MVMs as matrix-matrix multiplies in the code?
Us and everyone else, since nobody found this bug until now, and it must have been there for a while 😅
It's actually known and due to legacy code; they are moving things over to ATen now. I'll comment on the other issue.
🐛 Bug
Running `test.examples.test_simple_gp_regression.TestSimpleGPRegression.test_fantasy_updates` routinely results in the following cuBLAS error:
`RuntimeError: cublas runtime error : an invalid numeric value was used as an argument`
This only happens for the CUDA test; the CPU test runs fine. Also, anecdotally, I haven't seen this happen on all runs / all types of machines, but it happens pretty consistently.
To reproduce
Run the test on a CUDA machine.
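Based on the root cause identified in the comments above (backpropagating through a matrix-vector multiply on CUDA), a hypothetical standalone reproduction, not part of the original report, would look something like this:

```python
import torch

A = torch.randn(5, 3, device="cuda", requires_grad=True)
v = torch.randn(3, device="cuda")

# The gradient of a matrix-vector multiply w.r.t. the matrix is an outer
# product, which dispatches to cuBLAS SGER and intermittently raises
# "cublas runtime error : an invalid numeric value was used as an argument".
out = A @ v
out.sum().backward()
```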
Stack trace/error message
Expected Behavior
There shouldn't be a difference between the CPU and CUDA tests.
System information
GPyTorch master, PyTorch master, Linux