fastai / fastai_dev

fast.ai early development experiments
Apache License 2.0

Colab Specific: RuntimeError: inverse_cuda: For batch 0: U(17437184,17437184) is zero, singular U. #286

Closed: muellerzr closed this issue 5 years ago

muellerzr commented 5 years ago

Just leaving this here so it's documented and other users are aware of the problem. This is currently an issue involving MAGMA, Google Colaboratory, and the transforms, especially `TensorPoint` among others. It is being worked on here: https://github.com/pytorch/pytorch/issues/29096

You can reproduce this by trying to do the Image Regression problem from the course in Google Colab.
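For context, the failing call is not fastai-specific: the point/affine transforms end up calling `torch.inverse` on a batch of matrices on the GPU, which PyTorch dispatches to MAGMA on CUDA. Below is a minimal sketch of that call, assuming a Colab GPU runtime with the affected torch build; the tensors here are illustrative, not the actual fastai code.

```python
import torch

# Batched matrix inverse on CUDA goes through MAGMA; on the broken Colab build
# this reportedly raises "inverse_cuda: ... singular U" even though the
# matrices themselves are perfectly invertible.
mats = torch.eye(3).repeat(8, 1, 1).cuda()   # batch of identity affine matrices
inv = torch.inverse(mats)                    # fails on the affected build
print(inv.shape)                             # torch.Size([8, 3, 3]) once fixed
```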

I will update when it's working again. Sylvain, feel free to close this instead of leaving it open, but I felt having it here would be easier than a flood of duplicate questions from Colab users :)

sgugger commented 5 years ago

No worries, thanks for investigating the issue!

muellerzr commented 5 years ago

The fix is done and should arrive when Colab rolls out the update; otherwise I think we can do a manual update. Will update when it's ready.

Finally resolved the issue today.

It turns out that the CMake for magma 2.5.1 added the line `set(CUDA_SEPARABLE_COMPILATION ON)`, and doing separable compilation somehow triggers some deep nvcc interaction with CentOS 7 / RHEL 7 linkers, and all of that doesn't go down very well.

After setting `CUDA_SEPARABLE_COMPILATION` to `OFF`, we can successfully build static binaries that work correctly on a K80.

We did not root-cause this bug, which would need significantly more time and effort because of its subtle nature. We do not intend to root-cause it for now, as the cost-benefit isn't very high.

The commit that fixes the issue is pytorch/builder@dd7edfd

The nightlies from tomorrow morning will carry the fix.

The fix will also naturally be part of v1.3.1, after which we will request an update from Google Colab, which will fix the originally flagged bug report.

muellerzr commented 5 years ago

This issue has been solved. I recommend requiring torch 1.3.1, as that release contains the fix.
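Until Colab ships the updated runtime, one possible way to pick up the fix manually is to pin torch in a Colab cell. The torchvision pin below is an assumption (0.4.2 is, to my knowledge, the release paired with torch 1.3.1), so adjust it if needed:

```python
# Run in a Colab cell; restart the runtime afterwards so the new torch is loaded.
!pip install "torch==1.3.1" "torchvision==0.4.2"

import torch
print(torch.__version__)  # should report 1.3.1
```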