NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

Cuda error 304 #131

Closed tobias-kirschstein closed 9 months ago

tobias-kirschstein commented 10 months ago

On a Windows machine and using the RasterizeGLContext, I get this error calling dr.rasterize():

```
RuntimeError: Cuda error: 304[cudaGraphicsGLRegisterBuffer(&s.cudaPosBuffer, s.glPosBuffer, cudaGraphicsRegisterFlagsWriteDiscard);]
```

The RasterizeCudaContext, however, works fine.

It is impossible for me to debug this issue, as information on the web is very sparse. The only thing I could find is that error code 304 is CUDA_ERROR_OPERATING_SYSTEM, which is defined as "This indicates that an OS call failed." But which OS call failed? No clue.

This was also discussed in https://github.com/NVlabs/nvdiffrast/issues/87 but not resolved.

Any hints are highly appreciated.

s-laine commented 10 months ago

This call tends to be a catch-all for errors that occur on the OpenGL side. Both OpenGL and Cuda are by default asynchronous, i.e., most calls are just pushed into a queue and executed in the background as soon as possible. However, this call apparently forces a synchronization point, causing errors to bunch up here (and be reported with a generic error code).

Updating (or downgrading) the graphics drivers and/or the Cuda toolkit is the only simple remedy I can think of. Given that the issue is related to OpenGL, the graphics drivers are perhaps the more likely culprit here.
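As a stopgap while debugging, a program can also fall back to the Cuda rasterizer whenever GL context creation or rasterization raises. This is just an illustrative sketch; the `make_context` helper is my own wrapper, not nvdiffrast API (only `RasterizeGLContext` and `RasterizeCudaContext` are real):

```python
def make_context(primary, fallback):
    """Try the primary context factory; on RuntimeError (e.g. the
    'Cuda error: 304' above), fall back to the alternative factory."""
    try:
        return primary()
    except RuntimeError:
        return fallback()

# Intended usage (assumes `import nvdiffrast.torch as dr`):
#   ctx = make_context(dr.RasterizeGLContext, dr.RasterizeCudaContext)
```

This trades the GL path's performance characteristics for robustness, so it is a workaround rather than a fix.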

If you want to dig deeper, commenting out parts of the OpenGL code may help in narrowing down what exactly is causing the problem. In theory, peppering the code with glFinish() calls might uncover any pending errors by forcing OpenGL synchronization, but I don't know if that can be trusted.

tobias-kirschstein commented 10 months ago

Thanks for the quick response. I tried different versions of Cuda and updating the graphics driver but to no avail.

Also, I noticed that the compilation takes place in `C:\Users\$USER\AppData\Local\torch_extensions\torch_extensions\Cache\py38_cu116\nvdiffrast_plugin_gl`.

In the `build.ninja` file there, the `nvcc` field was actually pointing to a different CUDA installation on the machine (a global one) than I was expecting, rather than the conda environment's `nvcc`. I could force the compilation to use the environment's `nvcc` by setting `CUDA_HOME`, but that apparently wasn't the root of the 304 error, as I still get it :/
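For checking which CUDA installation the extension build will pick up, the lookup order can be approximated in a few lines. This is a simplified sketch of what `torch.utils.cpp_extension` does (`CUDA_HOME`, then `CUDA_PATH`, then the directory above whichever `nvcc` is first on `PATH`); the `guess_cuda_home` helper is my own, not a PyTorch function:

```python
import os
import shutil

def guess_cuda_home(env=os.environ):
    """Approximate the CUDA-toolkit lookup order used when building
    torch extensions: explicit env vars first, then nvcc on PATH."""
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if env.get(var):
            return env[var]
    nvcc = shutil.which("nvcc", path=env.get("PATH", ""))
    if nvcc:
        # nvcc lives in <cuda_home>/bin, so step up two levels.
        return os.path.dirname(os.path.dirname(nvcc))
    return None
```

Printing this before building makes it obvious whether the conda environment's toolkit or a global one will be used.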

Are there any other places where CUDA version mismatches could occur? Regarding OpenGL, I have no clue how I could have gotten wrong library versions; GPU Caps Viewer shows me `GL_VERSION: 4.6.0 NVIDIA 537.13`, which is the version from the newest NVIDIA driver and sounds about right.

s-laine commented 10 months ago

Both driver versions 536.67 and 537.17 seem to work for me. For reference, here's my complete setup:

Overall it's difficult to guess what might cause the problem. If possible, upgrading to CUDA 11.8 and PyTorch 2.0.0+cu118 would be the next thing I'd try, unless you have done so already.

We have never seen the MSVC version matter, as long as the extension compilation succeeds. Windows updates have also never changed behavior. The exact GPU model, as long as it's not an integrated one, also hasn't played a role in compatibility except in very unusual circumstances. So those factors are unlikely to explain why you get the failure.

Be sure to clear the torch extension cache completely after changing something. I think PyTorch is supposed to detect changes in the build environment and recompile accordingly, but we have seen mysterious bugs that were solved by clearing the cache.
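Clearing the cache can be scripted so it isn't forgotten between attempts. A minimal sketch, assuming the standard cache locations (`TORCH_EXTENSIONS_DIR` overrides the default; on Windows the cache lives under the `AppData\Local` path mentioned earlier, so pass that path explicitly there; the `~/.cache` default below is the Linux convention):

```python
import os
import shutil

def clear_torch_extension_cache(cache_root=None):
    """Delete the whole torch extension build cache so every plugin
    (including nvdiffrast_plugin_gl) is recompiled from scratch."""
    if cache_root is None:
        cache_root = os.environ.get(
            "TORCH_EXTENSIONS_DIR",
            os.path.join(os.path.expanduser("~"), ".cache", "torch_extensions"),
        )
    if os.path.isdir(cache_root):
        shutil.rmtree(cache_root)
    return cache_root
```

Deleting the directory is safe: it contains only build artifacts that PyTorch regenerates on the next import.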

I only tested with `nvdiffrast/samples/triangle.py --opengl`, as I assume you get the error even there. If that works for you but a more complex program fails, it would help a lot if you could supply a complete repro.