Closed: tobias-kirschstein closed this issue 9 months ago
This call tends to be a catch-all for errors that occur on the OpenGL side. Both OpenGL and CUDA are asynchronous by default, i.e., most calls are just pushed into a queue and executed in the background as soon as possible. However, this call apparently forces a synchronization point, causing errors to bunch up here (and be reported with a generic error code).
Updating (or downgrading) the graphics drivers and/or the CUDA toolkit are the only simple remedies I can think of. Given that the issue is related to OpenGL, the graphics drivers are perhaps the more likely culprit here.
If you want to dig deeper, commenting out parts of the OpenGL code may help in narrowing down what exactly is causing the problem. In theory, peppering the code with `glFinish()` calls might uncover any pending errors by forcing OpenGL synchronization, but I don't know if that can be trusted.
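To illustrate why the error surfaces at this one call rather than where it originated, here is a toy Python model of an asynchronous command queue (plain Python, not actual OpenGL or CUDA code; the class and method names are invented for the sketch). Failures only become visible when a sync point drains the queue, which is exactly the "errors bunch up here" behavior described above:

```python
# Toy model: commands are queued and executed lazily, and failures only
# surface when a sync point drains the queue -- loosely mirroring how
# glFinish()/cudaDeviceSynchronize() cause errors to bunch up.
class AsyncQueue:
    def __init__(self):
        self._pending = []   # queued (name, fn) pairs, not yet run
        self._error = None   # first error seen, reported only at sync

    def submit(self, name, fn):
        """Enqueue a command; returns immediately, like an async GL call."""
        self._pending.append((name, fn))

    def synchronize(self):
        """Drain the queue; report the first queued failure here."""
        for name, fn in self._pending:
            try:
                fn()
            except Exception as exc:
                if self._error is None:
                    self._error = f"{name}: {exc}"
        self._pending.clear()
        if self._error is not None:
            error, self._error = self._error, None
            raise RuntimeError(error)

q = AsyncQueue()
q.submit("draw_triangle", lambda: None)   # fine
q.submit("bad_upload", lambda: 1 / 0)     # fails, but not reported yet
q.submit("present", lambda: None)         # also fine
# The failure only appears at the sync point, far from bad_upload itself:
try:
    q.synchronize()
except RuntimeError as e:
    print(e)  # bad_upload: division by zero
```

Inserting more sync points between commands (the `glFinish()` idea above) shrinks the window in which the failing command can hide, which is the whole point of that debugging technique.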
Thanks for the quick response. I tried different versions of Cuda and updating the graphics driver but to no avail.
Also, I noticed that the compilation takes place in `C:\Users\$USER\AppData\Local\torch_extensions\torch_extensions\Cache\py38_cu116\nvdiffrast_plugin_gl`. In the `build.ninja` file, the `nvcc` field was actually pointing to a different CUDA installation on the machine (a global one) than I was expecting (the nvcc of the conda environment). I could force the compilation process to use the environment's nvcc instead by setting `CUDA_HOME`, but that apparently wasn't the root of the `304` error, as I still get it :/
Are there any other places where CUDA version mismatches could occur? Regarding OpenGL, I have no clue how I could have gotten wrong library versions; GPU Caps Viewer shows me `GL_VERSION: 4.6.0 NVIDIA 537.13`, which is the version from the newest NVIDIA driver and sounds about right.
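One concrete place a version mismatch can hide is between the CUDA version PyTorch was built against (`torch.version.cuda`) and the toolkit whose `nvcc` compiles the extension. Here is a small sketch that checks the two agree; the helper names are invented, and the banner string is the format real `nvcc --version` output uses:

```python
import re

def parse_nvcc_version(banner: str):
    """Extract 'major.minor' from `nvcc --version` output.

    The banner contains a line like
    'Cuda compilation tools, release 11.6, V11.6.124'.
    Returns None if no release string is found.
    """
    m = re.search(r"release (\d+)\.(\d+)", banner)
    return f"{m.group(1)}.{m.group(2)}" if m else None

def versions_match(torch_cuda: str, nvcc_banner: str) -> bool:
    """Compare torch.version.cuda (e.g. '11.6') against the local nvcc.

    A mismatch means the extension gets compiled with a different
    toolkit than the one PyTorch itself was built against.
    """
    nvcc = parse_nvcc_version(nvcc_banner)
    return nvcc is not None and torch_cuda.split(".")[:2] == nvcc.split(".")[:2]

banner = "Cuda compilation tools, release 11.6, V11.6.124"
print(versions_match("11.6", banner))  # True
print(versions_match("11.8", banner))  # False
```

In practice you would feed it `torch.version.cuda` and the output of `subprocess.check_output(["nvcc", "--version"], text=True)`; a `False` result points at exactly the kind of `CUDA_HOME` issue described above.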
Both driver versions 536.67 and 537.17 seem to work for me. For reference, here's my complete setup:
Overall it's difficult to guess what might cause the problem. If possible, upgrading to CUDA 11.8 and PyTorch 2.0.0+cu118 would be the next thing I'd try, unless you have done so already.
We have never seen the MSVC version matter, as long as the extension compilation succeeds. Windows updates have also never changed behavior. The exact GPU model, as long as it's not an integrated one, also hasn't played a role in compatibility except in very unusual circumstances. So those factors are unlikely to explain why you get the failure.
Be sure to clear the torch extension cache completely after changing something. I think PyTorch is supposed to detect changes in the build environment and recompile accordingly, but we have seen mysterious bugs that were solved by clearing the cache.
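A minimal sketch of clearing the cache programmatically, assuming the default Windows location mentioned earlier in this thread (the function name is invented, and the default path is an assumption — verify it on your machine before deleting anything):

```python
import os
import shutil

def clear_torch_extension_cache(cache_root=None):
    """Delete the torch extension build cache so the next import
    recompiles everything from scratch.

    cache_root defaults to the usual Windows location; on Linux the
    cache typically lives under ~/.cache/torch_extensions instead
    (assumption: default PyTorch layout). Returns True if a cache
    directory was found and removed.
    """
    if cache_root is None:
        cache_root = os.path.join(os.path.expanduser("~"),
                                  "AppData", "Local", "torch_extensions")
    if os.path.isdir(cache_root):
        shutil.rmtree(cache_root)
        return True
    return False
```

Deleting the directory wholesale is the blunt but reliable option; it sidesteps any stale-detection logic in PyTorch's extension builder entirely.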
I only tested with `nvdiffrast/samples/triangle.py --opengl`, as I assume you get the error even there. If that works for you but a more complex program fails, it would help a lot if you could supply a complete repro.
On a Windows machine, using the `RasterizeGLContext`, I get this error calling `dr.rasterize()`:

The `RasterizeCudaContext`, however, works fine.

It is impossible for me to debug this issue, as information on the web is very sparse. The only thing I could find is that error code `304` is a `CUDA_ERROR_OPERATING_SYSTEM`, which is defined as "This indicates that an OS call failed." But which OS call failed? No clue.

This was also discussed in https://github.com/NVlabs/nvdiffrast/issues/87 but not resolved.
Any hints are highly appreciated.