NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

Segmentation fault in RasterizeGLContext() #76

Closed. djsamseng closed this issue 2 years ago

djsamseng commented 2 years ago

Reproduction:

python3 samples/torch/triangle.py

Digging into this, the crash occurs on this line. It first happens at glBindBuffer, but even if I comment those calls out, it crashes on later ones.

Environment:

  - Ubuntu 20.04.4
  - NVIDIA RTX 3090
  - CUDA toolkit 11.4
  - Python 3.9
  - PyTorch 1.11 (from conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch)

This seems to be a supported configuration according to the nvdiffrec installation steps?

glxinfo | grep "OpenGL version"
OpenGL version string: 4.6.0 NVIDIA 470.129.06

Any ideas why this is crashing? Thanks!

s-laine commented 2 years ago

If the crashing line is correct, this is a bit of a mystery. The issue isn't that the extension function isn't found -- that would not cause a segmentation fault. Instead, it looks like the locally defined variables that store the function pointers are invalid, which should not be possible, as they are defined right there in the same file, glutil.cpp lines 19-21.

The one thing I can think of is that your OpenGL headers are somehow nonstandard and define GLAPIENTRY in a way that breaks the definitions of the variables that hold the function pointers. Could it be that there is some graphics driver package you haven't installed that would provide the current OpenGL headers?
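
For context, the pattern in question looks roughly like this (an illustrative sketch with a hypothetical variable name, not the exact glutil.cpp code):

// Minimal sketch of the extension-pointer pattern (hypothetical name
// my_glBindBuffer; the real code defines one variable per extension function).
#include <EGL/egl.h>
#include <GL/gl.h>

// GLAPIENTRY normally comes from the system OpenGL headers; a nonstandard
// header could define it in a way that changes what this declaration means.
#ifndef GLAPIENTRY
#define GLAPIENTRY
#endif

// A plain function-pointer variable defined in this translation unit.
static void (GLAPIENTRY* my_glBindBuffer)(GLenum target, GLuint buffer) = 0;

static void initExtensions(void)
{
    // Resolve the entry point at runtime and write it into the local
    // variable; that variable must therefore be valid and writable.
    my_glBindBuffer = (void (GLAPIENTRY*)(GLenum, GLuint))
        eglGetProcAddress("glBindBuffer");
}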

The compiler warnings might give more hints. To see those, set verbose=True on this line, clear the torch C++ extension cache, and run again.

djsamseng commented 2 years ago

Update

I created a test C++ program invoking the same code, compiled it, and it ran correctly:

c++ -isystem /usr/local/cuda-11.4/include -fPIC -std=c++14 -c /home/samuel/dev/nvdiffrast/nvdiffrast/common/test.cpp -o test.o
c++ test.o -lGL -lEGL -L/home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda-11.4/lib64 -lcudart -o test.exe
./test.exe
Before ==== 
Did set
Did set
...
End ====
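
(For anyone who wants to try the same check, a minimal standalone sketch along these lines should behave similarly; this is not the exact test.cpp above, just the idea.)

// Hypothetical stand-in for test.cpp: resolve a few GL entry points through
// eglGetProcAddress and report whether each pointer was obtained.
// Build roughly like: c++ -std=c++14 test.cpp -lEGL -o test.exe
#include <EGL/egl.h>
#include <cstdio>

int main()
{
    printf("Before ==== \n");
    const char* names[] = { "glBindBuffer", "glBufferData", "glGenBuffers" };
    for (const char* name : names)
    {
        void (*fn)(void) = eglGetProcAddress(name);
        printf(fn ? "Did set\n" : "Did NOT set\n");
    }
    printf("End ====\n");
    return 0;
}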

Installing nvdiffrec outside of the anaconda environment (using the system default Python 3.8) via pip3 install "git+https://github.com/NVlabs/nvdiffrast" produces the same crash, "Segmentation fault (core dumped)". My next guess was that torch built against cudatoolkit=11.3 isn't playing nicely with my installed CUDA toolkit 11.4, but running with CUDA_HOME=/usr/local/cuda-11.3 python3 samples/torch/triangle.py didn't help either.

The command line args for compiling nvdiffrast_plugin were:

c++ -MMD -MF common.o.d -DTORCH_EXTENSION_NAME=nvdiffrast_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/include -isystem /home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/include/TH -isystem /home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.4/include -isystem /home/samuel/anaconda3/envs/39/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -DNVDR_TORCH -c /home/samuel/dev/nvdiffrast/nvdiffrast/common/common.cpp -o common.o

...

[14/14] c++ common.o glutil.o rasterize.cuda.o rasterize.o interpolate.cuda.o texture.cuda.o texture.o antialias.cuda.o torch_bindings.o torch_rasterize.o torch_interpolate.o torch_texture.o torch_antialias.o -shared -lGL -lEGL -L/home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda-11.4/lib64 -lcudart -o nvdiffrast_plugin.so

s-laine commented 2 years ago

Have you cleared torch's extension cache between runs? I don't know if ninja detects compiler changes, so your test with toolkit 11.3 might have used a previously compiled version of the extension.

The extension cache directory is easiest to locate by seeing what torch.utils.cpp_extension._get_build_directory('nvdiffrast_plugin', False) returns.

To always see what happens during the compilation, you can do the verbose=True thing in my previous comment.

s-laine commented 2 years ago

@djsamseng Did you end up finding a solution to this? I'm interested because it's such a strange bug.

djsamseng commented 2 years ago

Unfortunately no. After reinstalling the CUDA drivers and still having the issue, I eventually called it quits. My best guess is that it has to do with how I originally installed the CUDA dependencies when building OpenMPI with CUDA support (https://github.com/djsamseng/CudaAwareMPINumba#installation), but I didn't really want to clean out my Ubuntu install to test that theory.

djsamseng commented 2 years ago

Likely a local environment issue - feel free to reopen if others end up running into this too

djsamseng commented 2 years ago

@s-laine well this got me curious again and I found out what the issue was! Thanks for your help on this (clearing the torch cache was critical).

I had this in my .bashrc from installing MuJoCo, where it was recommended as a fix for a different issue. Unsetting it fixed the crash.

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libGLEW.so

s-laine commented 2 years ago

Aha, that certainly explains it. As it happens, nvdiffrast got rid of GLEW a few releases ago because of various compatibility issues, so an old enough version might have actually worked in this environment.

Thanks for going the extra mile and finding the root cause! For the next release I'll consider adding a check in the code that shows a warning if libGLEW.so is loaded.
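
Such a check could look roughly like the sketch below (the exact soname to probe and the message are assumptions, not the final implementation):

// Warn if libGLEW is already resident in the process, e.g. via LD_PRELOAD.
// dlopen with RTLD_NOLOAD only returns a handle if the library is already
// loaded; it never loads anything new. Link with -ldl on older glibc.
#include <dlfcn.h>
#include <cstdio>

static void warnIfGlewLoaded(void)
{
    void* handle = dlopen("libGLEW.so", RTLD_NOLOAD | RTLD_LAZY);
    if (handle)
    {
        fprintf(stderr, "Warning: libGLEW.so is loaded in this process; it may "
                        "clash with nvdiffrast's OpenGL function pointers.\n");
        dlclose(handle); // Drop the extra reference taken by dlopen().
    }
}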

jiegec commented 1 year ago

I just encountered this issue and found that the problem lies in a symbol name collision on glBindBuffer between nvdiffrast_plugin.so and libGL.so.1. The line *pfn = result expects the symbol to resolve to the function-pointer variable defined in nvdiffrast, whereas in some cases (e.g. with libGLEW.so preloaded) it resolves to the actual function in libGL.so.1. A SIGSEGV then occurs because that function's code is not writable.

I propose two possible fixes:

  1. Rename glBindBuffer to something like nvdiffrast_glBindBuffer, and then #define glBindBuffer nvdiffrast_glBindBuffer. This is the same approach glad uses (see the sketch after this list).
  2. Add __attribute__ ((visibility ("hidden"))) to the function pointers as suggested by https://stackoverflow.com/questions/6538501/linking-two-shared-libraries-with-some-of-the-same-symbols:
// Create the function pointers.
#define GLUTIL_EXT(return_type, name, ...) return_type (GLAPIENTRY* name)(__VA_ARGS__) __attribute__ ((visibility ("hidden"))) = 0;
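
For illustration, fix 1 under the same macro scheme might look roughly like this (hypothetical names, not an actual patch against the nvdiffrast sources):

// Define the pointer under a prefixed name so it cannot collide with the
// glBindBuffer symbol exported by libGL.so.1 / libGLEW.so.
#define GLUTIL_EXT(return_type, name, ...) return_type (GLAPIENTRY* nvdiffrast_##name)(__VA_ARGS__) = 0;
// After the pointer definitions, remap the plain GL names onto the prefixed
// variables so call sites stay unchanged (this is what glad does):
#define glBindBuffer nvdiffrast_glBindBuffer
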
s-laine commented 1 year ago

Hi @jiegec, thanks for the note. I'm aware of the name clash and have considered doing fix 1 at some point (it's on my "maybe-to-do" list). However, I didn't know about fix 2, which seems like a much easier solution. Have you confirmed that it solves the name clash issue? I don't have a test setup with a name collision at hand, but if the hidden attribute works (and doesn't break anything), I'll go ahead and add it in the next release.

jiegec commented 1 year ago

Have you confirmed that it solves the name clash issue?

Yes. I can confirm that in my environment it crashes without the fix, and the fix resolves the issue.

s-laine commented 1 year ago

Excellent, I'll include this in the next release. Thanks!