Closed djsamseng closed 2 years ago
If the crashing line is correct, this is a bit of a mystery. The issue isn't that the extension function isn't found -- that would not cause a segmentation fault. Instead, it looks like the locally defined variables that store the function pointers are invalid, which should not be possible, as they are defined right there in the same file, glutil.cpp lines 19-21.
The one thing I can think of is that your OpenGL headers are somehow nonstandard and define GLAPIENTRY in a way that breaks the definitions of the variables that will contain the function pointers. Could it be that there is some graphics driver package that you haven't installed, one that would include the current OpenGL headers?
The compiler warnings might give more hints. To see those, set verbose=True on this line, clear the torch C++ extension cache, and run again.
I created a test C++ program that invokes the same code, compiled it, and it ran correctly:
c++ -isystem /usr/local/cuda-11.4/include -fPIC -std=c++14 -c /home/samuel/dev/nvdiffrast/nvdiffrast/common/test.cpp -o test.o
c++ test.o -lGL -lEGL -L/home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda-11.4/lib64 -lcudart -o test.exe
./test.exe
Before ====
Did set
Did set
...
End ====
Installing nvdiffrast outside of the Anaconda environment (using the system default Python 3.8) via pip3 install "git+https://github.com/NVlabs/nvdiffrast" produces the same crash: "Segmentation fault (core dumped)". My next guess is that torch with cudatoolkit=11.3 isn't playing nice with my installed CUDA toolkit 11.4, but no dice with CUDA_HOME=/usr/local/cuda-11.3 python3 samples/torch/triangle.py either.
The command line arguments for compiling nvdiffrast_plugin were:
c++ -MMD -MF common.o.d -DTORCH_EXTENSION_NAME=nvdiffrast_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/include -isystem /home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/include/TH -isystem /home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.4/include -isystem /home/samuel/anaconda3/envs/39/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -DNVDR_TORCH -c /home/samuel/dev/nvdiffrast/nvdiffrast/common/common.cpp -o common.o
...
[14/14] c++ common.o glutil.o rasterize.cuda.o rasterize.o interpolate.cuda.o texture.cuda.o texture.o antialias.cuda.o torch_bindings.o torch_rasterize.o torch_interpolate.o torch_texture.o torch_antialias.o -shared -lGL -lEGL -L/home/samuel/anaconda3/envs/39/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda-11.4/lib64 -lcudart -o nvdiffrast_plugin.so
Have you cleared torch's extension cache between runs? I don't know if ninja detects compiler changes, so your test with toolkit 11.3 might have used a previously compiled version of the extension.
The extension cache directory is easiest to locate by seeing what torch.utils.cpp_extension._get_build_directory('nvdiffrast_plugin', False) returns.
To always see what happens during the compilation, you can do the verbose=True thing in my previous comment.
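For anyone following along, the cache-clearing step can be sketched in Python. This is only a minimal sketch: the helper names nvdiffrast_build_dir and clear_build_dir are hypothetical and not part of torch or nvdiffrast, and _get_build_directory is a private torch function (the one named above), so its behavior may change between torch versions.

```python
import shutil
from pathlib import Path


def nvdiffrast_build_dir(name="nvdiffrast_plugin"):
    """Return the directory where torch caches the compiled extension.

    Hypothetical helper; it relies on the private torch function
    _get_build_directory, which may change between torch versions.
    """
    # Imported lazily so clear_build_dir() can be used without torch installed.
    from torch.utils import cpp_extension
    return Path(cpp_extension._get_build_directory(name, False))


def clear_build_dir(build_dir):
    """Delete a cached build directory; return True if something was removed."""
    build_dir = Path(build_dir)
    if not build_dir.is_dir():
        return False
    # Remove object files, the ninja build file, and the compiled .so together
    # so the next import triggers a full rebuild.
    shutil.rmtree(build_dir)
    return True
```

Calling clear_build_dir(nvdiffrast_build_dir()) before re-running the sample should force a fresh compile, so a change of compiler or CUDA toolkit is not masked by a previously built plugin.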
@djsamseng Did you end up finding a solution to this? I'm interested because it's such a strange bug.
Unfortunately no. After reinstalling the CUDA drivers and still having the issue, I eventually called it quits. My best guess is that it has to do with how I originally installed the CUDA dependencies when building OpenMPI with CUDA support (https://github.com/djsamseng/CudaAwareMPINumba#installation), but I didn't really want to clean out my Ubuntu install to test that theory.
Likely a local environment issue - feel free to reopen if others end up running into this too
@s-laine well this got me curious again and I found out what the issue was! Thanks for your help on this (clearing the torch cache was critical).
I had the following line in my .bashrc from installing MuJoCo, as recommended to fix this issue. Unsetting it fixed the crash:
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libGLEW.so
Aha, that certainly explains it. As it happens, nvdiffrast got rid of GLEW a few releases ago because of various compatibility issues, so an old enough version might have actually worked in this environment.
Thanks for going the extra mile and finding the root cause! For the next release I'll consider adding a check in the code that shows a warning if libGLEW.so is loaded.
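Until such a check exists in the library, a user-side check along these lines may help spot the problem before importing nvdiffrast. This is a rough sketch and not the check nvdiffrast ships: all function names here are hypothetical, and reading /proc/self/maps assumes Linux.

```python
import os
import re
from pathlib import Path


def preloaded_libs(ld_preload):
    """Split an LD_PRELOAD value; entries may be separated by colons or spaces."""
    return [p for p in re.split(r"[:\s]+", ld_preload) if p]


def mapped_libs(maps_text):
    """Extract file paths from the text of /proc/<pid>/maps."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].startswith("/"):
            paths.add(fields[5])
    return paths


def glew_libs(paths):
    """Return the sorted paths whose file name looks like a GLEW library."""
    return sorted(p for p in paths if os.path.basename(p).startswith("libGLEW"))


def check_glew():
    """Return GLEW libraries that are preloaded or already mapped into this process."""
    found = set(preloaded_libs(os.environ.get("LD_PRELOAD", "")))
    maps = Path("/proc/self/maps")
    if maps.exists():  # Linux only; skipped elsewhere
        found |= mapped_libs(maps.read_text())
    return glew_libs(found)
```

If check_glew() returns a non-empty list, unsetting LD_PRELOAD (as in the .bashrc fix above) and clearing the torch extension cache would be the first things to try.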
I just encountered the issue and found that the problem lies in the symbol name collision of glBindBuffer between nvdiffrast_plugin.so and libGL.so.1. The line *pfn = results expects the symbol to point to the function pointer in nvdiffrast, whereas in some cases (e.g. libGLEW.so preloaded), it might point to the function in libGL.so.1. Thus a SIGSEGV occurred because the function is not writable.
I propose two possible fixes:
1. Rename glBindBuffer to something like nvdiffrast_glBindBuffer, and then #define glBindBuffer nvdiffrast_glBindBuffer. This is the same approach that glad uses.
2. Add __attribute__ ((visibility ("hidden"))) to the function pointers, as suggested by https://stackoverflow.com/questions/6538501/linking-two-shared-libraries-with-some-of-the-same-symbols:
// Create the function pointers.
#define GLUTIL_EXT(return_type, name, ...) return_type (GLAPIENTRY* name)(__VA_ARGS__) __attribute__ ((visibility ("hidden"))) = 0;
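To see whether an environment is exposed to this name clash, one rough diagnostic is to ask the dynamic loader whether glBindBuffer already resolves in the process-wide namespace before the plugin is loaded. The sketch below uses ctypes; symbol_visible is a hypothetical helper name, and it assumes a Linux or other POSIX system where ctypes.CDLL(None) corresponds to dlopen(NULL).

```python
import ctypes


def symbol_visible(name):
    """Return True if `name` resolves in the process-wide symbol namespace."""
    # CDLL(None) corresponds to dlopen(NULL): lookups search the main program,
    # the libraries loaded at startup (including LD_PRELOAD entries and their
    # dependencies), and libraries later opened with RTLD_GLOBAL.
    process = ctypes.CDLL(None)
    # ctypes raises AttributeError when dlsym() cannot find the symbol,
    # which hasattr() turns into False.
    return hasattr(process, name)
```

If symbol_visible("glBindBuffer") is already True in a fresh Python process, a library exporting that name has been loaded globally, and an exported variable of the same name in the plugin may be resolved to it instead. This is only a heuristic for spotting the condition; it does not confirm that a crash will occur.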
Hi @jiegec, thanks for the note. I'm aware of the name clash and considered doing fix 1 at some point (it's on my "maybe-to-do" list). However, I didn't know about fix 2 that seems like a much easier solution. Have you confirmed that it solves the name clash issue? I don't have a test setup with name collision at hand, but if the hidden attribute works (and doesn't break anything), I'll go ahead and add it in the next release.
> Have you confirmed that it solves the name clash issue?
Yes, and I confirm that in my environment, it crashes without the fix and the fix solves the issue.
Excellent, I'll include this in the next release. Thanks!
Reproduction:
Digging into this, the crash occurs on this line. It first happens at glBindBuffer, but even if I comment those calls out, it crashes on later ones.
Environment: Ubuntu 20.04.4, NVIDIA RTX 3090, CUDA toolkit 11.4, Python 3.9, PyTorch 1.11 (from conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch), which seems to be supported according to the nvdiffrec installation steps?
Any ideas why this is crashing? Thanks!