Closed by rpapallas 10 months ago
I used a 4090 with Ubuntu 18.04 and a 3090 with Ubuntu 20.04 without any issues. I think your problem is more related to nvdiffrast.
https://github.com/NVlabs/nvdiffrast
I am currently traveling, so I cannot try the code on other configurations. I found this issue: https://github.com/NVlabs/nvdiffrast/issues/131. I don't think I used CUDA 12.2, but rather some 11.x version. Let me know if you try CUDA 11.x and it solves your problem. I remember reading that nvdiffrast can run without the OpenGL backend, but I have not explored it; this could be another solution.
https://github.com/NVlabs/nvdiffrast/blob/main/samples/torch/pose.py#L164
This could be a simple fix when the context is created. Hope this helps. Once I get home I will push an OpenGL config.
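The "simple fix when the context is created" might look like the sketch below. It assumes nvdiffrast's two rasterizer contexts, `RasterizeGLContext` (OpenGL) and `RasterizeCudaContext` (CUDA only), from `nvdiffrast.torch`. The `use_opengl` flag and both function names are hypothetical and not part of diff-dope.

```python
def context_class_name(use_opengl: bool) -> str:
    """Return the name of the nvdiffrast context class to instantiate."""
    return "RasterizeGLContext" if use_opengl else "RasterizeCudaContext"


def make_raster_context(use_opengl: bool = False):
    """Create a rasterizer context, defaulting to the CUDA-only backend.

    nvdiffrast is imported lazily so that the helper above can be used
    on machines where nvdiffrast is not installed.
    """
    import nvdiffrast.torch as dr  # requires nvdiffrast and a CUDA GPU

    return getattr(dr, context_class_name(use_opengl))()
```

The returned object would then be passed as the first argument to `dr.rasterize`, in place of the OpenGL context.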
Hi @TontonTremblay,
Thank you for getting back to me. Interestingly, when installing CUDA 11.8 I get the following error:
Traceback (most recent call last):
File "/home/rafael/diff-dope/examples/simple_scene.py", line 14, in main
ddope = dd.DiffDope(cfg=cfg)
File "<string>", line 9, in __init__
File "/home/rafael/diff-dope/diffdope/diffdope.py", line 1312, in __post_init__
self.object3d = Object3D(**self.cfg.object3d)
File "/home/rafael/diff-dope/diffdope/diffdope.py", line 980, in __init__
self.set_pose(
File "/home/rafael/diff-dope/diffdope/diffdope.py", line 1036, in set_pose
self.mesh.cuda()
File "/home/rafael/diff-dope/diffdope/diffdope.py", line 913, in cuda
vars(self)[key] = vars(self)[key].cuda()
File "/home/rafael/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
torch._C._cuda_init()
RuntimeError: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
It seems the PyTorch build in use requires a more recent version of CUDA than the driver supports?
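The `11080` in the message is the CUDA version that the installed driver supports, encoded as `1000 * major + 10 * minor`, so 11.8. The error appears when the PyTorch build expects a newer CUDA runtime than that. A small sketch of the comparison follows; the build version "12.1" is an assumption taken from the `cu121` tag in the extension cache path in the later log, the function names are illustrative, and CUDA minor-version compatibility can relax the rule within one major release.

```python
from typing import Tuple


def decode_cuda_version(encoded: int) -> Tuple[int, int]:
    """Decode CUDA's integer version (1000 * major + 10 * minor)."""
    major, rest = divmod(encoded, 1000)
    return major, rest // 10


def driver_supports(driver_encoded: int, build_version: str) -> bool:
    """Rough check: driver CUDA version >= the version PyTorch was built for."""
    build = tuple(int(part) for part in build_version.split(".")[:2])
    return decode_cuda_version(driver_encoded) >= build


print(decode_cuda_version(11080))      # (11, 8)
print(driver_supports(11080, "12.1"))  # False: driver too old for this build
```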
Hi @TontonTremblay,
I apologize for the follow-up messages. I think I managed to get closer, but I still face runtime issues. Specifically:
rafael@server:~/diff-dope/home/diff-dope$ python3 examples/simple_scene.py
[2024-01-09 14:19:07,568][diffdope.diffdope][INFO] - loaded mesh @data/example/mesh/AlphabetSoup.ply. Does it have texture map? True
[2024-01-09 14:19:07,570][diffdope.diffdope][INFO] - translation loaded: [-1.6116878 -2.0622094 -7.47151334]
[2024-01-09 14:19:07,571][diffdope.diffdope][INFO] - rotation loaded as quaternion: [ 0.28427788 -0.34248786 0.88225564 -0.15333994]
[2024-01-09 14:19:07,705][diffdope.diffdope][INFO] - Loaded image data/example/scene/rgb.png, shape: torch.Size([540, 960, 3])
[2024-01-09 14:19:07,727][diffdope.diffdope][INFO] - Loaded image data/example/scene/depth.png, shape: torch.Size([540, 960])
[2024-01-09 14:19:07,749][diffdope.diffdope][INFO] - Loaded image data/example/scene/seg.png, shape: torch.Size([540, 960, 3])
[2024-01-09 14:19:07,994][diffdope.diffdope][INFO] - batchsize is 8
[2024-01-09 14:19:07,994][diffdope.diffdope][INFO] - Object3D(
(pos): torch.Size([8]) ,[0]:[(-1.6116877794265747, -2.062209367752075, -7.471513271331787)] on cuda:0
(mesh): mesh @data/example/mesh/AlphabetSoup.ply. vtx:torch.Size([8, 8240, 3]) on cuda:0 on cuda:0
)
[2024-01-09 14:19:07,995][diffdope.diffdope][INFO] - Scene(path_img='data/example/scene/rgb.png', path_depth='data/example/scene/depth.png', path_segmentation='data/example/scene/seg.png', image_resize=0.5, tensor_rgb=torch.Size([8, 540, 960, 3]) @ data/example/scene/rgb.png on cuda:0, tensor_depth=torch.Size([8, 540, 960]) @ data/example/scene/depth.png on cuda:0, tensor_segmentation=torch.Size([8, 540, 960, 3]) @ data/example/scene/seg.png on cuda:0)
0%| | 0/61 [00:00<?, ?it/s]Using /home/rafael/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/rafael/.cache/torch_extensions/py38_cu121/renderutils_plugin/build.ninja...
Building extension module renderutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=renderutils_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/rafael/.local/lib/python3.8/site-packages/torch/include -isystem /home/rafael/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/rafael/.local/lib/python3.8/site-packages/torch/include/TH -isystem /home/rafael/.local/lib/python3.8/site-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -DNVDR_TORCH -std=c++17 -c /home/rafael/diff-dope/home/diff-dope/diffdope/c_src/mesh.cu -o mesh.cuda.o
FAILED: mesh.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=renderutils_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/rafael/.local/lib/python3.8/site-packages/torch/include -isystem /home/rafael/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/rafael/.local/lib/python3.8/site-packages/torch/include/TH -isystem /home/rafael/.local/lib/python3.8/site-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -DNVDR_TORCH -std=c++17 -c /home/rafael/diff-dope/home/diff-dope/diffdope/c_src/mesh.cu -o mesh.cuda.o
nvcc fatal : Value 'c++17' is not defined for option 'std'
ninja: build stopped: subcommand failed.
0%| | 0/61 [00:00<?, ?it/s]
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/rafael/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "examples/simple_scene.py", line 17, in main
ddope.run_optimization()
File "/home/rafael/diff-dope/home/diff-dope/diffdope/diffdope.py", line 1693, in run_optimization
self.renders = render_texture_batch(
File "/home/rafael/diff-dope/home/diff-dope/diffdope/diffdope.py", line 196, in render_texture_batch
pos_clip_ja = dd.xfm_points(pos.contiguous(), final_mtx_proj)
File "/home/rafael/diff-dope/home/diff-dope/diffdope/ops.py", line 143, in xfm_points
out = _xfm_func.apply(points, matrix, True)
File "/home/rafael/.local/lib/python3.8/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/rafael/diff-dope/home/diff-dope/diffdope/ops.py", line 109, in forward
return _get_plugin().xfm_fwd(points, matrix, isPoints, False)
File "/home/rafael/diff-dope/home/diff-dope/diffdope/ops.py", line 83, in _get_plugin
torch.utils.cpp_extension.load(
File "/home/rafael/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1308, in load
return _jit_compile(
File "/home/rafael/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/rafael/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/rafael/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'renderutils_plugin'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
This one is on a server with a 4090 card, driver 545.23.08, and CUDA 12.3.
We have another machine with a 3090 card, and I now get the same error there. That machine also runs Ubuntu 20.04, with driver 525.147.05 and CUDA 12.0.
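The `nvcc fatal : Value 'c++17' is not defined for option 'std'` line suggests that `/usr/bin/nvcc` belongs to an older toolkit than the CUDA 12.x reported by the driver, since nvcc accepts `-std=c++17` only from CUDA 11.0 onward. A minimal sketch for checking which release a given `nvcc` reports; the helper names are hypothetical.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(version_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' field of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def nvcc_release(nvcc: str = "/usr/bin/nvcc") -> Optional[Tuple[int, int]]:
    """Run the given nvcc binary and return its release, or None on failure."""
    try:
        out = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(out)


def supports_cpp17(release: Tuple[int, int]) -> bool:
    """nvcc understands -std=c++17 starting with CUDA 11.0."""
    return release >= (11, 0)
```

If `nvcc_release()` returns something below `(11, 0)`, the JIT build of `renderutils_plugin` would be expected to fail in the way shown in the log above.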
Sorry for not replying to the previous message, I was traveling.
I think the problem comes from the CUDA code we wrote to run torch.matmul faster.
https://github.com/NVlabs/diff-dope/blob/main/diffdope/diffdope.py#L196
Can you try replacing dd.xfm_points(pos.contiguous(), final_mtx_proj) with torch.matmul(pos, final_mtx_proj)?
It is also called for the depth: https://github.com/NVlabs/diff-dope/blob/main/diffdope/diffdope.py#L208
Thanks. This gives an error about the size of the tensor: "Expected size for first two dimensions of batch2 tensor to be [8, 3] but got [8, 4].". I also tried passing pos.contiguous() to the matmul call, but it gives the same error.
Yeah, this is just the wrong transpose. Sorry, I am not in front of a machine where I can debug easily.
I think you would want something like this:
posw = torch.cat([pos, torch.ones([pos.shape[0],pos.shape[1], 1]).cuda()], axis=2)
pos_clip_ja = torch.matmul(posw,final_mtx_proj.transpose(1,2))
For depth I am not sure. When I get into the office today, I will take some time to add a config option that disables the optimization we did, so that everything is pure torch. Sorry about that.
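The two lines suggested above append a homogeneous coordinate w = 1 to each vertex and multiply by the transposed 4x4 matrix, which is why the plain `torch.matmul(pos, final_mtx_proj)` failed with the [8, 3] versus [8, 4] size mismatch. Below is a self-contained CPU sketch of the same idea. `xfm_points_torch` is a hypothetical name, and `F.pad` is swapped in for the `torch.cat` with a tensor of ones.

```python
import torch
import torch.nn.functional as F


def xfm_points_torch(points: torch.Tensor, matrix: torch.Tensor) -> torch.Tensor:
    """Transform points [B, N, 3] by matrices [B, 4, 4], returning [B, N, 4]."""
    # Append w = 1 so that each point becomes homogeneous: [B, N, 4].
    points_h = F.pad(points, pad=(0, 1), mode="constant", value=1.0)
    # Row vectors times the transposed matrix, batched over B.
    return torch.matmul(points_h, matrix.transpose(1, 2))


points = torch.rand(8, 5, 3)
identity = torch.eye(4).expand(8, 4, 4)
out = xfm_points_torch(points, identity)
print(out.shape)  # torch.Size([8, 5, 4])
```

With identity matrices the first three output channels equal the input points and the fourth is 1, which is a quick way to confirm that the transpose is the right way round.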
After discussing with a colleague, I think the problem is that you do not have the right version of CUDA.
Looking at the GitHub issue, the first error is a standard nvdiffrast error. Switching to a RasterizeCudaContext or fixing the installation (for example, by looking at the nvdiffrast Dockerfile for the required setup) should solve it. Looking at https://github.com/NVlabs/diff-dope/blob/main/diffdope/ops.py, it is a minimal PyTorch CUDA extension with nothing fancy in it, so as long as the user has the same CUDA toolkit as the torch installation is using, that should be fine. You may want to add a blurb about that to the README, like we do in nvdiffrec: https://github.com/NVlabs/nvdiffrec?tab=readme-ov-file#one-time-setup-windows
This was his answer; I hope it helps.
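The colleague's point, that the CUDA toolkit used to compile the extension should be the one PyTorch was built with, can be checked before the first JIT build. This is a sketch under the assumption that comparing major and minor versions is enough; `toolkit_matches` and `describe` are hypothetical helpers. In practice the two strings would come from `torch.version.cuda` and the `release` field of `nvcc --version`.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Turn a version string such as '12.1' into (12, 1)."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def toolkit_matches(torch_cuda: str, nvcc_release: str) -> bool:
    """True when PyTorch's CUDA version and the nvcc release agree."""
    return major_minor(torch_cuda) == major_minor(nvcc_release)


def describe(torch_cuda: str, nvcc_release: str) -> str:
    """Human-readable summary of the comparison."""
    if toolkit_matches(torch_cuda, nvcc_release):
        return f"OK: PyTorch and nvcc both use CUDA {torch_cuda}"
    return f"Mismatch: PyTorch built for CUDA {torch_cuda}, nvcc is {nvcc_release}"


print(describe("12.1", "10.1"))
```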
Hi Jonathan,
This helped a lot; thank you both for your time. Everything is running now. Looking forward to playing with diff-dope! Thank you very much for this work and for taking the time to help the community.
Here are some notes that may help someone else in the future:
sudo apt install libglfw3-dev libgles2-mesa-dev
Hope this helps someone else too.
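A rough way to confirm that the libraries from the `apt install` line above are visible to the dynamic loader is sketched below. The library names `glfw` and `GLESv2` are assumptions about what those two packages provide, and `locate_gl_libraries` is a hypothetical helper.

```python
from ctypes.util import find_library
from typing import Dict, Optional, Sequence

# Assumed short names of the shared libraries behind the two apt packages.
LIBRARIES = ("glfw", "GLESv2")


def locate_gl_libraries(names: Sequence[str] = LIBRARIES) -> Dict[str, Optional[str]]:
    """Map each library name to what the loader finds, or None if missing."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    for name, found in locate_gl_libraries().items():
        print(f"{name}: {found or 'NOT FOUND'}")
```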
I also reverted the changes we made to the matrix multiplication. It seems that wasn't the problem, so the code that runs now is the original one without those modifications.
Thank you for the notes; I will add them to the README. I really appreciate that you did not give up on the issues. I try to make my work as accessible as possible, so it is a bummer for me that you had these problems, but I also feel that navigating CUDA and NVIDIA drivers is a mess that I have little impact on.
If you could share which versions of CUDA, drivers, and PyTorch you ended up using, that would be helpful.
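One way to gather those versions in a single report is sketched below. It assumes `nvidia-smi` and `nvcc` are on the PATH and degrades to placeholder strings when they, or PyTorch, are missing; `collect_versions` is a hypothetical helper, not part of diff-dope.

```python
import platform
import shutil
import subprocess
from typing import Dict, List


def _run(cmd: List[str]) -> str:
    """Run a command and return its stdout, or a short placeholder on failure."""
    if shutil.which(cmd[0]) is None:
        return "not found"
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return "failed"
    return result.stdout.strip() or "no output"


def collect_versions() -> Dict[str, str]:
    """Collect OS, Python, driver, toolkit, and PyTorch version strings."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "driver": _run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        ),
        "nvcc": _run(["nvcc", "--version"]),
    }
    try:
        import torch  # optional: only reported when installed

        info["torch"] = torch.__version__
        info["torch_cuda"] = str(torch.version.cuda)
    except ImportError:
        info["torch"] = "not installed"
        info["torch_cuda"] = "n/a"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```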
To be honest, I think this wasn't an issue with the code but rather with the NVIDIA driver, CUDA, and PyTorch configuration on my side.
Here are the details:
Thanks again.
Hello,
Thank you for sharing this work!
I am trying to get this to work on my machine and I get the following error:
I have the following setup:
I had a look around, and it seems to be a conflict between CUDA and OpenGL. I feel that the RTX 4090 is a good card, and it should have worked. Which Ubuntu version are you working with?