NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

ImportError: No module named 'nvdiffrast_plugin' #46

Open sunkymepro opened 2 years ago

sunkymepro commented 2 years ago

When I run the code in ./samples/torch, I always get the error: No module named 'nvdiffrast_plugin'

Traceback (most recent call last):
  File "triangle.py", line 21, in <module>
    glctx = dr.RasterizeGLContext()
  File "/opt/conda/envs/fomm/lib/python3.7/site-packages/nvdiffrast/torch/ops.py", line 142, in __init__
    self.cpp_wrapper = _get_plugin().RasterizeGLStateWrapper(output_db, mode == 'automatic')
  File "/opt/conda/envs/fomm/lib/python3.7/site-packages/nvdiffrast/torch/ops.py", line 83, in _get_plugin
    torch.utils.cpp_extension.load(name=plugin_name, sources=source_paths, extra_cflags=opts, extra_cuda_cflags=opts, extra_ldflags=ldflags, with_cuda=True, verbose=False)
  File "/opt/conda/envs/fomm/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1091, in load
    keep_intermediates=keep_intermediates)
  File "/opt/conda/envs/fomm/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1317, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/opt/conda/envs/fomm/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1706, in _import_module_from_library
    file, path, description = imp.find_module(module_name, [path])
  File "/opt/conda/envs/fomm/lib/python3.7/imp.py", line 299, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'nvdiffrast_plugin'

It seems like some packages are missing. I installed nvdiffrast as instructed in the documentation: cd ./nvdiffrast and pip install . I have uninstalled and reinstalled many times, but the error persists. I tried CUDA 10.0 + torch 1.6, CUDA 11.1 + torch 1.8.1, and CUDA 9.0 + torch 1.6, and all of these hit the error. I use an Nvidia 3090 GPU. Can anyone solve this problem? Thanks.

sunkymepro commented 2 years ago

I installed nvdiffrast in my own Docker image, installing the dependencies as in the Dockerfile, but the issue persists.

s-laine commented 2 years ago

It looks like the build of the plugin somehow fails silently. This should not happen with the Ninja build system, and without an error message saying what went wrong, it is difficult to debug the issue.

Just to double check: Are you seeing this behavior using the provided docker setup or only in your own?

HarshWinterBytes commented 2 years ago

I have also hit this problem! Could someone tell me how to solve it?

s-laine commented 2 years ago

Hi @LCY850729436, can you be a bit more specific? Is this with the Docker configuration provided by us, or in a different environment? If the latter, do you have the Ninja build system installed?

HarshWinterBytes commented 2 years ago

> Hi @LCY850729436, can you be a bit more specific? Is this with the Docker configuration provided by us, or in a different environment? If the latter, do you have the Ninja build system installed?

I have solved this problem. I think the cause was a mismatch between the GPU and the CUDA version: the problem occurs when I use a 2080 Ti, but not when I use a Titan.

sunkymepro commented 2 years ago

> Hi @LCY850729436, can you be a bit more specific? Is this with the Docker configuration provided by us, or in a different environment? If the latter, do you have the Ninja build system installed?
>
> I have solved this problem. I think the cause was a mismatch between the GPU and the CUDA version: the problem occurs when I use a 2080 Ti, but not when I use a Titan.

I use a 3090 GPU.

xjcvip007 commented 2 years ago

I use two 2080 Ti cards in Docker; the same problem occurred!

s-laine commented 2 years ago

Hi everyone,

I'm eager to help in solving this problem, but more information is needed about what exactly goes wrong. We know there are plenty of working installations out there, so something must be different in the setups that exhibit this problem.

To start, I repeat my question to everyone who experiences this problem: Is this with the Docker configuration provided by us, or in a different environment? If the latter, do you have the Ninja build system installed?

Second, I would like to ask you to change verbose=False to verbose=True in the call to torch.utils.cpp_extension.load in nvdiffrast/torch/ops.py line 84, and share the output.

Finally, if someone has seen this problem and found a way to fix it, please share your solution. The error indicates that the nvdiffrast C++/Cuda plugin could not be loaded, and the most likely reason is that it could not be compiled. I imagine this could occur for a variety of reasons, and therefore there could be multiple different root causes for the same issue.
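To make gathering that information easier, here is a small diagnostic sketch (hypothetical, not part of nvdiffrast) that collects the facts asked for above: whether Ninja, nvcc, and a C++ compiler are on PATH, and which CUDA-related environment variables are set.

```python
# Hypothetical diagnostic helper -- collects the build-environment facts
# relevant to PyTorch's JIT extension compilation. Names are illustrative.
import os
import shutil

def collect_build_env():
    """Return a dict describing the local build environment."""
    return {
        "ninja": shutil.which("ninja"),          # None => Ninja not on PATH
        "nvcc": shutil.which("nvcc"),            # None => CUDA toolkit not on PATH
        "cc": shutil.which("cl") or shutil.which("g++") or shutil.which("gcc"),
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "CUDA_PATH": os.environ.get("CUDA_PATH"),
        "CUDNN_HOME": os.environ.get("CUDNN_HOME"),
    }

if __name__ == "__main__":
    for key, value in collect_build_env().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into a bug report alongside the verbose build log should cover most of the questions above.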

xjcvip007 commented 2 years ago

Hi @s-laine, I use the Docker conf provided by you as below:

ARG BASE_IMAGE=pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel
FROM $BASE_IMAGE

RUN apt-get update && apt-get install -y --no-install-recommends \
    pkg-config \
    libglvnd0 \
    libgl1 \
    libglx0 \
    libegl1 \
    libgles2 \
    libglvnd-dev \
    libgl1-mesa-dev \
    libegl1-mesa-dev \
    libgles2-mesa-dev \
    cmake \
    curl \
    build-essential \
    git \
    curl \
    vim \
    wget \
    ca-certificates \
    libjpeg-dev \
    libpng-dev \
    apt-utils \
    bzip2 \
    tmux \
    gcc \
    g++ \
    openssh-server \
    software-properties-common \
    xauth \
    zip \
    unzip \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

#x forward update
RUN echo "X11UseLocalhost no" >> /etc/ssh/sshd_config \
    && mkdir -p /run/sshd

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

#for GLEW
ENV LD_LIBRARY_PATH /usr/lib64:$LD_LIBRARY_PATH

#nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility,graphics

#Default pyopengl to EGL for good headless rendering support
ENV PYOPENGL_PLATFORM egl

COPY docker/10_nvidia.json /usr/share/glvnd/egl_vendor.d/10_nvidia.json

RUN pip install -i https://pypi.tuna.tsinghua.edu.cn/simple imageio imageio-ffmpeg

COPY nvdiffrast /tmp/pip/nvdiffrast/
COPY README.md setup.py /tmp/pip/
RUN cd /tmp/pip && pip install .

And when I run 'triangle.py' the ImportError happens. I set verbose=True as you suggested; the errors are shown in the attached screenshot.

s-laine commented 2 years ago

@xjcvip007, thank you for the information. It appears that you are not running the Dockerfile provided in our repo, as the base image in yours is pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel, whereas in ours it is pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel. The block with #x forward update is also not from our Dockerfile.

Can you try the same experiment with a container built using our Dockerfile?

xjcvip007 commented 2 years ago

@s-laine, I cannot use your default Dockerfile because of our GPU cloud platform's constraints, so we changed the base image from pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel to pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel and added some installation steps for our sshd support, but all the required config and files are included in the Dockerfile.

s-laine commented 2 years ago

I tried this with a Linux machine, and I'm unfortunately unable to replicate the problem even when using your Dockerfile (with the missing backslashes added, and imageio/imageio-ffmpeg installed from the default source).

My test machine has the following operating system, as reported by uname -a: Linux <hostname> 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

And nvidia-smi reports the following version information:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+

As the container looks to be fine, I suspect you may have outdated graphics drivers, because those depend on the host operating system instead of the container. Alternatively, building the container does not produce the same result for one reason or another, but I don't know enough about Docker to tell why this might happen. What I don't understand is why there are no useful error messages, so I still don't know what exactly fails when you try to run the example.

For reference, below is the exact Dockerfile that I used:

ARG BASE_IMAGE=pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel
FROM $BASE_IMAGE

RUN apt-get update && apt-get install -y --no-install-recommends \
pkg-config \
libglvnd0 \
libgl1 \
libglx0 \
libegl1 \
libgles2 \
libglvnd-dev \
libgl1-mesa-dev \
libegl1-mesa-dev \
libgles2-mesa-dev \
cmake \
curl \
build-essential \
git \
curl \
vim \
wget \
ca-certificates \
libjpeg-dev \
libpng-dev \
apt-utils \
bzip2 \
tmux \
gcc \
g++ \
openssh-server \
software-properties-common \
xauth \
zip \
unzip \
&& apt-get clean

#x forward update
RUN echo "X11UseLocalhost no" >> /etc/ssh/sshd_config \
&& mkdir -p /run/sshd

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

#for GLEW
ENV LD_LIBRARY_PATH /usr/lib64:$LD_LIBRARY_PATH

#nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility,graphics

#Default pyopengl to EGL for good headless rendering support
ENV PYOPENGL_PLATFORM egl

COPY docker/10_nvidia.json /usr/share/glvnd/egl_vendor.d/10_nvidia.json

RUN pip install imageio imageio-ffmpeg
#RUN pip install -i https://pypi.tuna.tsinghua.edu.cn/simple imageio imageio-ffmpeg

COPY nvdiffrast /tmp/pip/nvdiffrast/
COPY README.md setup.py /tmp/pip/
RUN cd /tmp/pip && pip install .

I built the container with ./run_sample.sh --build-container and executed the sample with ./run_sample.sh ./samples/torch/triangle.py.

I also tried launching a shell into the container by doing

docker run --rm -it --gpus all -v `pwd`:/app --workdir /app -e TORCH_EXTENSIONS_DIR=/app/tmp gltorch:latest bash

and running python samples/torch/triangle.py manually from within it, and that worked too.

xjcvip007 commented 2 years ago

@s-laine thanks for your effort. I will try the Dockerfile with newer graphics drivers; my nvidia-smi result is in the attached screenshot.

DoubleYanLee commented 2 years ago

Hi everyone,

I'm eager to help in solving this problem, but more information is needed of what exactly goes wrong. We know there are plenty of working installations out there, so something must be different in the setups that exhibit this problem.

To start, I repeat my question to everyone that experiences this problem: Is this with the Docker configuration provided by us, or in a different environment? If latter, do you have the Ninja build system installed?

Second, I would like to ask you to change verbose=False to verbose=True in the call to torch.utils.cpp_extension.load in nvdiffrast/torch/ops.py line 84, and share the output.

Finally, if someone has seen this problem and found a way to fix it, please share your solution. The error indicates that the nvdiffrast C++/Cuda plugin could not be loaded, and the most likely reason is that it could not be compiled. I imagine this could occur for a variety of reasons, and therefore there could be multiple different root causes for the same issue.

Hello, I've met the same problem. I didn't use Docker; my environment is CUDA 10.2 + PyTorch 1.7.1 + torchvision 0.8.2, and I installed nvdiffrast with 'pip install .' in the nvdiffrast directory. When I run 'python pose.py', I get the same 'ImportError: No module named nvdiffrast_plugin', plus an additional error, shown in the screenshot:

[Screenshot 2021-10-30 at 9.15.16 AM]

s-laine commented 2 years ago

This appears to be an incompatibility between PyTorch and the C++ compiler in the Linux distribution. A discussion here mentions this error when trying to build PyTorch extensions on Arch Linux.

So this issue isn't specific to nvdiffrast, but prevents the building of any C++ based PyTorch extensions on your system. If PyTorch refuses to work with the compiler on the system, there unfortunately isn't anything we can do about it. We recommend using an Ubuntu distribution as that's what we have tested everything on.

bo233 commented 2 years ago

I have solved the problem. I hit it on Windows, where it is due to Ninja failing to compile the plugin. What I did:

  1. Added cl.exe (C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30113\bin\Hostx64\x64) and ninja.exe (*\Anaconda\envs\*\Lib\site-packages\ninja\data\bin) to the environment variables (I'm not sure whether this matters).
  2. Changed verbose=False to verbose=True in the call to torch.utils.cpp_extension.load in nvdiffrast/torch/ops.py line 84, which revealed the plugin's build folder: C:\Users\*\AppData\Local\torch_extensions\torch_extensions\Cache\nvdiffrast_plugin.
  3. cd'd into that folder and ran ninja manually; it calls cl.exe, which reported missing header files (in my case cstddef).
  4. Located the headers and added their path to the INCLUDE environment variable (C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30113\include), and likewise the libraries to LIB (C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30113\lib\x64).

After that the plugin compiled successfully.

c1a1o1 commented 2 years ago

> I have solved the problem. I hit it on Windows, where it is due to Ninja failing to compile the plugin. [...]

@bo233 How do I include the last two paths?

XCR16729438 commented 1 year ago

I got the same problem on Ubuntu 18.04 under WSL2 (Windows Subsystem for Linux) with an RTX 3060 laptop GPU. But I succeeded in compiling it on Windows 10 on the same computer. (Strange, I believed Linux is always better than Windows XD.)

s-laine commented 1 year ago

OpenGL/Cuda interop isn't currently supported in WSL2 and thus it won't be able to run the OpenGL rasterizer in nvdiffrast.

The next release of nvdiffrast will include a Cuda-based rasterizer that sidesteps the compatibility issues on platforms where OpenGL doesn't work. The release should be out early next week.

s-laine commented 1 year ago

The Cuda rasterizer is now released in v0.3.0. Documentation notes here.

shengzewen commented 6 months ago

Has anyone successfully solved this problem on Windows, or on Linux?

icewired-yy commented 5 months ago

I have an interesting experience when using nvdiffrast on Windows and I would like to share here.

I had downloaded CuDNN and added its path to the system environment variables: I created a variable named CUDNN_HOME pointing at the base path of the CuDNN directory and added %CUDNN_HOME%\bin to PATH. Then I found that my nvdiffrast compilation failed.

So I compiled nvdiffrast manually via ninja --verbose and found that something was wrong with the contents of build.ninja: CUDNN_HOME unexpectedly appeared in it, pointing at a wrong path. I now think nvdiffrast automatically detects CuDNN through the system environment variables. I deleted CUDNN_HOME, and now everything works.

My case may not cover the general one, but I hope this helps anyone who makes the same mistake.

s-laine commented 5 months ago

@icewired-yy Thanks for the report!

Nvdiffrast does not do anything special about CuDNN or look for the related environment variables, but PyTorch's cpp extension builder seems to have some logic related to it here.

Upon a quick glance, it looks like PyTorch expects CUDNN_HOME, if defined, to point to the main CuDNN directory instead of the bin directory. This may explain build.ninja ending up with broken paths.

Good to have this noted here if others bump into the same issue.
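For illustration, a minimal sketch of the failure mode described above, assuming the build system derives include and lib paths by joining CUDNN_HOME with subdirectory names (the helper name and paths here are illustrative, not PyTorch's actual code):

```python
# Sketch of the assumed failure mode: joining CUDNN_HOME with expected
# subdirectory names yields nonexistent paths when CUDNN_HOME points at
# the bin directory instead of the CuDNN root.
import os

def derived_cudnn_paths(cudnn_home):
    """Mimic deriving include/lib paths from a CUDNN_HOME-style variable."""
    return {
        "include": os.path.join(cudnn_home, "include"),
        "lib": os.path.join(cudnn_home, "lib"),
    }

# Correct: CUDNN_HOME points at the CuDNN root directory.
good = derived_cudnn_paths("/opt/cudnn")
# Broken: CUDNN_HOME points at the bin directory, as in the report above.
bad = derived_cudnn_paths("/opt/cudnn/bin")

print(good["include"])  # /opt/cudnn/include
print(bad["include"])   # /opt/cudnn/bin/include -- does not exist
```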

createthis commented 5 months ago

I spent about 10 hours today trying to get this to work on Windows 10 with Visual Studio 2022, using Git Bash (note the Unix-style C: paths). I was able to solve the ninja compilation issues with:

# fixes  functional crtdbg.h basetsd.h
export INCLUDE="/c/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.38.33130/include:/c/Program Files (x86)/Windows Kits/10/Include/10.0.22621.0/ucrt:/c/Program Files (x86)/Windows Kits/10/Include/10.0.22621.0/shared"
# fixes  kernel32.Lib ucrt.lib
export LIB="/c/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.38.33130/lib/x64/:/c/Program Files (x86)/Windows Kits/10/Lib/10.0.22621.0/um/x64/:/c/Program Files (x86)/Windows Kits/10/Lib/10.0.22621.0/ucrt/x64"

There are no more errors building the plugin with ninja. However, I still see:

ImportError: DLL load failed while importing nvdiffrast_plugin: The specified module could not be found.

when building the plugin from threestudio on export.

createthis commented 5 months ago

I finally got past this error with another 2-1/2 hours of work on Windows 10 with Visual Studio Community 2022.

First, See previous comment for how I got nvdiffrast_plugin building correctly using ninja.

Next, I had to figure out why the import was failing. To do this, I needed to manually reproduce the problem:

# This path will be different for each system/person. I got the path from the output of the verbose=True change
cd /c/Users/jesse/AppData/Local/Packages/PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0/LocalCache/Local/torch_extensions/torch_extensions/Cache/py310_cu118/nvdiffrast_plugin
ls -al
python -vvvvvvvvvvvvvv
>>> import nvdiffrast_plugin

Here's a screenshot of my directory listing: [screenshot]

Here's a screenshot of the repro: [screenshot]

Next, I asked myself why the import was failing when the .pyd file was clearly right there. After some googling, I learned a few things:

  1. .pyd files are basically shared DLLs on Windows.
  2. python will give this error if the shared DLL links other shared DLLs and those DLLs cannot be found.

So let's list the other shared DLLs:

 dumpbin //dependents ./nvdiffrast_plugin.pyd

Here's a screenshot of that output: [screenshot]

Next, I painstakingly identified the location of each of these DLLs and crafted these statements to allow python to find them:

import os
os.add_dll_directory(r"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.33130\lib\x64")
os.add_dll_directory(r"C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\um\x64")
os.add_dll_directory(r"C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\ucrt\x64")

# c10.dll
os.add_dll_directory(r"C:\Users\jesse\Documents\ai\threestudio\venv\Lib\site-packages\torch\lib")

# cudart64_12.dll
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin")

Once this has been done, we can import the plugin successfully: [screenshot]

I then solved the problem in code by adding these statements to the top of launch.py (I'm using this from threestudio).

Hope this saves someone else a day of work. I hate windows! 🤣
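For anyone adapting the workaround above, here is a hedged sketch that wraps the os.add_dll_directory calls so they are safe to keep in a launcher script: it only registers directories that actually exist, and it is a no-op on non-Windows platforms, where os.add_dll_directory is unavailable. The helper name is illustrative, and the paths are machine-specific examples from the comment above, not defaults.

```python
# Illustrative helper: register extra DLL search directories, skipping
# anything that doesn't exist. os.add_dll_directory only exists on Windows,
# so the hasattr guard makes this a no-op elsewhere.
import os

def add_dll_dirs(candidates):
    """Register each existing directory for DLL resolution; return those added."""
    added = []
    for path in candidates:
        if hasattr(os, "add_dll_directory") and os.path.isdir(path):
            os.add_dll_directory(path)
            added.append(path)
    return added

# Example paths from the comment above -- adjust for your own installation.
add_dll_dirs([
    r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin",
])
```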

cyrildiagne commented 2 months ago

Has anyone found a fix? I'm facing the same issue with RTX 3090, and an environment setup using conda (not using Docker):

I clone this repo and run pip install . without error:

Collecting numpy
  Downloading numpy-1.19.5-cp36-cp36m-manylinux2010_x86_64.whl (14.8 MB)
     |████████████████████████████████| 14.8 MB 21.6 MB/s 
Building wheels for collected packages: nvdiffrast
  Building wheel for nvdiffrast (setup.py) ... done
  Created wheel for nvdiffrast: filename=nvdiffrast-0.3.1-py3-none-any.whl size=137866 sha256=f6736342f9499bcab7d5fd651434608921671a66bc4337bde1096d18bb1a9a78
  Stored in directory: /tmp/pip-ephem-wheel-cache-4j89pp58/wheels/fd/b0/9b/ee78c398f92015d6a02b99f5db6a08c41b1a47c4be7e2e0631
Successfully built nvdiffrast
Installing collected packages: numpy, nvdiffrast
Successfully installed numpy-1.19.5 nvdiffrast-0.3.1

But then if I try to import:

import nvdiffrast.torch as dr
dr.RasterizeCudaContext(device=device)

I get the same issue:

ImportError: No module named 'nvdiffrast_plugin'

EDIT: The solution for me was to install CUDA on the system (following this guide) rather than relying on conda's CUDA packages.
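A minimal sanity check along the lines of that fix, assuming that a full system CUDA toolkit install puts nvcc on PATH while conda's runtime-only packages generally don't (the function name is illustrative, not nvdiffrast API):

```python
# Illustrative check: a full CUDA toolkit install normally provides nvcc on
# PATH; conda's runtime-only CUDA packages typically do not, which breaks
# JIT compilation of extensions such as nvdiffrast_plugin.
import shutil

def cuda_toolchain_present():
    """Return True if an nvcc compiler binary is reachable on PATH."""
    return shutil.which("nvcc") is not None

if not cuda_toolchain_present():
    print("nvcc not found: install the CUDA toolkit on the system, "
          "not only conda's CUDA runtime packages")
```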

iiiCpu commented 1 month ago

Long story short: nvdiffrast_plugin is built against the CUDA_PATH version of CUDA, not the first (or only) one in PATH. So delete the cached nvdiffrast_plugin and set the correct CUDA_PATH before running:

rmdir /S %userprofile%\AppData\Local\torch_extensions\torch_extensions\Cache\py310_cu121\nvdiffrast_plugin
set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\
iiiCpu commented 1 month ago

> Long story short: nvdiffrast_plugin is built against the CUDA_PATH version of CUDA, not the first (or only) one in PATH. [...]

@s-laine @nurpax @jannehellsten I think this should be mentioned in the official documentation as one of the pre-installation steps. Since many ML practitioners (you included) keep different CUDA versions for different projects, guessing where things went wrong can sometimes be tricky.
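The cache purge above can also be scripted cross-platform. Below is a sketch (not official nvdiffrast tooling) that honors the TORCH_EXTENSIONS_DIR override and otherwise falls back to the usual Linux cache location; the Windows cache root differs, so pass it explicitly there via the environment variable.

```python
# Illustrative sketch: remove every cached build of a torch extension so the
# next import recompiles it against the currently selected CUDA toolkit.
import os
import shutil

def purge_plugin_cache(plugin_name="nvdiffrast_plugin"):
    """Delete all cached builds of the named extension; return paths removed."""
    # TORCH_EXTENSIONS_DIR overrides the cache root; the fallback below is
    # the usual Linux default (Windows keeps it under %LOCALAPPDATA% instead).
    root = os.environ.get(
        "TORCH_EXTENSIONS_DIR",
        os.path.join(os.path.expanduser("~"), ".cache", "torch_extensions"),
    )
    removed = []
    for dirpath, dirnames, _ in os.walk(root):
        if plugin_name in dirnames:
            target = os.path.join(dirpath, plugin_name)
            shutil.rmtree(target)
            dirnames.remove(plugin_name)  # don't descend into the deleted dir
            removed.append(target)
    return removed
```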

Linxmotion commented 1 week ago

I have the same problem, also with a 3090. Looking at the comments, most people seem to hit this with a 3090.