NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

[E glutil.cpp:248] eglMakeCurrent() failed when setting GL context #45

Closed. Mirocos closed this issue 2 years ago.

Mirocos commented 2 years ago

Hi, I followed the documentation and am using nvdiffrast on Ubuntu 18.04 LTS with CUDA 11.4. Running the following command throws an exception.

Command:

python3 cube.py --resolution 16 --display-interval 10

Exception:

No output directory specified, not saving log or images
Mesh has 12 triangles and 8 vertices.
iter=0,err=0.489876
[E glutil.cpp:248] eglMakeCurrent() failed when setting GL context
Traceback (most recent call last):
  File "cube.py", line 200, in <module>
    main()
  File "cube.py", line 191, in main
    mp4save_fn='progress.mp4'
  File "cube.py", line 122, in fit_cube
    color     = render(glctx, r_mvp, vtx_pos, pos_idx, vtx_col, col_idx, resolution)
  File "cube.py", line 30, in render
    rast_out, _ = dr.rasterize(glctx, pos_clip, pos_idx, resolution=[resolution, resolution])
  File "/home/zeming/.local/lib/python3.6/site-packages/nvdiffrast/torch/ops.py", line 241, in rasterize
    return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db, -1)
  File "/home/zeming/.local/lib/python3.6/site-packages/nvdiffrast/torch/ops.py", line 175, in forward
    out, out_db = _get_plugin().rasterize_fwd(glctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
RuntimeError: Cuda error: 219[cudaGraphicsMapResources(2, &s.cudaPosBuffer, stream);]
[E glutil.cpp:248] eglMakeCurrent() failed when setting GL context
terminate called after throwing an instance of 'c10::Error'
  what():  Cuda error: 219[cudaGraphicsUnregisterResource(s.cudaPosBuffer);]
Exception raised from rasterizeReleaseBuffers at /home/zeming/.local/lib/python3.6/site-packages/nvdiffrast/common/rasterize.cpp:573 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7efc192eda22 in /home/zeming/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7efc192ea3db in /home/zeming/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: rasterizeReleaseBuffers(int, RasterizeGLState&) + 0xdb (0x7efaa982e63f in /home/zeming/.cache/torch_extensions/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #3: RasterizeGLStateWrapper::~RasterizeGLStateWrapper() + 0x33 (0x7efaa9885397 in /home/zeming/.cache/torch_extensions/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #4: std::default_delete<RasterizeGLStateWrapper>::operator()(RasterizeGLStateWrapper*) const + 0x22 (0x7efaa986c9f2 in /home/zeming/.cache/torch_extensions/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #5: std::unique_ptr<RasterizeGLStateWrapper, std::default_delete<RasterizeGLStateWrapper> >::~unique_ptr() + 0x49 (0x7efaa98618c9 in /home/zeming/.cache/torch_extensions/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #6: <unknown function> + 0xab003 (0x7efaa985b003 in /home/zeming/.cache/torch_extensions/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #7: <unknown function> + 0x4ff688 (0x7efc222e2688 in /home/zeming/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x50098e (0x7efc222e398e in /home/zeming/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: python3() [0x5732de]
frame #10: python3() [0x54edd2]
frame #11: python3() [0x588fd8]
frame #12: python3() [0x5add78]
frame #13: python3() [0x5add8e]
frame #14: python3() [0x5add8e]
frame #15: python3() [0x56b606]
<omitting python frames>
frame #21: __libc_start_main + 0xe7 (0x7efc28269bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Any advice? Thanks!

Mirocos commented 2 years ago

Looks like the driver versions don't match? The installed packages are 470.63.01, but nvidia-smi reports driver 470.57.02.

dpkg -l | grep -i opengl


ii  libepoxy0:amd64                                             1.4.3-1                                                          amd64        OpenGL function pointer management library
ii  libgl1-mesa-dev:amd64                                       20.0.8-0ubuntu1~18.04.1                                          amd64        free implementation of the OpenGL API -- GLX development files
ii  libgl1-mesa-dri:amd64                                       20.0.8-0ubuntu1~18.04.1                                          amd64        free implementation of the OpenGL API -- DRI modules
ii  libgl1-mesa-dri:i386                                        20.0.8-0ubuntu1~18.04.1                                          i386         free implementation of the OpenGL API -- DRI modules
ii  libgles2-mesa-dev:amd64                                     20.0.8-0ubuntu1~18.04.1                                          amd64        free implementation of the OpenGL|ES 2.x API -- development files
ii  libglfw3:amd64                                              3.2.1-1                                                          amd64        portable library for OpenGL, window and input (x11 libraries)
ii  libglfw3-dev:amd64                                          3.2.1-1                                                          amd64        portable library for OpenGL, window and input (development files)
ii  libglu1-mesa:amd64                                          9.0.0-2.1build1                                                  amd64        Mesa OpenGL utility library (GLU)
ii  libglu1-mesa-dev:amd64                                      9.0.0-2.1build1                                                  amd64        Mesa OpenGL utility library -- development files
ii  libglx-mesa0:amd64                                          20.0.8-0ubuntu1~18.04.1                                          amd64        free implementation of the OpenGL API -- GLX vendor library
ii  libglx-mesa0:i386                                           20.0.8-0ubuntu1~18.04.1                                          i386         free implementation of the OpenGL API -- GLX vendor library
ii  libnvidia-cfg1-470:amd64                                    470.63.01-0ubuntu0.18.04.2                                       amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-fbc1-470:amd64                                    470.63.01-0ubuntu0.18.04.2                                       amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-fbc1-470:i386                                     470.63.01-0ubuntu0.18.04.2                                       i386         NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-470:amd64                                      470.63.01-0ubuntu0.18.04.2                                       amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-gl-470:i386                                       470.63.01-0ubuntu0.18.04.2                                       i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-ifr1-470:amd64                                    470.63.01-0ubuntu0.18.04.2                                       amd64        NVIDIA OpenGL-based Inband Frame Readback runtime library
ii  libnvidia-ifr1-470:i386                                     470.63.01-0ubuntu0.18.04.2                                       i386         NVIDIA OpenGL-based Inband Frame Readback runtime library
ii  libopengl0:amd64                                            1.0.0-2ubuntu2.3                                                 amd64        Vendor neutral GL dispatch library -- OpenGL support
ii  libqt4-opengl:amd64                                         4:4.8.7+dfsg-7ubuntu1                                            amd64        Qt 4 OpenGL module
ii  libqt5opengl5:amd64                                         5.9.5+dfsg-0ubuntu2.6                                            amd64        Qt 5 OpenGL module
ii  libqt5opengl5-dev:amd64                                     5.9.5+dfsg-0ubuntu2.6                                            amd64        Qt 5 OpenGL library development files

nvidia-smi

Thu Sep 23 12:00:02 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:05:00.0  On |                  N/A |
|  0%   49C    P8    13W / 125W |    368MiB /  5941MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1817      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      1947      G   /usr/bin/gnome-shell               68MiB |
|    0   N/A  N/A      2929      G   /usr/lib/xorg/Xorg                150MiB |
|    0   N/A  N/A      3053      G   /usr/bin/gnome-shell               38MiB |
|    0   N/A  N/A      3748      G   ...AAAAAAAAA= --shared-files       87MiB |
+-----------------------------------------------------------------------------+

s-laine commented 2 years ago

Setting up EGL is somewhat tricky, so we recommend trying the provided Docker container first and seeing if that works. If you then want to run with a local installation, you can use the container as a reference. Or did this crash happen inside Docker?

I don't know what might explain the difference between the reported and installed driver versions. Perhaps the system hasn't been rebooted since the update was performed?
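To check for this kind of mismatch without eyeballing two listings, one can compare the version of the loaded kernel module with the version of the installed user-space packages. The sketch below is a minimal illustration, not part of nvdiffrast: the helper names are hypothetical, and it assumes a Debian/Ubuntu system where the loaded version appears in `/proc/driver/nvidia/version` and the installed one in the `libnvidia-gl-*` line of `dpkg -l`, as in the listing above.

```python
import re
import shutil
import subprocess
from pathlib import Path


def loaded_driver_version(proc_text: str) -> str:
    """Return the kernel-module version from /proc/driver/nvidia/version text."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    if match is None:
        raise ValueError("no 'Kernel Module <version>' entry found")
    return match.group(1)


def installed_driver_version(dpkg_line: str) -> str:
    """Return the upstream version (the part before the first '-') from a `dpkg -l` line."""
    # Assumed `dpkg -l` columns: status, package, version, architecture, description
    version_field = dpkg_line.split()[2]
    match = re.match(r"\d+(?:\.\d+)+", version_field)
    if match is None:
        raise ValueError(f"unexpected version field: {version_field!r}")
    return match.group(0)


def versions_match(proc_text: str, dpkg_line: str) -> bool:
    return loaded_driver_version(proc_text) == installed_driver_version(dpkg_line)


def check_host() -> None:
    """Compare the loaded module against every installed libnvidia-gl-* package."""
    proc_file = Path("/proc/driver/nvidia/version")
    if not proc_file.exists() or shutil.which("dpkg") is None:
        print("NVIDIA kernel module or dpkg not found; nothing to compare")
        return
    proc_text = proc_file.read_text()
    listing = subprocess.run(["dpkg", "-l"], stdout=subprocess.PIPE,
                             universal_newlines=True, check=True).stdout
    for line in listing.splitlines():
        if line.startswith("ii") and "libnvidia-gl-" in line:
            status = "OK" if versions_match(proc_text, line) else "MISMATCH (reboot?)"
            print(line.split()[1], status)
```

Applied to the output in this thread, `check_host()` would presumably report 470.57.02 loaded versus 470.63.01 installed, which is the situation a reboot normally resolves.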

Mirocos commented 2 years ago

I have now fixed the version mismatch, but it still fails with the same error.

However, the sample command succeeds without the "--display-interval" option.

The reason I didn't use Docker is that it fails with the following message:

./run_sample.sh ./samples/torch/cube.py --resolution 32

Using container image: gltorch:latest
Running command: ./samples/torch/cube.py --resolution 32
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

This is the output when building the container:

./run_sample.sh --build-container
Sending build context to Docker daemon  56.73MB
Step 1/14 : ARG BASE_IMAGE=pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel
Step 2/14 : FROM $BASE_IMAGE
 ---> 7554ac65eba5
Step 3/14 : RUN apt-get update && apt-get install -y --no-install-recommends     pkg-config     libglvnd0     libgl1     libglx0     libegl1     libgles2     libglvnd-dev     libgl1-mesa-dev     libegl1-mesa-dev     libgles2-mesa-dev     cmake     curl
 ---> Using cache
 ---> e4a09e440d68
Step 4/14 : ENV PYTHONDONTWRITEBYTECODE=1
 ---> Using cache
 ---> 631c224d81ed
Step 5/14 : ENV PYTHONUNBUFFERED=1
 ---> Using cache
 ---> 1ba7bd67688d
Step 6/14 : ENV LD_LIBRARY_PATH /usr/lib64:$LD_LIBRARY_PATH
 ---> Using cache
 ---> f952c0b196b8
Step 7/14 : ENV NVIDIA_VISIBLE_DEVICES all
 ---> Using cache
 ---> f12db9f73bc8
Step 8/14 : ENV NVIDIA_DRIVER_CAPABILITIES compute,utility,graphics
 ---> Using cache
 ---> aa3eff00fe58
Step 9/14 : ENV PYOPENGL_PLATFORM egl
 ---> Using cache
 ---> 19e5de1e15e0
Step 10/14 : COPY docker/10_nvidia.json /usr/share/glvnd/egl_vendor.d/10_nvidia.json
 ---> Using cache
 ---> 376259c0df7d
Step 11/14 : RUN pip install imageio imageio-ffmpeg
 ---> Using cache
 ---> dc0e3348ca79
Step 12/14 : COPY nvdiffrast /tmp/pip/nvdiffrast/
 ---> Using cache
 ---> 2d1212d00ab5
Step 13/14 : COPY README.md setup.py /tmp/pip/
 ---> Using cache
 ---> 4146b33d2378
Step 14/14 : RUN cd /tmp/pip && pip install .
 ---> Using cache
 ---> 69a6c149e11c
Successfully built 69a6c149e11c
Successfully tagged gltorch:latest
Sending build context to Docker daemon  56.73MB
Step 1/14 : ARG BASE_IMAGE=pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel
Step 2/14 : FROM $BASE_IMAGE
 ---> e544497892a3
Step 3/14 : RUN apt-get update && apt-get install -y --no-install-recommends     pkg-config     libglvnd0     libgl1     libglx0     libegl1     libgles2     libglvnd-dev     libgl1-mesa-dev     libegl1-mesa-dev     libgles2-mesa-dev     cmake     curl
 ---> Using cache
 ---> 12a6fec8eaca
Step 4/14 : ENV PYTHONDONTWRITEBYTECODE=1
 ---> Using cache
 ---> 4ce4de9c99d6
Step 5/14 : ENV PYTHONUNBUFFERED=1
 ---> Using cache
 ---> 3ad3686f14a5
Step 6/14 : ENV LD_LIBRARY_PATH /usr/lib64:$LD_LIBRARY_PATH
 ---> Using cache
 ---> 2a3647186c1b
Step 7/14 : ENV NVIDIA_VISIBLE_DEVICES all
 ---> Using cache
 ---> 75f2e984a64a
Step 8/14 : ENV NVIDIA_DRIVER_CAPABILITIES compute,utility,graphics
 ---> Using cache
 ---> 73fb77c464e0
Step 9/14 : ENV PYOPENGL_PLATFORM egl
 ---> Using cache
 ---> 12542217941c
Step 10/14 : COPY docker/10_nvidia.json /usr/share/glvnd/egl_vendor.d/10_nvidia.json
 ---> Using cache
 ---> d3ee66f5a806
Step 11/14 : RUN pip install imageio imageio-ffmpeg
 ---> Using cache
 ---> 10e1baa2bf04
Step 12/14 : COPY nvdiffrast /tmp/pip/nvdiffrast/
 ---> Using cache
 ---> 3c6c52c781ac
Step 13/14 : COPY README.md setup.py /tmp/pip/
 ---> Using cache
 ---> e4ec44f0c452
Step 14/14 : RUN cd /tmp/pip && pip install .
 ---> Using cache
 ---> 3d7f37505e25
Successfully built 3d7f37505e25
Successfully tagged gltensorflow:latest

No python sample given or file '' not found.  Exiting.
s-laine commented 2 years ago

Thank you for the clarification. The interactive display on Linux is apparently very problematic, and I'm not sure if anyone has found a setup where it works so far. As of now, disabling the interactive display seems to be the only possibility.

I'll add a note about this in the documentation in the next update.
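If you keep a modified copy of the sample, one hypothetical way to apply this workaround automatically is to force the display interval to 0 on Linux before it is used. `effective_display_interval` is not part of nvdiffrast or cube.py; it is an illustration, and the suggestion to use `--outdir` assumes the sample's option for saving images.

```python
import sys
import warnings


def effective_display_interval(requested: int, platform: str = sys.platform) -> int:
    """Return the display interval to actually use; 0 disables the interactive window."""
    if requested and platform.startswith("linux"):
        # Interactive display is reported as unreliable on Linux, so turn it off.
        warnings.warn("interactive display disabled on Linux; "
                      "save images with --outdir instead")
        return 0
    return requested
```

Since argparse stores `--display-interval` as `args.display_interval`, a modified script could pass `effective_display_interval(args.display_interval)` wherever it currently passes the raw value.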

Mirocos commented 2 years ago

So does the Docker container build correctly? And how can I solve an error like: "docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]."?

nurpax commented 2 years ago

Can you run anything else in Docker that uses the GPU, such as some basic PyTorch samples in the official PyTorch containers? That would be a useful test to check that your Docker setup and drivers are working. Also make sure you have the NVIDIA runtime correctly installed for Docker.

One thing that sometimes happens to me personally is that, under Ubuntu, the graphics drivers get updated behind the scenes (without anyone asking me or me approving the update), and anything using CUDA or graphics then fails to run until I reboot my Ubuntu box. (The problem is a mismatch between the drivers and the Linux kernel.) I'm not saying this will fix your problem, but it may be worth trying.

Other than that, it's hard to say what could be going wrong.
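A quick smoke test along those lines is to run `nvidia-smi` in a throwaway container and inspect the error. The sketch below assumes Docker is installed and reuses the `nvidia/cuda:11.0-base` image from this thread (that tag may need updating). As far as we know, Docker only registers a GPU device driver when the NVIDIA Container Toolkit's `nvidia-container-runtime-hook` is found, which is what produces the "could not select device driver" error when it is missing. The function names are hypothetical and the diagnosis is a heuristic.

```python
import shutil
import subprocess

NO_GPU_DRIVER = 'could not select device driver ""'


def diagnose_docker_gpu(stderr: str) -> str:
    """Map the stderr of a failed `docker run --gpus ...` to a likely cause (heuristic)."""
    if NO_GPU_DRIVER in stderr:
        return ("Docker has no GPU device driver registered: install the NVIDIA "
                "Container Toolkit (e.g. nvidia-docker2) and restart the Docker daemon")
    return "unknown failure: " + stderr.strip()[:200]


def check_docker_gpu(image: str = "nvidia/cuda:11.0-base") -> str:
    """Run nvidia-smi in a throwaway container and report the outcome."""
    # Checks the current user's PATH, which may differ from the Docker daemon's.
    if shutil.which("nvidia-container-runtime-hook") is None:
        print("warning: nvidia-container-runtime-hook is not on PATH")
    result = subprocess.run(
        ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
    if result.returncode == 0:
        return "ok"
    return diagnose_docker_gpu(result.stderr)
```

Calling `check_docker_gpu()` on the host should return "ok" once the container runtime is set up; otherwise the message points at the most likely missing piece.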

Mirocos commented 2 years ago

Hi, I found that this error can be solved with the following setup.

Set up the stable repository and GPG key for the NVIDIA Container Toolkit, then install nvidia-docker2, restart Docker, and verify that a container can see the GPU:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt -y install nvidia-docker2
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

After that, running the example in Docker succeeded on my machine.

s-laine commented 2 years ago

Sounds like the problem was solved, closing.

Luh1124 commented 2 years ago

Hello, I encountered the same error here. Did you ever solve it?

Mirocos commented 2 years ago

Did you run this on Linux? If so, the --display-interval option doesn't seem to work there. I tried many times to solve this problem, and after reading the project source code along with some references, I came to my own conclusion: the GL context cannot be shared between EGL and GLFW on Linux, which causes the crash when --display-interval opens a window for real-time visualization. I guess it won't work unless you rewrite the source code to use a GLFW context instead of an EGL context.

But nvdiffrast still works for its main purposes; if you need to visualize something, you can dump the tensors to local image files.

Alternatively, why not consider using nvdiffrast on Windows?
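As a minimal sketch of the "dump the tensors to local image files" route, the helper below writes an image to a binary PPM file using only the Python standard library, so it needs neither a display nor extra packages. `encode_ppm` and `save_ppm` are hypothetical names. With nvdiffrast you would presumably pass something like `color[0].detach().cpu().numpy()` from the `render()` call in the traceback above; `flip_vertical` is there in case images come out upside down, since, as we understand it, nvdiffrast follows the OpenGL convention of the y axis pointing up.

```python
from pathlib import Path


def encode_ppm(image, flip_vertical: bool = False) -> bytes:
    """Encode an H x W x C image (C >= 3, floats in [0, 1]) as binary PPM (P6) bytes.

    `image` may be nested lists or anything with a .tolist() method, such as a
    NumPy array. Only the first three channels are written; values are clamped
    to [0, 1] and scaled to 0..255.
    """
    rows = image.tolist() if hasattr(image, "tolist") else list(image)
    if flip_vertical:
        rows = rows[::-1]  # reverse row order to turn a y-up image right side up
    height, width = len(rows), len(rows[0])
    pixels = bytearray()
    for row in rows:
        for pixel in row:
            for value in pixel[:3]:
                pixels.append(int(round(min(max(value, 0.0), 1.0) * 255)))
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(pixels)


def save_ppm(image, path, flip_vertical: bool = False) -> None:
    """Write `image` to `path` as a PPM file that most image viewers can open."""
    Path(path).write_bytes(encode_ppm(image, flip_vertical))
```

PPM is used here only to avoid dependencies; if imageio is installed (the Docker image in this thread installs it), `imageio.imwrite` on a uint8 array is the more common choice.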

------------------ Original message ------------------ From: "NVlabs/nvdiffrast" @.>; Sent: Monday, July 25, 2022, 6:51 PM @.>; @.**@.>; Subject: Re: [NVlabs/nvdiffrast] [E glutil.cpp:248] eglMakeCurrent() failed when setting GL context (#45)

Hello, I ran into the same error locally. Did you eventually solve it?


Luh1124 commented 2 years ago

Thank you. I can run it normally on Linux without "--display-interval", and I specify an output directory (--outdir) to view the results.