Add missing python dependency for examples

nachovizzo commented 3 years ago

Following this example, I encountered the following error:

./run_sample.sh samples/torch/cube.py --resolution 32 --display-interval 10            
Using container image: gltorch:latest
Running command: samples/torch/cube.py --resolution 32 --display-interval 10
No output directory specified, not saving log or images
Mesh has 12 triangles and 8 vertices.
iter=0,err=0.473205
Traceback (most recent call last):
  File "samples/torch/cube.py", line 200, in <module>
    main()
  File "samples/torch/cube.py", line 191, in main
    mp4save_fn='progress.mp4'
  File "samples/torch/cube.py", line 150, in fit_cube
    util.display_image(result_image, size=display_res, title='%d / %d' % (it, max_iter))
  File "/app/samples/torch/util.py", line 69, in display_image
    import OpenGL.GL as gl
ModuleNotFoundError: No module named 'OpenGL'

Later on, I found that glfw was also missing.

This PR solves both problems by updating the Docker image.
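A quick way to confirm which display dependencies are missing before running a sample is sketched below. It is only an illustration, not part of the PR: the helper name `missing_display_deps` is hypothetical, and the pip package names (`PyOpenGL`, `glfw`) are the usual ones for the `OpenGL` and `glfw` modules named in the traceback.

```python
import importlib.util

# Module name as imported -> package name as usually installed with pip.
# The mapping is an assumption for illustration, not taken from the PR.
REQUIRED = {"OpenGL": "PyOpenGL", "glfw": "glfw"}


def missing_display_deps(required=REQUIRED):
    """Return the pip names of the modules that cannot be imported."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_display_deps()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("Display dependencies look importable.")
```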

nurpax commented 3 years ago

Thanks for the PR. I will add the required deps into our upstream version and they will flow into the GitHub version at some point. Unfortunately, we don't merge PRs on GitHub.

The display interval parameter will open a window and show optimization results with OpenGL. So to run it, I think you also need to give some extra args to Docker, something like:

# see https://developer.nvidia.com/blog/gpu-containers-runtime/
xhost +si:localuser:root # allow root user to access the running X server
docker run --rm -it --gpus all --user $(id -u):$(id -g) -v `pwd`:/app -v /tmp/.X11-unix:/tmp/.X11-unix --workdir /app -e DISPLAY -e TORCH_EXTENSIONS_DIR=/app/tmp gltorch:latest python3 ./samples/torch/cube.py --resolution 32 --display-interval 10

When I try this I hit a bug though:

No output directory specified, not saving log or images
Mesh has 12 triangles and 8 vertices.
iter=0,err=0.472302
[E glutil.inl:164] eglMakeCurrent() failed when setting GL context
Traceback (most recent call last):
  File "./samples/torch/cube.py", line 200, in <module>
    main()
  File "./samples/torch/cube.py", line 191, in main
    mp4save_fn='progress.mp4'
  File "./samples/torch/cube.py", line 122, in fit_cube
    color     = render(glctx, r_mvp, vtx_pos, pos_idx, vtx_col, col_idx, resolution)
  File "./samples/torch/cube.py", line 30, in render
    rast_out, _ = dr.rasterize(glctx, pos_clip, pos_idx, resolution=[resolution, resolution])
  File "/opt/conda/lib/python3.7/site-packages/nvdiffrast/torch/ops.py", line 223, in rasterize
    return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db)
  File "/opt/conda/lib/python3.7/site-packages/nvdiffrast/torch/ops.py", line 165, in forward
    out, out_db = _get_plugin().rasterize_fwd(glctx.cpp_wrapper, pos, tri, resolution, ranges)
RuntimeError: CUDA error: invalid OpenGL or DirectX context

This looks like a bug on our side.

@nachovizzo, you didn't hit this problem?

nachovizzo commented 3 years ago

I did face a similar issue:

Using container image: gltorch:latest
Running command: samples/torch/earth.py --display-interval 10
No output directory specified, not saving log or images
Mesh has 12288 triangles and 6146 vertices.
iter=0,loss=0.272971,psnr=11.277677
/opt/conda/lib/python3.7/site-packages/glfw/__init__.py:834: GLFWError: (65544) b'X11: The DISPLAY environment variable is missing'
  warnings.warn(message, GLFWError)
/opt/conda/lib/python3.7/site-packages/glfw/__init__.py:834: GLFWError: (65537) b'The GLFW library is not initialized'
  warnings.warn(message, GLFWError)
python3: /builds/florianrhiem/pyGLFW/glfw-3.3.2/src/posix_thread.c:64: _glfwPlatformGetTls: Assertion `tls->posix.allocated == 1' failed.

But I attributed it to the fact that I was using an SSH connection. Running now in Docker with the flags above, I get the following error:

docker run --rm -it --gpus all --user $(id -u):$(id -g) -v `pwd`:/app -v /tmp/.X11-unix:/tmp/.X11-unix --workdir /app -e DISPLAY -e TORCH_EXTENSIONS_DIR=/app/tmp gltorch:latest python3 ./samples/torch/cube.py --resolution 32 --display-interval 10

No output directory specified, not saving log or images
Mesh has 12 triangles and 8 vertices.
[E glutil.inl:195] eglInitialize() failed
[E glutil.inl:214] eglChooseConfig() failed
[E glutil.inl:226] eglCreatePbufferSurface() failed
[E glutil.inl:235] eglCreateContext() failed
[E glutil.inl:161] setGLContext() called with null gltcx
[E glutil.inl:171] glewInit() failed, return value = 1
Traceback (most recent call last):
  File "./samples/torch/cube.py", line 200, in <module>
    main()
  File "./samples/torch/cube.py", line 191, in main
    mp4save_fn='progress.mp4'
  File "./samples/torch/cube.py", line 76, in fit_cube
    glctx = dr.RasterizeGLContext()
  File "/opt/conda/lib/python3.7/site-packages/nvdiffrast/torch/ops.py", line 142, in __init__
    self.cpp_wrapper = _get_plugin().RasterizeGLStateWrapper(output_db, mode == 'automatic')
RuntimeError: OpenGL 4.4 or later is required
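The first log above shows GLFW failing because the DISPLAY environment variable is missing, which is typical of a plain SSH session. A minimal pre-flight check could look like the sketch below. The helper name `can_open_x11_window` is hypothetical, and the check only covers the missing-variable case; it says nothing about whether the X server is actually reachable.

```python
import os


def can_open_x11_window(env=os.environ):
    """Return True if DISPLAY is set to a non-empty value.

    This only detects the "DISPLAY environment variable is missing"
    case; it does not verify that the X server accepts connections.
    """
    return bool(env.get("DISPLAY"))


if __name__ == "__main__":
    if not can_open_x11_window():
        print("DISPLAY is not set; skip --display-interval in this session.")
```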

nairb2020 commented 3 years ago

@nachovizzo I think I'm running into a similar issue when running on a Linux server on AWS (over ssh). When I run

nvdiffrast/samples/torch/envphong.py

I get

Creating GL context for Cuda device 0
Failed, falling back to default display
eglInitialize() failed
eglChooseConfig() failed
eglCreateContext() failed
EGL 1471947312.32765 OpenGL context created (disp: 0x0000000082415470, ctx: 0x0000000000000000)
setGLContext() called with null gltcx
Traceback (most recent call last):
  File "nvdiffrast/samples/torch/envphong.py", line 226, in <module>
    main()
  File "nvdiffrast/samples/torch/envphong.py", line 211, in main
    fit_env_phong(
  File "nvdiffrast/samples/torch/envphong.py", line 77, in fit_env_phong
    glctx = dr.RasterizeGLContext()
  File "/usr/local/lib/python3.8/dist-packages/nvdiffrast/torch/ops.py", line 151, in __init__
    self.cpp_wrapper = _get_plugin().RasterizeGLStateWrapper(output_db, mode == 'automatic', cuda_device_idx)
RuntimeError: OpenGL 4.4 or later is required

I tried to replicate all the libraries inside the docker image. Since I'm trying to develop on top of it, I prefer to use virtual environments instead of docker. Could you share some insight into how you got it to work in the end?

nachovizzo commented 3 years ago

Hello there. I guess (because I don't quite remember) that the changes in this PR solved that problem for me. I hope that helps.

nurpax commented 3 years ago

Headless operation should not need glfw or pyopengl, so I don't quite see how the PR would fix headless operation.

The samples do need glfw and pyopengl, but I don't include those in the Dockerfile as I never got nvdiffrast headless rendering and interactive mode working with Docker and Linux.

Creating GL context for Cuda device 0
Failed, falling back to default display
eglInitialize() failed
eglChooseConfig() failed
eglCreateContext() failed

Looks like all EGL related calls are failing. I'd debug further by adding some error checking related prints into https://github.com/NVlabs/nvdiffrast/blob/main/nvdiffrast/common/glutil.cpp.

nairb2020 commented 3 years ago

I tried to run it and it fails at

eglCreateContext(display, config, EGL_NO_CONTEXT, NULL);

with the error:

"eglCreateContext() Failed", error code 0x3005 -> EGL_BAD_CONFIG.

This is a bit strange, since I thought the config comes from the previous line, eglChooseConfig(display, context_attribs, &config, 1, &num_config), and that call succeeded. Any hints?
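When adding error prints around the EGL calls, the raw value returned by eglGetError() is easier to read once it is mapped back to its symbolic name. The table below uses the standard EGL error constants; the helper name `egl_error_name` is hypothetical and only meant as a lookup aid while reading logs.

```python
# Standard EGL error codes (0x3000 to 0x300E).
EGL_ERRORS = {
    0x3000: "EGL_SUCCESS",
    0x3001: "EGL_NOT_INITIALIZED",
    0x3002: "EGL_BAD_ACCESS",
    0x3003: "EGL_BAD_ALLOC",
    0x3004: "EGL_BAD_ATTRIBUTE",
    0x3005: "EGL_BAD_CONFIG",
    0x3006: "EGL_BAD_CONTEXT",
    0x3007: "EGL_BAD_CURRENT_SURFACE",
    0x3008: "EGL_BAD_DISPLAY",
    0x3009: "EGL_BAD_MATCH",
    0x300A: "EGL_BAD_NATIVE_PIXMAP",
    0x300B: "EGL_BAD_NATIVE_WINDOW",
    0x300C: "EGL_BAD_PARAMETER",
    0x300D: "EGL_BAD_SURFACE",
    0x300E: "EGL_CONTEXT_LOST",
}


def egl_error_name(code):
    """Translate a numeric EGL error code into its symbolic name."""
    return EGL_ERRORS.get(code, f"unknown EGL error 0x{code:04X}")


if __name__ == "__main__":
    print(egl_error_name(0x3005))
```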

nairb2020 commented 3 years ago

This is fixed in https://github.com/NVlabs/nvdiffrast/issues/24. Turns out I just needed to upgrade the NVIDIA driver and reboot. 🤣

changkun commented 3 years ago

I have the same issue even though I have the latest version of the code. What could be going wrong?

$ python cube.py --resolution 16 --display-interval 10
No output directory specified, not saving log or images
Mesh has 12 triangles and 8 vertices.
iter=0,err=0.465402
[E glutil.cpp:248] eglMakeCurrent() failed when setting GL context
Traceback (most recent call last):
  File "cube.py", line 200, in <module>
    main()
  File "cube.py", line 191, in main
    mp4save_fn='progress.mp4'
  File "cube.py", line 122, in fit_cube
    color     = render(glctx, r_mvp, vtx_pos, pos_idx, vtx_col, col_idx, resolution)
  File "cube.py", line 30, in render
    rast_out, _ = dr.rasterize(glctx, pos_clip, pos_idx, resolution=[resolution, resolution])
  File "/home/changkun/miniconda3/envs/diffrast/lib/python3.7/site-packages/nvdiffrast-0.2.5-py3.7.egg/nvdiffrast/torch/ops.py", line 237, in rasterize
    return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db, -1)
  File "/home/changkun/miniconda3/envs/diffrast/lib/python3.7/site-packages/nvdiffrast-0.2.5-py3.7.egg/nvdiffrast/torch/ops.py", line 175, in forward
    out, out_db = _get_plugin().rasterize_fwd(glctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
RuntimeError: Cuda error: 219[cudaGraphicsMapResources(2, &s.cudaPosBuffer, stream);]

$ python -V
Python 3.7.10

$ python -c "import torch; print(torch.version.cuda)"
11.1

nairb2020 commented 3 years ago

@changkun your issue is slightly different from mine. Mine was failing at

"eglCreateContext() Failed", error code 0x3005 -> EGL_BAD_CONFIG.

Could you print out your C code stack trace so it's easier to diagnose? For me, I had to restart my Linux server and that fixed it.

changkun commented 3 years ago

@nairb2020 Thanks for the swift response. How do I print the C stack trace when Python is calling into C code?

s-laine commented 3 years ago

Hi @changkun! The failure occurs when trying to map OpenGL buffers to Cuda memory space. I re-inspected the code related to Cuda/OpenGL buffer management but I cannot immediately see what could cause the bug that you're seeing. The error code (219: cudaErrorInvalidGraphicsContext) suggests a problem with the OpenGL context, which in turn points to something in the OS or graphics drivers.
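For reference, the numeric code and the failing call can be pulled out of the RuntimeError text with a small parser. This is a sketch that assumes the message format seen in this thread ("Cuda error: 219[...]"); the helper name `parse_cuda_error` is hypothetical, and the name table contains only the one code discussed here.

```python
import re

# Matches messages of the form "Cuda error: <code>[<failing call>]".
_CUDA_ERROR = re.compile(r"Cuda error: (\d+)\[(.*)\]", re.IGNORECASE)

# Only the code mentioned in this thread; extend as needed.
KNOWN_CODES = {219: "cudaErrorInvalidGraphicsContext"}


def parse_cuda_error(message):
    """Return (code, name, failing_call), or None if the text does not match."""
    match = _CUDA_ERROR.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, KNOWN_CODES.get(code, "unknown"), match.group(2)


if __name__ == "__main__":
    text = "Cuda error: 219[cudaGraphicsMapResources(2, &s.cudaPosBuffer, stream);]"
    print(parse_cuda_error(text))
```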

Could you try calling dr.set_log_level(0) before any other nvdiffrast calls and paste the resulting log here? There's no need for a more detailed stack trace, as the failing call on the C++ side is already shown in the output.
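One place to put that call is at the very top of the script, before any rasterizer context is created. The sketch below wraps it in a hypothetical helper, `enable_verbose_nvdiffrast_logging`, that returns False instead of failing when nvdiffrast is not installed. Level 0 is understood here to be the most verbose (info) level.

```python
def enable_verbose_nvdiffrast_logging(level=0):
    """Call dr.set_log_level(level) if nvdiffrast can be imported.

    Returns True when the log level was set, False when nvdiffrast
    is not available in the current environment.
    """
    try:
        import nvdiffrast.torch as dr
    except ImportError:
        return False
    dr.set_log_level(level)
    return True


if __name__ == "__main__":
    # Run this before dr.RasterizeGLContext() or any other nvdiffrast call.
    print("verbose logging enabled:", enable_verbose_nvdiffrast_logging())
```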

changkun commented 3 years ago

Hi @s-laine , thanks very much for inspecting the error.

I just added the call you suggested before the main function of the cube.py example:

$ python cube.py --resolution 16 --display-interval 10
No output directory specified, not saving log or images
Mesh has 12 triangles and 8 vertices.
[I glutil.cpp:322] Creating GL context for Cuda device 0
[I glutil.cpp:370] EGL 5.1 OpenGL context created (disp: 0x000055f4db410210, ctx: 0x000055f46d6e8a91)
[I rasterize.cpp:91] OpenGL version reported as 4.6
iter=0,err=0.496746
[I rasterize.cpp:332] Increasing position buffer size to 64 float32
[I rasterize.cpp:343] Increasing triangle buffer size to 64 int32
[I rasterize.cpp:368] Increasing frame buffer size to (width, height, depth) = (32, 32, 1)
[I rasterize.cpp:394] Increasing range array size to 64 elements
[I rasterize.cpp:368] Increasing frame buffer size to (width, height, depth) = (512, 512, 1)
[E glutil.cpp:248] eglMakeCurrent() failed when setting GL context
Traceback (most recent call last):
  File "cube.py", line 201, in <module>
    main()
  File "cube.py", line 191, in main
    mp4save_fn='progress.mp4'
  File "cube.py", line 122, in fit_cube
    color     = render(glctx, r_mvp, vtx_pos, pos_idx, vtx_col, col_idx, resolution)
  File "cube.py", line 30, in render
    rast_out, _ = dr.rasterize(glctx, pos_clip, pos_idx, resolution=[resolution, resolution])
  File "/home/changkun/miniconda3/envs/diffrast/lib/python3.7/site-packages/nvdiffrast-0.2.5-py3.7.egg/nvdiffrast/torch/ops.py", line 237, in rasterize
    return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db, -1)
  File "/home/changkun/miniconda3/envs/diffrast/lib/python3.7/site-packages/nvdiffrast-0.2.5-py3.7.egg/nvdiffrast/torch/ops.py", line 175, in forward
    out, out_db = _get_plugin().rasterize_fwd(glctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
RuntimeError: Cuda error: 219[cudaGraphicsMapResources(2, &s.cudaPosBuffer, stream);]
[I glutil.cpp:391] EGL OpenGL context destroyed (disp: 0x000055f4db410210, ctx: 0x000055f46d6e8a91)

and pose.py

$ python pose.py --display-interval 10
No output directory specified, not saving log or images
Mesh has 12 triangles and 24 vertices.
[I glutil.cpp:322] Creating GL context for Cuda device 0
[I glutil.cpp:370] EGL 5.1 OpenGL context created (disp: 0x0000556b7adb4e20, ctx: 0x0000556b0e05e971)
[I rasterize.cpp:91] OpenGL version reported as 4.6
[I rasterize.cpp:332] Increasing position buffer size to 96 float32
[I rasterize.cpp:343] Increasing triangle buffer size to 64 int32
[I rasterize.cpp:368] Increasing frame buffer size to (width, height, depth) = (256, 256, 1)
[I rasterize.cpp:394] Increasing range array size to 64 elements
rep=0,iter=0,err=147.363927,err_best=147.363927,loss=0.211817,loss_best=0.211817,lr=0.010000,nr=1.000000
[E glutil.cpp:248] eglMakeCurrent() failed when setting GL context
Traceback (most recent call last):
  File "pose.py", line 291, in <module>
    main()
  File "pose.py", line 281, in main
    mp4save_fn='progress.mp4'
  File "pose.py", line 195, in fit_pose
    color          = render(glctx, torch.matmul(mvp, q_to_mtx(pose_target)), vtx_pos, pos_idx, vtx_col, col_idx, resolution)
  File "pose.py", line 112, in render
    rast_out, _ = dr.rasterize(glctx, pos_clip, pos_idx, resolution=[resolution, resolution])
  File "/home/changkun/miniconda3/envs/diffrast/lib/python3.7/site-packages/nvdiffrast-0.2.5-py3.7.egg/nvdiffrast/torch/ops.py", line 237, in rasterize
    return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db, -1)
  File "/home/changkun/miniconda3/envs/diffrast/lib/python3.7/site-packages/nvdiffrast-0.2.5-py3.7.egg/nvdiffrast/torch/ops.py", line 175, in forward
    out, out_db = _get_plugin().rasterize_fwd(glctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
RuntimeError: Cuda error: 219[cudaGraphicsMapResources(2, &s.cudaPosBuffer, stream);]
[I glutil.cpp:391] EGL OpenGL context destroyed (disp: 0x0000556b7adb4e20, ctx: 0x0000556b0e05e971)

and envphong.py:

$ python envphong.py --display-interval 10
No output directory specified, not saving log or images
Mesh has 30720 triangles and 15362 vertices.
[I glutil.cpp:322] Creating GL context for Cuda device 0
[I glutil.cpp:370] EGL 5.1 OpenGL context created (disp: 0x000055c64ec9fe30, ctx: 0x000055c5e046bd01)
[I rasterize.cpp:91] OpenGL version reported as 4.6
[I rasterize.cpp:332] Increasing position buffer size to 65536 float32
[I rasterize.cpp:343] Increasing triangle buffer size to 98304 int32
[I rasterize.cpp:368] Increasing frame buffer size to (width, height, depth) = (1024, 1024, 1)
[I rasterize.cpp:394] Increasing range array size to 64 elements
iter=0,phong_rgb_rmse=0.398101,phong_exp_rel_err=0.627478,img_rmse=0.016194
[E glutil.cpp:248] eglMakeCurrent() failed when setting GL context
Traceback (most recent call last):
  File "envphong.py", line 227, in <module>
    main()
  File "envphong.py", line 217, in main
    mp4save_fn='progress.mp4'
  File "envphong.py", line 131, in fit_env_phong
    refl, refld, ldotr, mask = render_refl(lightdir, r_campos, r_mvp)
  File "envphong.py", line 120, in render_refl
    rast_out, rast_out_db = dr.rasterize(glctx, pos_clip, pos_idx, [res, res])
  File "/home/changkun/miniconda3/envs/diffrast/lib/python3.7/site-packages/nvdiffrast-0.2.5-py3.7.egg/nvdiffrast/torch/ops.py", line 237, in rasterize
    return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db, -1)
  File "/home/changkun/miniconda3/envs/diffrast/lib/python3.7/site-packages/nvdiffrast-0.2.5-py3.7.egg/nvdiffrast/torch/ops.py", line 175, in forward
    out, out_db = _get_plugin().rasterize_fwd(glctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
RuntimeError: Cuda error: 219[cudaGraphicsMapResources(2, &s.cudaPosBuffer, stream);]
[I glutil.cpp:391] EGL OpenGL context destroyed (disp: 0x000055c64ec9fe30, ctx: 0x000055c5e046bd01)

s-laine commented 3 years ago

This looks like a conflict between the OpenGL contexts used by nvdiffrast and used for showing the interactive results window (--display-interval parameter). Do you see the window flashing open before the crash? Do you get errors if you leave the parameter out so that there's no visual output?

Are you running this in Docker? As noted in the comment above, headless rendering (as used by nvdiffrast) + results window + Docker = problems, and we unfortunately don't know why. Based on the other comments, it sounds like the combination works for some users, so I guess it may be related to graphics driver version, possibly Docker version, or even glfw and pyopengl versions.

The log shows that nvdiffrast manages to render at first but then fails to use the same OpenGL context again sometime later. In between, there is presumably at least an attempt to show the result image to the user by opening a window, which uses glfw and pyopengl. My hunch is that when opening the window, glfw or pyopengl does something that effectively reinitializes EGL or otherwise causes our internal OpenGL context to be invalidated. As a workaround, you could try opening the window before creating the OpenGL context for nvdiffrast — maybe it works in the other order. You can do this with a call such as

util.display_image(np.zeros([256, 256, 3], dtype=np.float32))

somewhere before glctx = dr.RasterizeGLContext().
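The ordering described above can be sketched as a small helper that takes the two steps as callables, so it can be tried without editing the sample's internals. The name `init_in_workaround_order` is hypothetical, and whether this ordering actually avoids the conflict depends on the system.

```python
def init_in_workaround_order(open_window, create_context):
    """Open the display window first, then create the rasterizer context."""
    open_window()
    return create_context()


# In cube.py this might be wired up as follows (names taken from the thread):
#
#   import numpy as np
#   glctx = init_in_workaround_order(
#       lambda: util.display_image(np.zeros([256, 256, 3], dtype=np.float32)),
#       dr.RasterizeGLContext,
#   )
```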

changkun commented 3 years ago

Thanks again for helping out.

Do you see the window flashing open before the crash?

Yes. I saw a window with rendered content, then it exited immediately with that error.

Do you get errors if you leave the parameter out so that there's no visual output?

No. Everything seems to work without the --display-interval param.

Are you running this in Docker?

No. It is a local miniconda environment.

You can do this with a call such as util.display_image(np.zeros([256, 256, 3], dtype=np.float32)) somewhere before glctx = dr.RasterizeGLContext().

I just tried inserting the given line directly before the dr.RasterizeGLContext() call in the cube.py example. A black window pops up (without the rendered content), then the program exits within seconds with the following error:

$ python cube.py --resolution 16 --display-interval 10
No output directory specified, not saving log or images
Mesh has 12 triangles and 8 vertices.
[I glutil.cpp:322] Creating GL context for Cuda device 0
[I glutil.cpp:370] EGL 5.1 OpenGL context created (disp: 0x0000556c12cd9f80, ctx: 0x0000556ba41607d1)
[E glutil.cpp:248] eglMakeCurrent() failed when setting GL context
[I rasterize.cpp:91] OpenGL version reported as 4.6
[W glutil.cpp:260] releaseGLContext() called with no active display
[F glutil.cpp:262] eglMakeCurrent() failed when releasing GL context
[1]    1824431 abort (core dumped)  python cube.py --resolution 16 --display-interval 10

s-laine commented 3 years ago

Thanks for the information. It appears there's a fundamental conflict between glfw/pyopengl and nvdiffrast on Linux, and unfortunately I don't have any further advice for troubleshooting. The interactive display window was originally tested on Windows where it works fine, but the maze of different graphics libraries on Linux makes this much more difficult.

I may try to sort this out and find a working configuration some time in the future when I get access to a Linux machine. But for now I think we'll have to just declare the --display-interval parameter as unsupported on Linux.
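Until that is sorted out, a script built on the samples could guard against the unsupported combination itself. The sketch below is only an illustration: the helper name `effective_display_interval` is hypothetical, and it assumes that a display interval of 0 means "no window".

```python
import sys


def effective_display_interval(requested, platform=sys.platform):
    """Return 0 (no window) on Linux, otherwise the requested interval.

    Assumes 0 disables the interactive display.
    """
    if requested and platform.startswith("linux"):
        print("warning: --display-interval is unsupported on Linux; disabling it")
        return 0
    return requested


if __name__ == "__main__":
    print(effective_display_interval(10))
```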

NVlabs / nvdiffrast

Add missing python dependency for examples #7