EGL Backend: Fail to run with CUDA OpenGL interop

duongnb09 commented 2 years ago

I tried to run VirtualGL using the EGL backend with the official CUDA OpenGL interop sample code, but it keeps failing with the following error messages:

CUDA error at main.cpp:175 code=999(cudaErrorUnknown) "cudaGraphicsGLRegisterBuffer(pbo_resource, *pbo, cudaGraphicsMapFlagsNone)"
X Error of failed request:  0
  Major opcode of failed request:  150 (GLX)
  Minor opcode of failed request:  4 (X_GLXDestroyContext)
  Serial number of failed request:  93
  Current serial number in output stream:  93

dcommander commented 2 years ago

I can reproduce the error, but unfortunately I cannot readily figure out why it is occurring, because nVidia's libraries are closed-source. This may be similar to the issues with their Vulkan drivers (https://forums.developer.nvidia.com/t/headless-vulkan-with-multiple-gpus/222832/15), in that the CUDA driver may make some assumptions regarding the X display that are not valid in a remote display environment.

dcommander commented 2 years ago

Note that OpenCL/OpenGL interop also doesn't work with the EGL back end, because nVidia doesn't support the CL_EGL_DISPLAY_KHR property in their implementation of clCreateContext(). It seems that, despite supporting device-based EGL with OpenGL, some of nVidia's other APIs are tied to X11 in some way.

dcommander commented 9 months ago

This is still an issue, unfortunately, and it doesn't appear as if it's something that can be fixed in VirtualGL. I can only guess that CUDA is somehow complaining about the fact that VirtualGL is sneaking in an EGL context behind the scenes when CUDA expects a GLX context. (Perhaps CUDA OpenGL interop is tied to the NV-GLX extension in some way?) Anyhow, in order to fully diagnose it, I will need to create a minimally reproducible test case (i.e. to demonstrate how to reproduce the issue without VGL) and forward the issue to nVidia. Since CUDA is a proprietary API, I don't have any particular desire to do any of that work for free, so I have tagged this issue as "funding needed."

mp3guy commented 3 months ago

I'm also hitting this issue where cudaGraphicsGLRegisterImage fails with error 304, cudaErrorOperatingSystem. However, I can confirm that this works fine without VirtualGL in situations where you use an EGL context with no GLX, both "headless" and windowed. So it seems unlikely it's specifically EGL related.

I guess one workaround would be for the application to open its own EGL context, separate from the VirtualGL one, since that works for interop. That context could then be shared with the VirtualGL one for presentation/rendering?

I can confirm this issue persists under VirtualGL whether the host application creates its GL context through EGL or GLX; the only difference is that with GLX, the error is cudaErrorUnknown.
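
For anyone comparing the two failure modes, here is a minimal sketch that names the numeric codes reported in this thread (999 and 304). The values are the ones listed for cudaError_t in the CUDA runtime headers; the helper name describeCudaCode is hypothetical and is not part of CUDA or VirtualGL. In a real program, cudaGetErrorName() from the CUDA runtime gives the same information for any code.

```cpp
#include <string>

// Hypothetical helper (not a CUDA or VirtualGL API): names only the
// cudaError_t values that come up in this thread. Anything else is
// reported as unrecognized rather than guessed.
std::string describeCudaCode(int code) {
    switch (code) {
        case 0:   return "cudaSuccess";
        case 304: return "cudaErrorOperatingSystem";  // an OS call failed
        case 999: return "cudaErrorUnknown";          // unknown internal error
        default:  return "unrecognized (" + std::to_string(code) + ")";
    }
}
```

Calling describeCudaCode(999) returns "cudaErrorUnknown" (the code in the original report), and describeCudaCode(304) returns "cudaErrorOperatingSystem" (the code seen with cudaGraphicsGLRegisterImage above).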

dcommander commented 3 months ago

Here's what I observe on my Rocky Linux 8.5 machine with CUDA Toolkit 12.6, nVidia 550.90.07, and a Quadro P620:

simpleCUDA2GL with USE_TEXSUBIMAGE2D defined (which causes the program to call cudaGraphicsGLRegisterBuffer()) works fine with the GLX back end, but cudaGraphicsGLRegisterBuffer() fails with cudaErrorUnknown when using the EGL back end.
simpleCUDA2GL with USE_TEXSUBIMAGE2D undefined (which causes the program to call cudaGraphicsGLRegisterImage()) works fine with the GLX back end, but cudaGraphicsGLRegisterImage() fails with cudaErrorUnknown when using the EGL back end.

Unfortunately, since CUDA is closed-source, I am completely clueless as to how to diagnose the issue. I used APITrace to obtain a trace of simpleCUDA2GL, but it doesn't show any OpenGL calls being made from within CUDA. It shows only the calls being made by the simpleCUDA2GL program itself.

dcommander commented 3 months ago

NOTE: Since the resources being passed to CUDA are OpenGL resources, not GLX or EGL resources, it shouldn't really matter how the context was created (but apparently it does, which is the fundamental mystery behind this issue).

EGL Backend: Fail to run with CUDA OpenGL interop #209