TurboVNC / turbovnc

Direct GPU acceleration #373

Closed: dcommander closed this issue 10 months ago

dcommander commented 11 months ago

I wanted to create this issue to document my findings vis-à-vis adding GPU acceleration directly to TurboVNC, thus eliminating the need for VirtualGL. https://github.com/kasmtech/KasmVNC/commit/d04982125a04962ca4a6d9829b0cdad5793db324 implements DRI3 in KasmVNC, which ostensibly adds GPU acceleration when using open source GPU drivers. It was straightforward to port that code into TurboVNC (although it was necessary to build with TVNC_SYSTEMX11=1). As of this writing, there are still some major bugs in the feature (https://github.com/kasmtech/KasmVNC/issues/146), so I am not yet prepared to declare the problem solved, but I have high hopes that Kasm will iron out those issues. If they do, then TurboVNC will be able to provide GPU acceleration, without VirtualGL, when using open source GPU drivers. However, I don't think it will ever be possible to do likewise with nVidia's proprietary drivers, at least not as long as those drivers retain their current architecture.
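
For context, a DRI3 back end in an X server is registered by filling in a dri3_screen_info_rec and calling dri3_screen_init() (xorg-server's dri3.h). The sketch below is only a rough illustration of those entry points, not KasmVNC's or TurboVNC's actual code; the callback signatures are written from memory and may differ slightly between xorg-server versions, and the function names and render-node path are placeholders.

```c
/* Rough illustration of DRI3 back-end registration against the xorg-server
 * SDK (dri3.h).  Not KasmVNC's or TurboVNC's actual code; names such as
 * vncDRI3Open/vncDRI3PixmapFromFD and the render-node path are placeholders. */
#include <fcntl.h>
#include <unistd.h>
#include "scrnintstr.h"
#include "pixmapstr.h"
#include "randrstr.h"
#include "dri3.h"

/* DRI3Open: give the client a file descriptor for the GPU's render node. */
static int
vncDRI3Open(ScreenPtr screen, RRProviderPtr provider, int *fd)
{
    /* A real implementation would open the node belonging to the GPU that
     * the server itself opened with GBM, not a hard-coded path. */
    *fd = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);
    return (*fd >= 0) ? Success : BadAlloc;
}

/* DRI3PixmapFromFD: wrap a client-supplied dma-buf and back it with a
 * system-memory pixmap that must later be kept in sync with the GPU-side
 * buffer (the approach described above). */
static PixmapPtr
vncDRI3PixmapFromFD(ScreenPtr screen, int fd, CARD16 width, CARD16 height,
                    CARD16 stride, CARD8 depth, CARD8 bpp)
{
    PixmapPtr pixmap = screen->CreatePixmap(screen, width, height, depth, 0);

    /* ... import fd into a GBM buffer object (gbm_bo_import()) and remember
     * the (pixmap, buffer object) pair for later synchronization ... */
    return pixmap;
}

static const dri3_screen_info_rec vncDRI3Info = {
    .version = 1,
    .open = vncDRI3Open,
    .pixmap_from_fd = vncDRI3PixmapFromFD,
    /* .fd_from_pixmap would export a pixmap back to the client. */
};

Bool
vncDRI3Init(ScreenPtr screen)
{
    return dri3_screen_init(screen, &vncDRI3Info);
}
```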

To the best of my understanding (please correct any mistaken assertions I make below):

I certainly don't claim that my knowledge is complete or final, but to the best of my current understanding, implementing direct GPU acceleration in Xvnc when using nVidia's proprietary drivers will not be possible. VirtualGL will still be necessary with those drivers. I am certainly open to being proven wrong.

dcommander commented 10 months ago

Kasm worked around the issue in their DRI3 implementation, but the workaround is problematic. The basic problem is that their DRI3 implementation creates pixmaps in system memory and maintains a GBM buffer object (in GPU memory) for each, so it has to synchronize the pixels between system memory and GPU memory whenever either the buffer object or the pixmap changes. (NOTE: VirtualGL's implementation of GLX_EXT_texture_from_pixmap has to do that as well, albeit on a more coarse-grained level.)

It is straightforward to figure out when a buffer object should be synchronized to its corresponding DRI3-managed pixmap, because Xvnc hooks into the X11 operations that read from the pixmap (always the Composite() or CopyArea() screen methods, in my testing). However, it is not straightforward to figure out when a DRI3-managed pixmap should be synchronized to its corresponding buffer object, because the buffer object seems to be read outside of X11. That is consistent with my own experience of how direct rendering works. It bypasses X11, which is one reason why screen scrapers don't work with GPU-accelerated 3D applications unless you scrape the screen on a timer. That is basically what Kasm's DRI3 implementation does. It maintains a list of active buffer objects and synchronizes all of them with their corresponding DRI3-managed pixmaps 60 times/second, regardless of whether the pixmaps have actually changed. As you can imagine, this creates a significant amount of performance overhead, and I am skeptical of whether it is free from compatibility issues.

Irrespective of the aforementioned timer, DRI3 is capped to the screen refresh rate, which is 60 Hz in Xvnc. Thus, in my testing with the AMDGPU driver, the DRI3 implementation feels like VirtualGL if you set VGL_FPS=60 and VGL_SPOIL=0 (no frame spoiling and frame-rate-limited to 60 Hz), only less smooth. There is a noticeable lag between mouse interaction and rendering, even on a gigabit network (which is why frame spoiling exists in VGL).
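
To make the timer-driven synchronization concrete, here is a rough sketch of what it amounts to, assuming a fixed array of tracked (pixmap, buffer object) pairs. This is not KasmVNC's actual code; the type and variable names are invented, and the copy loop assumes 32-bits-per-pixel pixmaps backed by system memory (devPrivate.ptr).

```c
/* Rough sketch of the timer-driven synchronization described above, not
 * KasmVNC's actual code.  Every ~16 ms, each tracked GBM buffer object
 * (GPU memory) is copied into the system-memory pixmap that shadows it,
 * whether or not anything changed. */
#include <stdint.h>
#include <string.h>
#include <gbm.h>
#include "os.h"          /* TimerSet(), OsTimerPtr */
#include "pixmapstr.h"

typedef struct {
    PixmapPtr pixmap;      /* system-memory pixmap handed to X11 */
    struct gbm_bo *bo;     /* GPU-side buffer object imported via DRI3 */
} VncDRI3Pixmap;

#define MAX_DRI3_PIXMAPS 128                /* illustrative fixed capacity */
static VncDRI3Pixmap dri3Pixmaps[MAX_DRI3_PIXMAPS];
static int numDRI3Pixmaps;

static void
syncBOToPixmap(VncDRI3Pixmap *p)
{
    uint32_t width = p->pixmap->drawable.width;
    uint32_t height = p->pixmap->drawable.height;
    uint32_t stride;
    void *mapData = NULL;

    /* Map the buffer object for CPU reads (this can stall on the GPU). */
    uint8_t *src = gbm_bo_map(p->bo, 0, 0, width, height,
                              GBM_BO_TRANSFER_READ, &stride, &mapData);
    if (!src)
        return;

    uint8_t *dst = p->pixmap->devPrivate.ptr;
    for (uint32_t y = 0; y < height; y++)    /* assumes 32 bits per pixel */
        memcpy(dst + y * p->pixmap->devKind, src + y * stride, width * 4);

    gbm_bo_unmap(p->bo, mapData);
}

/* Fires ~60 times/second, regardless of whether the pixmaps changed.
 * Armed once at startup with:  TimerSet(NULL, 0, 16, vncDRI3SyncTimer, NULL); */
static CARD32
vncDRI3SyncTimer(OsTimerPtr timer, CARD32 now, void *arg)
{
    for (int i = 0; i < numDRI3Pixmaps; i++)
        syncBOToPixmap(&dri3Pixmaps[i]);
    return 16;                      /* reschedule in ~16 ms (~60 Hz) */
}
```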

I spent 20-30 uncompensated hours trying to improve the implementation but was unable to do so. To the best of my understanding, it would be necessary to store pixmaps in GPU memory in order to implement DRI3 cleanly. That would require storing the whole framebuffer in GPU memory, which virtual X servers such as Xvnc cannot do. Thus, at the moment, I do not think that this solution is appropriate for TurboVNC, since it has significant performance drawbacks relative to VirtualGL. I think that the limited resources of The VirtualGL Project are better spent improving the compatibility of VirtualGL's EGL back end or looking into a TurboVNC Wayland compositor, which could cleanly use GPU memory and potentially pass through GPU acceleration to Xwayland without the need to deal with any of this mess at the X11 level.

dcommander commented 4 months ago

I changed my mind and implemented this, since it provides a solution for using Vulkan with the AMDGPU drivers. (Whereas nVidia's Vulkan implementation does something VirtualGL-like when running in TurboVNC, AMD's implementation doesn't work without the DRI3 extension.) Our implementation of DRI3 is based on KasmVNC's implementation, with only minor changes (mostly cosmetic, but I also used an Xorg linked list instead of a fixed array to track the DRI3 pixmaps).
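
For illustration, tracking the DRI3 pixmaps with an Xorg linked list (struct xorg_list from the xorg-server's list.h) rather than a fixed array looks roughly like the sketch below. The type and function names are invented for the example and are not TurboVNC's actual identifiers.

```c
/* Rough illustration of tracking DRI3 pixmaps with struct xorg_list instead
 * of a fixed array.  Names are invented; this is not TurboVNC's actual code. */
#include <stdlib.h>
#include <gbm.h>
#include "list.h"        /* struct xorg_list and the xorg_list_* helpers */
#include "pixmapstr.h"

typedef struct {
    struct xorg_list entry;
    PixmapPtr pixmap;
    struct gbm_bo *bo;
} VncDRI3PixmapRec;

static struct xorg_list dri3PixmapList;

void
vncDRI3TrackingInit(void)
{
    xorg_list_init(&dri3PixmapList);
}

void
vncDRI3TrackPixmap(PixmapPtr pixmap, struct gbm_bo *bo)
{
    VncDRI3PixmapRec *rec = calloc(1, sizeof(*rec));

    if (!rec)
        return;
    rec->pixmap = pixmap;
    rec->bo = bo;
    xorg_list_add(&rec->entry, &dri3PixmapList);   /* no fixed capacity */
}

void
vncDRI3UntrackPixmap(PixmapPtr pixmap)
{
    VncDRI3PixmapRec *rec, *tmp;

    /* _safe variant, because the entry is removed while iterating. */
    xorg_list_for_each_entry_safe(rec, tmp, &dri3PixmapList, entry) {
        if (rec->pixmap == pixmap) {
            xorg_list_del(&rec->entry);
            free(rec);
            return;
        }
    }
}
```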