TurboVNC / turbovnc

Direct GPU acceleration #373

Closed: dcommander closed this issue 10 months ago

dcommander commented 11 months ago

I wanted to create this issue to document my findings vis-à-vis adding GPU acceleration directly to TurboVNC, thus eliminating the need for VirtualGL. https://github.com/kasmtech/KasmVNC/commit/d04982125a04962ca4a6d9829b0cdad5793db324 implements DRI3 in KasmVNC, which ostensibly adds GPU acceleration when using open source GPU drivers. It was straightforward to port that code into TurboVNC (although it was necessary to build with TVNC_SYSTEMX11=1). As of this writing, there are still some major bugs in the feature (https://github.com/kasmtech/KasmVNC/issues/146), so I am not yet prepared to declare the problem solved, but I have high hopes that Kasm will iron out those issues. If they do, then TurboVNC will be able to provide GPU acceleration, without VirtualGL, when using open source GPU drivers. However, I don't think it will ever be possible to do likewise with nVidia's proprietary drivers, at least not as long as those drivers retain their current architecture.
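
For context, a DRI3 back end in an X server is registered by filling in a dri3_screen_info_rec and calling dri3_screen_init() (xorg-server's dri3.h). The sketch below is only a rough illustration of those entry points, not KasmVNC's or TurboVNC's actual code; the callback signatures are written from memory and may differ slightly between xorg-server versions, and the function names and render-node path are placeholders.

```c
/* Rough illustration of DRI3 back-end registration against the xorg-server
 * SDK (dri3.h).  Not KasmVNC's or TurboVNC's actual code; names such as
 * vncDRI3Open/vncDRI3PixmapFromFD and the render-node path are placeholders. */
#include <fcntl.h>
#include <unistd.h>
#include "scrnintstr.h"
#include "pixmapstr.h"
#include "randrstr.h"
#include "dri3.h"

/* DRI3Open: give the client a file descriptor for the GPU's render node. */
static int
vncDRI3Open(ScreenPtr screen, RRProviderPtr provider, int *fd)
{
    /* A real implementation would open the node belonging to the GPU that
     * the server itself opened with GBM, not a hard-coded path. */
    *fd = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);
    return (*fd >= 0) ? Success : BadAlloc;
}

/* DRI3PixmapFromFD: wrap a client-supplied dma-buf and back it with a
 * system-memory pixmap that must later be kept in sync with the GPU-side
 * buffer (the approach described above). */
static PixmapPtr
vncDRI3PixmapFromFD(ScreenPtr screen, int fd, CARD16 width, CARD16 height,
                    CARD16 stride, CARD8 depth, CARD8 bpp)
{
    PixmapPtr pixmap = screen->CreatePixmap(screen, width, height, depth, 0);

    /* ... import fd into a GBM buffer object (gbm_bo_import()) and remember
     * the (pixmap, buffer object) pair for later synchronization ... */
    return pixmap;
}

static const dri3_screen_info_rec vncDRI3Info = {
    .version = 1,
    .open = vncDRI3Open,
    .pixmap_from_fd = vncDRI3PixmapFromFD,
    /* .fd_from_pixmap would export a pixmap back to the client. */
};

Bool
vncDRI3Init(ScreenPtr screen)
{
    return dri3_screen_init(screen, &vncDRI3Info);
}
```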

To the best of my understanding (please correct any mistaken assertions I make below):

I certainly don't claim that my knowledge is complete or final, but to the best of my current understanding, implementing direct GPU acceleration in Xvnc when using nVidia's proprietary drivers will not be possible. VirtualGL will still be necessary with those drivers. I am certainly open to being proven wrong.

dcommander commented 10 months ago

Kasm worked around the issue in their DRI3 implementation, but the workaround is problematic. The basic problem is that their DRI3 implementation creates pixmaps in system memory and maintains a GBM buffer object (in GPU memory) for each, so it has to synchronize the pixels between system memory and GPU memory whenever either the buffer object or the pixmap changes. (NOTE: VirtualGL's implementation of GLX_EXT_texture_from_pixmap has to do that as well, albeit on a more coarse-grained level.)

It is straightforward to figure out when a buffer object should be synchronized to its corresponding DRI3-managed pixmap, because Xvnc hooks into the X11 operations that read from the pixmap (always the Composite() or CopyArea() screen methods, in my testing). However, it is not straightforward to figure out when a DRI3-managed pixmap should be synchronized to its corresponding buffer object, because the buffer object seems to be read outside of X11. That is consistent with my own experience of how direct rendering works. It bypasses X11, which is one reason why screen scrapers don't work with GPU-accelerated 3D applications unless you scrape the screen on a timer. That is basically what Kasm's DRI3 implementation does. It maintains a list of active buffer objects and synchronizes all of them with their corresponding DRI3-managed pixmaps 60 times/second, regardless of whether the pixmaps have actually changed. As you can imagine, this creates a significant amount of performance overhead, and I am skeptical of whether it is free from compatibility issues.

Irrespective of the aforementioned timer, DRI3 is capped to the screen refresh rate, which is 60 Hz in Xvnc. Thus, in my testing with the AMDGPU driver, the DRI3 implementation feels like VirtualGL if you set VGL_FPS=60 and VGL_SPOIL=0 (no frame spoiling and frame-rate-limited to 60 Hz), only less smooth. There is a noticeable lag between mouse interaction and rendering, even on a gigabit network (which is why frame spoiling exists in VGL).
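
To make the timer-driven synchronization concrete, here is a rough sketch of what it amounts to, assuming a fixed array of tracked (pixmap, buffer object) pairs. This is not KasmVNC's actual code; the type and variable names are invented, and the copy loop assumes 32-bits-per-pixel pixmaps backed by system memory (devPrivate.ptr).

```c
/* Rough sketch of the timer-driven synchronization described above, not
 * KasmVNC's actual code.  Every ~16 ms, each tracked GBM buffer object
 * (GPU memory) is copied into the system-memory pixmap that shadows it,
 * whether or not anything changed. */
#include <stdint.h>
#include <string.h>
#include <gbm.h>
#include "os.h"          /* TimerSet(), OsTimerPtr */
#include "pixmapstr.h"

typedef struct {
    PixmapPtr pixmap;      /* system-memory pixmap handed to X11 */
    struct gbm_bo *bo;     /* GPU-side buffer object imported via DRI3 */
} VncDRI3Pixmap;

#define MAX_DRI3_PIXMAPS 128                /* illustrative fixed capacity */
static VncDRI3Pixmap dri3Pixmaps[MAX_DRI3_PIXMAPS];
static int numDRI3Pixmaps;

static void
syncBOToPixmap(VncDRI3Pixmap *p)
{
    uint32_t width = p->pixmap->drawable.width;
    uint32_t height = p->pixmap->drawable.height;
    uint32_t stride;
    void *mapData = NULL;

    /* Map the buffer object for CPU reads (this can stall on the GPU). */
    uint8_t *src = gbm_bo_map(p->bo, 0, 0, width, height,
                              GBM_BO_TRANSFER_READ, &stride, &mapData);
    if (!src)
        return;

    uint8_t *dst = p->pixmap->devPrivate.ptr;
    for (uint32_t y = 0; y < height; y++)    /* assumes 32 bits per pixel */
        memcpy(dst + y * p->pixmap->devKind, src + y * stride, width * 4);

    gbm_bo_unmap(p->bo, mapData);
}

/* Fires ~60 times/second, regardless of whether the pixmaps changed.
 * Armed once at startup with:  TimerSet(NULL, 0, 16, vncDRI3SyncTimer, NULL); */
static CARD32
vncDRI3SyncTimer(OsTimerPtr timer, CARD32 now, void *arg)
{
    for (int i = 0; i < numDRI3Pixmaps; i++)
        syncBOToPixmap(&dri3Pixmaps[i]);
    return 16;                      /* reschedule in ~16 ms (~60 Hz) */
}
```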

I spent 20-30 uncompensated hours trying to improve the implementation but was unable to do so. To the best of my understanding, it would be necessary to store pixmaps in GPU memory in order to implement DRI3 cleanly. That would require storing the whole framebuffer in GPU memory, which virtual X servers such as Xvnc cannot do. Thus, at the moment, I do not think that this solution is appropriate for TurboVNC, since it has significant performance drawbacks relative to VirtualGL. I think that the limited resources of The VirtualGL Project are better spent improving the compatibility of VirtualGL's EGL back end or looking into a TurboVNC Wayland compositor, which could cleanly use GPU memory and potentially pass through GPU acceleration to Xwayland without the need to deal with any of this mess at the X11 level.

dcommander commented 4 months ago

I changed my mind and implemented this, since it provides a solution for using Vulkan with the AMDGPU drivers. (Whereas nVidia's Vulkan implementation does something VirtualGL-like when running in TurboVNC, AMD's implementation doesn't work without the DRI3 extension.) Our implementation of DRI3 is based on KasmVNC's implementation, with only minor changes (mostly cosmetic, but I also used an Xorg linked list instead of a fixed array to track the DRI3 pixmaps).
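
For illustration, tracking the DRI3 pixmaps with an Xorg linked list (struct xorg_list from the xorg-server's list.h) rather than a fixed array looks roughly like the sketch below. The type and function names are invented for the example and are not TurboVNC's actual identifiers.

```c
/* Rough illustration of tracking DRI3 pixmaps with struct xorg_list instead
 * of a fixed array.  Names are invented; this is not TurboVNC's actual code. */
#include <stdlib.h>
#include <gbm.h>
#include "list.h"        /* struct xorg_list and the xorg_list_* helpers */
#include "pixmapstr.h"

typedef struct {
    struct xorg_list entry;
    PixmapPtr pixmap;
    struct gbm_bo *bo;
} VncDRI3PixmapRec;

static struct xorg_list dri3PixmapList;

void
vncDRI3TrackingInit(void)
{
    xorg_list_init(&dri3PixmapList);
}

void
vncDRI3TrackPixmap(PixmapPtr pixmap, struct gbm_bo *bo)
{
    VncDRI3PixmapRec *rec = calloc(1, sizeof(*rec));

    if (!rec)
        return;
    rec->pixmap = pixmap;
    rec->bo = bo;
    xorg_list_add(&rec->entry, &dri3PixmapList);   /* no fixed capacity */
}

void
vncDRI3UntrackPixmap(PixmapPtr pixmap)
{
    VncDRI3PixmapRec *rec, *tmp;

    /* _safe variant, because the entry is removed while iterating. */
    xorg_list_for_each_entry_safe(rec, tmp, &dri3PixmapList, entry) {
        if (rec->pixmap == pixmap) {
            xorg_list_del(&rec->entry);
            free(rec);
            return;
        }
    }
}
```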