H.264 support - Githubissues

dcommander commented 9 years ago

Referring to http://www.turbovnc.org/About/H264, H.264 isn't necessarily a good fit for VNC, because H.264 is designed for encoding video content with a fixed frame size, and VNC doesn't really have a concept of frames per se. When using frame-based applications (such as video players and 3D applications, i.e. VirtualGL) within VNC, the frames generated by those applications are usually translated into individual RFB framebuffer updates, but with other applications, that is usually not the case. When typing into a terminal window, for instance, only a few pixels may change as you are typing, but H.264 would require that minimally the entire window (if not the whole desktop) be re-encoded every time even the smallest change occurs. This is necessary because H.264 does its own interframe comparison, and thus each encode operation needs access to the entire frame so that it can determine which pixels have changed and generate the interframe-compressed video stream accordingly. For all intents and purposes, H.264 has a constant compression overhead. Every time you compress a frame, it takes about the same amount of CPU time, regardless of how many pixels have changed.

Furthermore, the x264 open source H.264 encoder is too slow to be viable for remote 3D in most cases. It would be necessary to use NvENC or another GPU-based encoding mechanism (AMD now has VCE, which is similar in concept) to achieve decent performance with H.264, but this creates other challenges. There are generally three approaches that might work for adding H.264 encoding functionality to TurboVNC with decent performance.

Treat the entire VNC desktop as a single H.264 stream As described above, the main issue inherent in this approach is that it is very inefficient for applications that only update small portions of the screen at a time. Even if only a few pixels have changed, the entire framebuffer has to be re-encoded so that the H.264 codec can detect those changes. Thus, it is likely that this approach would only really make sense if the frame rate was somehow limited. That is, one would want to specify that no more than, say, 30 frames/second would be encoded, so that a lot of small updates could be coalesced into larger updates. The existing deferred update timer mechanism in VNC might provide a means of accomplishing this, although past experience has proven that mechanism to be problematic in terms of performance (due to the fact that the server is single-threaded and due to how it processes updates for the viewers-- really long story.) The deferred update timer mechanism would likely have to be revisited if it were to serve as a sort of frame rate governor.

The other problem is that TurboVNC's framebuffer sits in main memory. We would ideally want to avoid transmitting the 3D pixels over the bus three times (drawing the uncompressed pixels from VirtualGL into Xvnc, uploading the uncompressed pixels back to the GPU for compression, and downloading the compressed H.264 stream back into main memory for transmission.) One way around that might be simply to store the VNC framebuffer in GPU memory. If VirtualGL knew about the framebuffer, it could draw the 3D pixels directly into it (notifying VirtualGL of the framebuffer's existence might be accomplished through the use of an X root window property.) It is unclear, however, how or if this GPU-based framebuffer would work with regular X drawing operations. In all likelihood, the X server would need to lock the GPU memory region prior to drawing and unlock it afterwards, which is likely to create a great deal of overhead, not to mention the overhead associated with moving the fine-grained X data back and forth to and from GPU memory (it goes without saying that X will do a lot of read-modify-write operations.) It is also unclear whether this might create GPU contention with 3D applications or, with a large number of users, whether it might exhaust GPU memory. Thinking out loud, though, doesn't DRI already accomplish something similar to this, except using a visible as opposed to an offscreen framebuffer on the GPU?

Another approach would be to keep the VNC framebuffer in main memory but to create a YUV "holding buffer" on the GPU. In this case, only the updated rectangles would be sent down to the GPU. VirtualGL could also generate YUV images and write them directly into the holding buffer, but again, it would have to know of the existence of that buffer. Also, if X needed to access pixels generated by OpenGL, it would have to inform VirtualGL of that fact so that VirtualGL could draw a copy of the frame using the X11 Transport.
Pre-encode the H.264 stream in VirtualGL This makes more sense, since VirtualGL is generating a frame-based image stream that is a more natural fit for a video codec such as H.264. This eliminates the issues of efficiency and the need to encode at a fixed frame rate, because it is assumed that-- at least with double-buffered OpenGL applications-- each frame sent through VirtualGL will share few pixels with the previous frame (but, in some cases, the differences will be within the scope of H.264's predictive abilities.)

Encoding the video stream is easy, because the pixels are already on the GPU. VirtualGL would simply encode them using NvENC or similar and transmit the H.264 stream directly from GPU memory. But that's where things get dicey. How would we transmit the stream through the TurboVNC Server and to the client? We could implement some sort of "compressed PutImage extension", whereby the compressed stream could be passed through unmodified by the TurboVNC Server and decompressed by the viewer, but this introduces all-new problems:
- How would we handle window overlapping? The hypothetical compressed PutImage extension would have to communicate the structure of an overlapped window image back to VirtualGL so that it could break the image down into component rectangles and send each separately, or perhaps we could just make the simplifying assumption that, if the window is overlapped or obscured, H.264 will be temporarily disabled for that window.
- How would we handle combined OpenGL and X operations? An OpenGL/X11 application is well within its rights to request a copy of the OpenGL-rendered pixels using X11 functions (XGetImage(), for instance), assuming that it has called glXWaitGL() to ensure that the pixels have been delivered to the X server. So how would the TurboVNC Server accommodate that request if the OpenGL-rendered pixels are being passed through as an encoded H.264 stream. It would have to keep a copy of the video stream all the way back to the last I-frame, or it would have to somehow notify VirtualGL that it needs an uncompressed copy of the current frame (problematic, since VirtualGL may not have it anymore), or VirtualGL would have to deliver two copies of the frame to the X server-- one compressed and one uncompressed. Delivering two copies isn't a huge deal, since the current VirtualGL/TurboVNC solution already delivers an uncompressed version of the frame. The increase in bus usage would be only incremental, due to the addition of delivering the compressed H.264 version of the frame.
- How would we ensure that the viewer can handle H.264? This is normally negotiated by the VNC server, and since it has access to all of the framebuffer pixels, it can (and does) encode those pixels differently for each viewer. Pre-encoding in VirtualGL would require passing some information about the connected viewers back to VirtualGL, so it could determine whether it is OK to use H.264. Also, since the viewer can request a new copy of any region of the remote desktop at any time (and does, via the Lossless Refresh and Refresh features), the server would have to always maintain a copy of the H.264 pixels in uncompressed form, per above. Maintaining a copy of the H.264 pixels in uncompressed form would also be necessary in order to support Automatic Lossless Refresh.
Deferred readback and encoding (PBO PutImage extension) VirtualGL/virtualgl#9 proposes a mechanism for deferring the readback of OpenGL pixels so that readback doesn't occur in the rendering thread but instead occurs in the image transport-- basically the pixels would be copied into a PBO instead of read back, and the image transport would access the PBO and compress the pixels directly from GPU memory (or perhaps even compress the pixels with the GPU.)

This could be taken one step further, and instead of the image transport compressing/transmitting the pixels, it could conceivably pass a PBO handle to TurboVNC using some as-yet-to-be-defined custom X extension. This extension could work similarly to MIT-SHM, except that it would allow a PBO handle to be passed in rather than a shared memory segment ID. The TurboVNC Server would then take that handle and copy the pixels into its own framebuffer as well as generate an H.264 frame from them. This would require only an incremental amount of additional bus traffic (the copying of uncompressed pixels from graphics memory into main memory is already occurring now. The only addition here would be copying the compressed H.264 frame from graphics memory into main memory.)

The main advantage this has over Approach 2 is that it allows the TurboVNC Server to handle the pixels as it sees fit, rather than enforcing a particular encoding scheme upon it. This greatly simplifies the implementation, since the server can decide whether it wants to use H.264 based on whether a particular viewer supports it, and it can decide to temporarily turn off H.264 and transfer only the unobscured rectangles from an obscured window, etc.

One concern here, however, is synchronization. VirtualGL cannot grant the TurboVNC Server access to the PBO for an unlimited period of time, since VirtualGL will only have a limited pool of PBOs (no more than 3) to work with. This is awkward at best, since the TurboVNC Server doesn't necessarily generate a framebuffer update immediately when VirtualGL draws a frame. The VNC server basically acts as another layer of frame spoiling, since it can coalesce multiple frames from VirtualGL into one framebuffer update as a result of the deferred update timer or as a result of the RFB flow control extensions (which prevent updates from being sent faster than the network or viewer can handle them.) At first glance, it might seem possible to make the proposed PBO PutImage extension asynchronous and thereby essentially treat the TurboVNC Server as VirtualGL's image transport thread. In other words, VirtualGL would, within the application rendering thread, use the PBO PutImage extension to request a free PBO from the pool, and TurboVNC would block on that request until a PBO is free, then VirtualGL would fill the PBO with pixels and send back another request notifying TurboVNC that the PBO is ready to transmit. However, that scheme is likely not possible due to the fact that the TurboVNC Server is single-threaded (as are all X servers.) It will probably be necessary for the TurboVNC Server to pre-compress the H.264 pixels within the body of the PBO PutImage function and then to just store those pixels in a holding buffer until the next RFB framebuffer update.

Approach #3 seems to be the most promising, but I suspect it would take hundreds of hours of labor to make it happen, and in the grand scheme of things, it may make more sense to wait for Wayland, since Wayland's architecture is much more conducive to the use of frame-based codecs such as H.264 (refer to TurboVNC/turbovnc#18) and probably GPU-based encoding as well. Furthermore, referring to the article on TurboVNC.org, H.264 doesn't necessarily benefit all types of applications. It is clear that it can benefit applications like video players, Google Earth, games, etc., but for ordinary CAD applications, the jury is still out.

There are additional challenges inherent with decoding the H.264 stream with reasonable performance. As with JPEG, it would likely be necessary to use some sort of H.264 decoder accessed through JNI in the Java viewers, or perhaps to leverage the built-in decoders on some GPUs (if available.)

dcommander commented 8 years ago

After discussing with nVidia, it seems that Proposal 3 above is possible, but PBO handles cannot be passed between processes. Assuming that VirtualGL is creating the PBO and shipping it to TurboVNC for compression, it will be necessary for VirtualGL to create a CUDA pointer from the PBO, create a CUDA IPC handle from the CUDA pointer, and ship the CUDA IPC handle to the X server (via some as-yet-to-be-defined X extension, probably one that is not "official" but just used within VirtualGL and TurboVNC.) TurboVNC would then invoke NvENC on the memory region pointed to by the CUDA IPC handle.

dcommander commented 7 years ago

I'm currently working, on a somewhat low priority, on adding NVENC support to the TurboVNC Benchmark Tools so I can get an idea of how the "naive" approach-- simply invoking NVENC from within TurboVNC on the entire framebuffer-- might perform. Working with NVENC is somewhat difficult, given that the convenience classes for it aren't open source. Thus, it's necessary to access the low-level functions directly. I haven't been able to return to that effort in recent months due to other pressing concerns, but I hope to get back to it before the end of the year.

If the naive approach seems like it's worth pursuing, then it may be possible to get that into TurboVNC 2.2. If the naive approach isn't worth pursuing, then in all likelihood, this will get punted until a TurboVNC Wayland compositor is developed, unless an organization steps up to fund the necessary architectural modifications to VGL and TurboVNC (likely to be a rather large and expensive effort.)

T-vK commented 5 years ago

I think H.264 would be a very valuable feature. Using Nvidia's GameStream or whatever it's called I was able to stream my desktop to my Nvidia shield tablet with virtually no delay. Before that I tried Windows Remote Destop, TeamViewer and tons of VNC clients and they all introdued significantly higher latencies which made steaming fps games and such completely unenjoyable. Audio/video sync was also a major issue for me because my speakers were still hooked up directly to my PC. With Nvidia GameStream all these problems went away.

I would imagine performance-wise this would only make sense if you have the hardware encoders/decoders on both ends, but I think Nvidia has started added it to all their GPUs starting 4 years ago and AMD apparently has their own version now.

It would be really amazing to be able to do that on Linux as well.

Btw have you made any progress?

dcommander commented 5 years ago

Unfortunately not much. As an independent open source developer who makes money only through support and funded development contracts, I’m constantly having to formulate strategies for the product that will maintain its high quality and usability standards and create meaningful improvements for users while still driving business my way in the form of funded development of new features. As such, I have to be careful not to give away so much free milk that no one wants to buy the cow. I committed long ago to do the initial analysis on the H.264 project using General Fund money, but I haven’t managed to find the time to get back to it. Unfortunately some corporate players have, in recent years, started aggressively trying to take customers away from me by cutting into the market space that I busted my butt to build, so now I’m busting my butt yet again to reposition TurboVNC 3.0 so it can compete with these corporate players. I hope to get back to more long-term pursuits like this and Wayland support, etc., once I’m out of survival mode.

dcommander commented 5 years ago

Also, you mention the latency problems with other VNC solutions, but TurboVNC has specifically implemented RFB extensions that should improve that situation. Have you tried TurboVNC as it exists today?

T-vK commented 5 years ago

I don't think I've tried TurboVNC yet. Back then, I was on Windows and now I'm on Fedora... which means I'm on Wayland. :/

dcommander commented 5 years ago

It means you're on Wayland for the root display, unless you configure it otherwise (Fedora still supports the traditional mode, just not by default), but that doesn't prevent you from using an X proxy such as TurboVNC, since it is not dependent on the root display.

mdevaev commented 3 years ago

Hi. I am working on a Raspberry based hardware IP KVM project and I have written my own VNC server implementing JPEG compression. Now I implement H264 and have already made a server for video encoding with low latency, it remains only to wrap the H264 frames to VNC protocol. What is the current situation with H264 in the TurboVNC client side? I would be happy if this client could support H264 with my project.

dcommander commented 3 years ago

@mdevaev H.264 has not been implemented yet in TurboVNC and won't make it into the upcoming 3.0 release. Which specification did you use for RFB-encoding the H.264 framebuffer updates? I wasn't able to find a definitive specification for an H.264 RFB encoding type.

mdevaev commented 3 years ago

At the moment, none. It seems that we (if you decide to support it) can use whatever we want. I would start from the registered encoding with code 20. But maybe it would be great to have some kind of service pseudo-encodings to regulate the bitrate like for the Tight JPEG.

Before implementing the support myself, I decided to investigate the issue and ask you about your plans.

tian2992 commented 6 months ago

Wonder, how is this supported at present point, 3.1.1? from my understanding it would be usable, yet not enabled by default..?

dcommander commented 6 months ago

@tian2992 It isn't. The feature is problematic in a VNC environment, for reasons mentioned above. Either we would have to implement it using a slow software-only library, such as libx264, or we would have to implement it using a proprietary GPU-based library such as NVENC. The latter is legally problematic because of GPL compatibility, and it is technically problematic because of the need to transport pixels from the TurboVNC framebuffer (in main memory) to the GPU for encoding and then transport the encoded frames back to main memory for transmission to clients. (This is even more problematic when using VirtualGL, since the pixels would have already been transported from the GPU to main memory by the time TurboVNC encodes them.) tl;dr: H.264 is too slow without GPU-based compression, but GPU-based compression would require either a GPU-resident framebuffer (which will be possible with Wayland but isn't possible with Xvnc) or a complicated X11 pass-through mechanism for the GPU-encoded frames. Thus, this feature isn't likely to land until/unless TurboVNC supports Wayland, which has its own set of difficulties. (See #18.) Referring to https://turbovnc.org/About/H264, there are only certain types of workloads that benefit from H.264 (relative to the current TurboVNC encoder.) Pure JPEG encoding, particularly when combined with progressive JPEG (which has a better compression ratio than baseline JPEG) would probably give us most of the advantages of H.264 for those workloads without the headache. (See #376.)

TurboVNC / turbovnc

H.264 support #19