TurboVNC / turbovnc

Main TurboVNC repository
https://TurboVNC.org
GNU General Public License v2.0

TurboVNC on Raspberry PI3, questions #278

Closed martin19 closed 3 years ago

martin19 commented 3 years ago

Hi @dcommander,

I'm looking for a good solution to remotely access my Raspberry Pi in my home network. After trying out several solutions, including various VNC variants, NoMachine and RDP, I came across TurboVNC, which looked promising.

I'm a bit spoiled about what remote access is capable of today, having used Google's Stadia for a while, which delivers a constant 30+ fps in Full HD over a wireless internet connection to my Windows PC. I gave TurboVNC a try, compiled it on my Pi, and tried it with several settings. I can happily confirm that this VNC solution outdoes the others I've tried in snappiness, which is what matters most to me. Image quality varies across the different JPEG settings, of course, but I could live with that.

However, I have a feeling that the result is not yet the best this setup can achieve, so I wanted to ask for your opinion. (I'm sorry if GitHub is not the place for discussion; please close this issue and redirect me to a more suitable place if you like.)

My setup is this:

Please have a look at this video I've recorded (sorry for the video quality, I didn't want to spoil the performance with screen recording software):

https://youtu.be/8DiJbkAVOKM

I've carried out some tests here:

The setting is medium quality in the TurboVNC client. You can see here that the UI is surprisingly responsive in most cases; however, when large areas of the screen change quickly (e.g. when moving a large window around), the UI even hangs for a while, which is a bad experience. The menu is very responsive and good to use, so I feel the lag could somehow be reduced to that level.

My questions are: Where do you think the bottleneck behind this lag is, and what could be done to resolve it?

If you own a Raspberry Pi, you'll know that the desktop is very snappy when accessed directly; I'd say that for the test case above everything would run at 30+ fps (maybe even 60). The CPU meter does not show much CPU usage. My wireless connection is usually stable and handles around 20 Mbit/s, I'd guess. In another test, I sent an MJPEG stream (1920x1080) from the Pi to my PC with VideoLAN, which worked at 30 fps, so I guess the bandwidth is there after all.

Please let me know what you think and what can be done to improve the performance of this setup. I've found out that there is a hardware-accelerated JPEG encoder in the Pi (MMAL) which could probably be integrated into TurboVNC; however, I'm not at all sure whether this would improve things.

Thank you, Martin


Additional note: I did some actual measurements, and the problem seems to be CPU-bound. When moving windows around, the CPU meter sits between 25% and 30%, which probably means that one core is reaching its limit. The network bandwidth, at around 1.5 MiB/s, seems to be sufficient for the task. Maybe this is an argument for MMAL or multithreading support? Since CopyRect encoding is enabled, I still wonder why moving windows costs so much.

dcommander commented 3 years ago

The TurboVNC Server already has multithreading support, and it is enabled by default on multi-core systems. When you connect to the server, it will log (in ~/.vnc/{host}-{display}.log) the number of threads it is using, so the first thing I would check is whether it is using the same number of threads as there are cores on the Raspberry Pi host.
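The check described above can be sketched in a few lines of Python. This is only an illustration: the helper name `tight_thread_count` is made up for this example, the log path is passed in by the caller, and the regular expression assumes the message has the form "Using N threads for Tight encoding" (the singular "thread" variant is a guess).

```python
import os
import re

# Assumed wording of the server's log message; "threads?" also tolerates a
# singular form, which is a guess rather than something documented.
THREADS_RE = re.compile(r"Using (\d+) threads? for Tight encoding")


def tight_thread_count(log_text):
    """Return the thread count from the most recent matching log line, or None."""
    matches = THREADS_RE.findall(log_text)
    return int(matches[-1]) if matches else None


def check_log(log_path):
    """Compare the logged thread count with the number of cores Python sees."""
    with open(log_path, encoding="utf-8", errors="replace") as f:
        threads = tight_thread_count(f.read())
    cores = os.cpu_count()
    if threads is None:
        print("No thread count logged yet; connect a viewer first.")
    elif threads == cores:
        print(f"{threads} encoding threads, matching {cores} cores.")
    else:
        print(f"{threads} encoding threads, but {cores} cores are available.")
```

Call `check_log()` with the path of the session log after a viewer has connected at least once, since the line is only written when a client connects.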

The second thing I would check is libjpeg-turbo. I would strongly suggest building the libjpeg-turbo 2.1 beta1 release from source using a relatively recent version of Clang. libjpeg-turbo 2.1 has much better performance than libjpeg-turbo 2.0.x on Arm hardware, particularly 32-bit Arm hardware. You didn't mention which O/S you are running on the Raspberry Pi host, but if you are using the system-supplied version of libjpeg-turbo, it is probably an older version with poor Arm SIMD support. A lot of the libjpeg-turbo algorithms were not SIMD-accelerated on AArch32 systems until libjpeg-turbo v2.1.x.

martin19 commented 3 years ago

Hi, thanks for your response. I've checked the logs and found that multithreading is using 4 threads. My interpretation of the CPU meter was incorrect: all 4 cores are actually used when moving the (large) window, I suppose largely by Xvnc. It looks as if an individual core gets maxed out (100%) while the other cores stay at a relatively low load (~20%) when the window glitches or gets stuck, but it's hard to tell.

I built libjpeg-turbo 2.0.6 on "Linux pi 4.9.35-v7+ #1014 SMP Fri Jun 30 14:47:43 BST 2017 armv7l GNU/Linux". So I think I'll update the system and give the beta a try.
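Since a desktop CPU meter makes it hard to tell whether a single core is saturated, one option is to sample the per-core counters in /proc/stat directly. The following is a rough sketch that assumes a Linux host and the standard /proc/stat field order; the function names are made up for this example.

```python
import time


def read_core_times(stat_text):
    """Parse /proc/stat text into {core: (busy_ticks, total_ticks)}."""
    cores = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep only per-core lines ("cpu0", "cpu1", ...), not the "cpu" summary.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user nice system idle iowait irq softirq steal; the guest fields that
        # may follow are already included in user/nice, so they are ignored.
        ticks = [int(v) for v in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        total = sum(ticks)
        cores[fields[0]] = (total - idle, total)
    return cores


def core_utilization(before, after):
    """Return the busy percentage per core between two samples."""
    usage = {}
    for name, (busy1, total1) in after.items():
        busy0, total0 = before[name]
        elapsed = total1 - total0
        usage[name] = 100.0 * (busy1 - busy0) / elapsed if elapsed else 0.0
    return usage


def sample_proc_stat(interval=1.0):
    """Measure per-core utilization over `interval` seconds (Linux only)."""
    with open("/proc/stat") as f:
        before = read_core_times(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = read_core_times(f.read())
    return core_utilization(before, after)
```

Running `sample_proc_stat()` in a loop while dragging a large window would show whether one core sits near 100% while the others stay around 20%.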

Here is the log:

TurboVNC Server (Xvnc) 32-bit v2.2.6 (build 20210316)
Copyright (C) 1999-2021 The VirtualGL Project and many others (see README.txt)
Visit http://www.TurboVNC.org for more information on TurboVNC

17/03/2021 09:01:31 Enabled security type 'vnc'
17/03/2021 09:01:31 Enabled security type 'otp'
17/03/2021 09:01:31 Enabled security type 'unixlogin'
17/03/2021 09:01:31 Enabled security type 'plain'
17/03/2021 09:01:31 Desktop name 'TurboVNC: pi:1 (pi)' (pi:1)
17/03/2021 09:01:31 Protocol versions supported: 3.3, 3.7, 3.8, 3.7t, 3.8t
17/03/2021 09:01:31 Listening for VNC connections on TCP port 5901
17/03/2021 09:01:31   Interface 192.168.0.202
17/03/2021 09:01:31 Listening for HTTP connections on TCP port 5801
17/03/2021 09:01:31   URL http://pi:5801
17/03/2021 09:01:31   Interface 192.168.0.202
17/03/2021 09:01:31 Framebuffer: BGRX 8/8/8/8
17/03/2021 09:01:31 New desktop size: 1920 x 1080
17/03/2021 09:01:31 New screen layout:
17/03/2021 09:01:31   0x00000040 (output 0x00000040): 1920x1080+0+0
17/03/2021 09:01:31 Maximum clipboard transfer size: 1048576 bytes
17/03/2021 09:01:31 VNC extension running!

17/03/2021 09:01:58 Got connection from client 192.168.0.80
17/03/2021 09:01:58 Using protocol version 3.8
17/03/2021 09:01:58 Enabling TightVNC protocol extensions
17/03/2021 09:01:58 Advertising Tight auth cap 'VNCAUTH_'
17/03/2021 09:01:58 Advertising Tight auth cap 'ULGNAUTH'
17/03/2021 09:01:58 Advertising Tight auth cap 'VENCRYPT'
17/03/2021 09:02:02 Full-control authentication enabled for 192.168.0.80
17/03/2021 09:02:02 Pixel format for client 192.168.0.80:
17/03/2021 09:02:02   32 bpp, depth 24, little endian
17/03/2021 09:02:02   true colour: max r 255 g 255 b 255, shift r 16 g 8 b 0
17/03/2021 09:02:02   no translation needed
17/03/2021 09:02:02 Using tight encoding for client 192.168.0.80
17/03/2021 09:02:02 Interframe comparison enabled
17/03/2021 09:02:02 Enabling full-color cursor updates for client 192.168.0.80
17/03/2021 09:02:02 Enabling cursor position updates for client 192.168.0.80
17/03/2021 09:02:02 Using JPEG subsampling 0, Q92 for client 192.168.0.80
17/03/2021 09:02:02 Using JPEG quality 80 for client 192.168.0.80
17/03/2021 09:02:02 Using JPEG subsampling 2 for client 192.168.0.80
17/03/2021 09:02:02 Enabling LastRect protocol extension for client 192.168.0.80
17/03/2021 09:02:02 Enabling Desktop Size protocol extension for client 192.168.0.80
17/03/2021 09:02:02 Enabling Extended Desktop Size protocol extension for client 192.168.0.80
17/03/2021 09:02:02 Enabling Continuous Updates protocol extension for client 192.168.0.80
17/03/2021 09:02:02 Enabling Fence protocol extension for client 192.168.0.80
17/03/2021 09:02:02 Enabling GII protocol extension for client 192.168.0.80
17/03/2021 09:02:02 Using Tight compression level 1 for client 192.168.0.80
17/03/2021 09:02:02 Using 4 threads for Tight encoding
17/03/2021 09:02:13 Client supports GII version 1
17/03/2021 09:02:13 Continuous updates enabled
17/03/2021 09:03:18 Client 192.168.0.80 gone
17/03/2021 09:03:18 Statistics:
17/03/2021 09:03:18   key events received 37, pointer events 1485
17/03/2021 09:03:18   framebuffer updates 580, rectangles 5549, bytes 5212
17/03/2021 09:03:18     LastRect markers 306, bytes 3672
17/03/2021 09:03:18     cursor shape updates 30, bytes 60706
17/03/2021 09:03:18     cursor position updates 1, bytes 12
17/03/2021 09:03:18     Tight rectangles 5212, bytes 4053876
17/03/2021 09:03:18   raw equivalent 228.520700 Mbytes, compression ratio 46.032490

dcommander commented 3 years ago

Yeah, depending on the compiler, I observed generally 30-70% better 32-bit compression performance on a quad-core Cortex-A53 with libjpeg-turbo 2.1 beta1 relative to libjpeg-turbo 2.0.6. It should be noticeable. If you are still experiencing a lag after upgrading libjpeg-turbo, then try reducing the thread count by passing -nthreads 2 or -nthreads 1 to /opt/TurboVNC/bin/vncserver. The Cortex-A53 that RPI3 uses has an in-order instruction pipeline, and depending on other aspects of the system architecture (particularly the cache design, memory bandwidth, and bus bandwidth), it's possible that it may perform better with fewer threads.
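To compare thread counts systematically, the server command line for each trial can be generated rather than typed by hand. This small sketch only builds and prints the commands; it assumes the default install path mentioned above and that `-nthreads` is the only option being varied. The helper name is hypothetical.

```python
import shlex


def vncserver_command(nthreads, vncserver="/opt/TurboVNC/bin/vncserver"):
    """Build the argument list for starting a session with a fixed thread count."""
    if nthreads < 1:
        raise ValueError("nthreads must be at least 1")
    return [vncserver, "-nthreads", str(nthreads)]


# Print one command per trial; run each by hand (or via subprocess.run) and
# repeat the same window-dragging test against every session.
for n in (4, 2, 1):
    print(shlex.join(vncserver_command(n)))
```

Keeping everything else identical between trials (viewer settings, window size, the motion being tested) makes the comparison more meaningful.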

martin19 commented 3 years ago

After two dist-upgrades (jessie -> stretch -> buster), the version of TurboVNC I compiled before runs significantly smoother. The workload seems to be distributed across the cores much better; there are fewer spikes on individual cores. I rebuilt libjpeg-turbo 2.0.90 using Clang 9.0 and rebuilt the server with it, but didn't notice any significant change. Going down to -nthreads 1 seems to work just as well as (maybe even slightly better than) running without the parameter. All in all, the experience is far less choppy and I'd call it "ok to work with".

Can you estimate how much of the processing workload can be attributed to JPEG compression in TurboVNC? I wonder how much the "rest" of the compression algorithm(s) takes.

dcommander commented 3 years ago

The answer to that is complicated, and if you really want to dig into it, this article is a good place to start. The TurboVNC encoder is a variant of the TightVNC encoder, so whenever it receives a new rectangle to encode, the first thing it does is analyze the rectangle. If the rectangle contains significant subrectangles that are all one color, then those subrectangles are encoded as a bounding box and fill color. For the remaining subrectangles, the encoder counts the number of unique colors in each. If the number of unique colors is two, then the subrectangle is encoded using a 1-bit bitmap and a 2-deep color palette, then zlib-compressed. If the number of unique colors is greater than 2 but less than the "palette threshold" (24 for Compression Levels 1 and 6, 96 for Compression Levels 2 and 7, 256 for Compression Level 9), then the subrectangle is encoded as an 8-bit bitmap and an N-deep color palette, then zlib-compressed. If the number of unique colors is greater than the palette threshold, then the subrectangle is compressed using libjpeg-turbo.
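The per-subrectangle decision described above can be illustrated with a short Python sketch. It is not the TurboVNC source: the function name and the returned labels are invented for this example, pixels are modelled as plain integers, and a color count exactly equal to the palette threshold is assumed to still use the palette path, since the description leaves that boundary open.

```python
# Palette thresholds per compression level, as quoted in the comment above.
PALETTE_THRESHOLD = {1: 24, 6: 24, 2: 96, 7: 96, 9: 256}


def choose_subencoding(pixels, compression_level=1):
    """Pick a subencoding label for one subrectangle from its unique colors."""
    if not pixels:
        raise ValueError("subrectangle has no pixels")
    threshold = PALETTE_THRESHOLD[compression_level]
    unique = set()
    for pixel in pixels:
        unique.add(pixel)
        # Stop counting as soon as the subrectangle is known to be high-color.
        if len(unique) > threshold:
            return "jpeg"
    if len(unique) == 1:
        return "solid fill"
    if len(unique) == 2:
        return "1-bit bitmap + 2-color palette + zlib"
    return "8-bit bitmap + N-color palette + zlib"


# A terminal-like region (two colors) versus a photo-like region (many colors).
print(choose_subencoding([0x000000, 0xFFFFFF] * 100))
print(choose_subencoding(list(range(1000))))
```

The sketch also shows why the JPEG share of the work depends on the content: text-heavy screens mostly take the palette and zlib paths, while photographic or 3D content falls through to JPEG.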

In other words, the amount of processing that can be attributed to JPEG compression is highly variable and depends on the type of application that is being used in the TurboVNC session.

The TurboVNC User's Guide describes the various compression levels in detail. In general terms:

Bear in mind that the TurboVNC encoder was designed in 2008, when there were a lot more low-color applications afoot and a lot more applications that used raw X11 primitives. These days, most X11 applications render into an image buffer and draw the image to the screen, so they tend to benefit more from interframe comparison and JPEG than the applications of old.

A lot of Arm development boards, including the Rock960 that I personally use, contain four in-order cores (e.g. Cortex-A53) and two out-of-order cores (e.g. Cortex-A57 or Cortex-A72). The RPI3 only contains a 1.2 GHz quad-core Cortex-A53, so it is underpowered from TurboVNC's point of view. It is entirely possible that either libjpeg-turbo or zlib or both are the primary bottlenecks. Several things you can try to reduce server CPU usage:

I would suggest playing around with those various compression settings in the TurboVNC Viewer and using the built-in profiling feature to determine which are the most beneficial. You're mostly in uncharted territory here, since the TurboVNC Server is designed more for high-spec server systems.

dcommander commented 3 years ago

If your wi-fi is really fast, you might also try disabling compression altogether, by using the "Lossless Tight" preset mode. (This disables JPEG and uses Compression Level 0, which is a special mode that bypasses zlib.) That mode was specifically designed for using underpowered server CPUs (even more specifically, old SPARC CPUs that we had to support during the years that I developed TurboVNC as a Sun Microsystems product) over gigabit networks.
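A back-of-the-envelope calculation shows how fast "really fast" needs to be. The sketch below estimates how many full-screen updates per second a link could carry if every pixel were sent uncompressed. It assumes 4 bytes per pixel (from the 32 bpp client pixel format in the log earlier in this thread) and ignores protocol overhead; the actual on-the-wire size in this mode may differ, and partial updates cost proportionally less.

```python
def full_frame_rate(link_bits_per_s, width=1920, height=1080, bytes_per_pixel=4):
    """Upper bound on uncompressed full-screen updates per second for a link."""
    frame_bytes = width * height * bytes_per_pixel
    return (link_bits_per_s / 8) / frame_bytes


# The 20 Mbit/s figure is the Wi-Fi estimate from the original post.
for label, bits in [("20 Mbit/s Wi-Fi", 20e6), ("gigabit Ethernet", 1e9)]:
    print(f"{label}: {full_frame_rate(bits):.1f} full frames per second")
```

Under these assumptions, a 20 Mbit/s link carries roughly 0.3 full 1920x1080 frames per second and a gigabit link about 15, which is why this mode targets gigabit networks.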