Open AAAPops opened 5 years ago
I have not had time to test this unfortunately. Do you have access to other ARM devices to see if this is a general issue?
Other distributions might be worth testing as well. Perhaps a proprietary driver is needed for decent performance?
My first guess would be that it is because we use XRENDER for the graphics now, and that is really slow on that board. That would unfortunately mean that we probably can't do anything about it. But let's try to confirm that is the case...
I get the same X server behavior (~100% CPU) on an x86 platform. All you need to do is change the standard X server video driver (RADEON on my notebook) to the FBDEV driver, which is a common solution for ARM boards.
As I wrote earlier, "vncviewer ver.1.7.8" works perfectly with the FBDEV driver. Is it possible to bring back the old performance for weak thin clients?
Not really, no. The XRENDER operations are needed to do the layout we want. There might be some tweaks to reduce the load a bit, but nothing major. I'm afraid such weak devices are not something we focus on. :/
I tried quickly here using Xephyr, but couldn't get any real load on it. Could it be a combination with your desktop environment? Have you tried any other one? E.g. compositing or OpenGL might be problematic for such a device.
In both cases (ARM and x86) I use a different DE (MATE and xfce) that doesn't require any 3D acceleration, because the FBDEV driver does not support it.
My ARM and x86 devices are not really "weak"; the only disadvantage is the absence of 3D driver support! I guess it's quite a common situation -)
Both of those use compositing though, which tends to push the X server quite hard. Could you check that it is disabled? It is usually under some advanced settings for the window manager.
First of all I have to clarify my installations: ARM + Fluxbox and x86 + MATE. Fluxbox doesn't use compositing at all, and I disabled compositing in MATE manually.
In both cases the X server is still overloaded during a VNC session.
Not sure what's going on then. Do you think you could do some tracing of the X server using perf?
Sure I can. Any specific options, or should I make a first attempt myself?
Experiment and see if you can find something. You may need to install debug packages to get decent symbol names though.
I've gathered some perf data.
This effect is clearly visible on an average VM with qxl graphics. Xorg's CPU usage is several times higher with tigervnc from master than from branch-1.7.
[root@archdev ~]# perf report -i perf.data_qxl_17 | grep -v ^# | head -n5
46.62% Xorg [kernel.kallsyms] [k] qxl_image_init
32.80% Xorg libc-2.29.so [.] __memmove_avx_unaligned_erms
3.86% Xorg [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.29% Xorg [kernel.kallsyms] [k] find_next_iomem_res
1.29% Xorg [kernel.kallsyms] [k] preempt_count_add
[root@archdev ~]# perf report -i perf.data_qxl_master | grep -v ^# | head -n5
47.22% Xorg libpixman-1.so.0.38.4 [.] sse2_blt.part.0.lto_priv.0
25.91% Xorg libc-2.29.so [.] __memmove_avx_unaligned_erms
14.53% Xorg [kernel.kallsyms] [k] qxl_image_init
2.30% Xorg [kernel.kallsyms] [k] __softirqentry_text_start
1.45% Xorg [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
With the fbdev driver the effect is very small, but the call into libpixman is still present. Perf data for the record:
[root@archdev ~]# perf report -i perf.data_master | grep -v ^# | head -n5
42.74% Xorg libpixman-1.so.0.38.4 [.] sse2_blt.part.0.lto_priv.0
32.93% Xorg libc-2.29.so [.] __memmove_avx_unaligned_erms
10.29% Xorg [kernel.kallsyms] [k] fb_deferred_io_mkwrite
2.91% Xorg [kernel.kallsyms] [k] __softirqentry_text_start
2.42% Xorg [kernel.kallsyms] [k] __do_page_fault
[root@archdev ~]# perf report -i perf.data_17 | grep -v ^# | head -n5
57.68% Xorg libc-2.29.so [.] __memmove_avx_unaligned_erms
27.49% Xorg [kernel.kallsyms] [k] fb_deferred_io_mkwrite
6.20% Xorg [kernel.kallsyms] [k] __do_page_fault
0.54% Xorg [kernel.kallsyms] [k] __handle_mm_fault
0.54% Xorg [kernel.kallsyms] [k] _raw_spin_unlock_irq
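One way to compare reports like these is to sum the overhead per shared object instead of reading it symbol by symbol. The following is a minimal sketch, assuming the five-column `perf report` layout shown above (overhead, command, shared object, symbol type, symbol). The helper name `overhead_by_object` is hypothetical, and the sample text is simply the fbdev output from perf.data_master and perf.data_17 above.

```python
from collections import defaultdict


def overhead_by_object(report_text):
    """Sum the overhead percentages of a `perf report` listing per shared object."""
    totals = defaultdict(float)
    for line in report_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and perf's comment header
        # Assumed columns: overhead, command, shared object, symbol type, symbol
        overhead, _command, shared_object, _kind, _symbol = line.split(None, 4)
        totals[shared_object] += float(overhead.rstrip("%"))
    return dict(totals)


# Top five entries of the fbdev runs quoted above.
MASTER = """
42.74% Xorg libpixman-1.so.0.38.4 [.] sse2_blt.part.0.lto_priv.0
32.93% Xorg libc-2.29.so [.] __memmove_avx_unaligned_erms
10.29% Xorg [kernel.kallsyms] [k] fb_deferred_io_mkwrite
2.91% Xorg [kernel.kallsyms] [k] __softirqentry_text_start
2.42% Xorg [kernel.kallsyms] [k] __do_page_fault
"""

BRANCH_17 = """
57.68% Xorg libc-2.29.so [.] __memmove_avx_unaligned_erms
27.49% Xorg [kernel.kallsyms] [k] fb_deferred_io_mkwrite
6.20% Xorg [kernel.kallsyms] [k] __do_page_fault
0.54% Xorg [kernel.kallsyms] [k] __handle_mm_fault
0.54% Xorg [kernel.kallsyms] [k] _raw_spin_unlock_irq
"""

master_totals = overhead_by_object(MASTER)
branch_17_totals = overhead_by_object(BRANCH_17)

for name in sorted(set(master_totals) | set(branch_17_totals)):
    print(f"{name:28s} master={master_totals.get(name, 0.0):6.2f}%  "
          f"1.7={branch_17_totals.get(name, 0.0):6.2f}%")
```

Grouped this way, libpixman shows up only in the master profile, which matches the observation that the pixman blit is new relative to branch-1.7. Note that these are shares of Xorg's samples within each run, not absolute CPU time, so the two columns are not directly comparable in magnitude.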
On an AMD G-T40N 800 MHz CPU with turbofb it is noticeable.
root@tonk-1502:~/src/tigervnc# perf report -i perf.data_master | grep -v ^# | head -n5
48.87% Xorg libpixman-1.so.0.36.0 [.] sse2_blt.part.0
46.03% Xorg libc-2.28.so [.] __memcpy_ssse3
0.53% Xorg libpixman-1.so.0.36.0 [.] sse2_composite_over_8888_8888
0.13% Xorg [kernel.kallsyms] [k] __hrtimer_run_queues
0.10% Xorg [kernel.kallsyms] [k] try_to_wake_up
root@tonk-1502:~/src/tigervnc# perf report -i perf.data_17 | grep -v ^# | head -n5
95.32% Xorg libc-2.28.so [.] __memcpy_ssse3
0.29% Xorg libshadow.so [.] shadowUpdatePacked
0.17% Xorg [kernel.kallsyms] [k] ktime_get_update_offsets_now
0.14% Xorg [kernel.kallsyms] [k] __update_load_avg_se
0.14% Xorg [kernel.kallsyms] [k] __indirect_thunk_start
On the AMD machine with Debian, the call into pixman comes from here:
- 43.36% 43.36% Xorg libpixman-1.so.0.36.0 [.] sse2_blt.part.0
_start
__libc_start_main
dix_main
Dispatch
ProcRenderComposite
damageComposite
fbComposite
pixman_image_composite32
sse2_composite_copy_area
sse2_blt (inlined)
+ sse2_blt (inlined)
Bisecting the sources points to this commit: 403ac27d Abstract platform rendering to "surfaces"
I do understand that this is because of the support for blending, alpha and other fancy things that are useless in VNC, but I cannot find a sane way to disable or replace it in the master branch (I do not know xorg-fu yet). @CendioOssman can you elaborate on the possibility of fixing this, please?
So that does confirm it is the XRENDER stuff. It is not something that we can replace though, and there might not be much we can do.
If you play a video of a similar size in Firefox, do you get similar performance issues? It should use the same API.
I assumed that this effect would be visible on any quickly changing picture. In my test I was using the xfce screensaver with about 8 rectangles visible on screen and changing brightness (picture from the internet).
Is it possible for a different program (Firefox) to behave differently?
Yes, X11 has multiple ways of getting the graphics on screen. Some are simple (but generally fast), and some have more features but have more requirements on the hardware.
Perf data for Firefox and mpv playing the same video on a VM with Xorg (where vncviewer was tested).
[root@archdev ~]# perf report -i perf.data_ff | grep -v ^# | head -n5
27.61% Xorg libpixman-1.so.0.38.4 [.] sse2_blt.part.0.lto_priv.0
21.47% Xorg [kernel.kallsyms] [k] qxl_image_init
9.20% Xorg [kernel.kallsyms] [k] alloc_vmap_area
6.75% Xorg [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
6.75% Xorg [kernel.kallsyms] [k] qxl_bo_kmap_atomic_page
[root@archdev ~]# perf report -i perf.data_mpv | grep -v ^# | head -n5
29.38% Xorg libc-2.29.so [.] __memmove_avx_unaligned_erms
26.80% Xorg [kernel.kallsyms] [k] qxl_image_init
11.34% Xorg [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
5.15% Xorg [kernel.kallsyms] [k] alloc_vmap_area
4.64% Xorg [kernel.kallsyms] [k] qxl_bo_kmap_atomic_page
While playing with mpv, Xorg consumes about 3% of CPU, and with Firefox about 1% (but overall Firefox itself consumes a lot more CPU).
In the vncviewer case I see both memmove_avx_unaligned_erms and sse2_blt.part.0.lto_priv.0 working at the same time.
Hmmm.... Perhaps Firefox does a lot of the compositing manually in that case.
I did manage to provoke the issue better here now by putting vncviewer inside another Xvnc. I'm not seeing any memcpy/memmove, but I am seeing two distinct paths to sse2_blt. So I think it's the same thing.
The underlying issue seems to be the simple fact that we are now copying the data a few times more than before. It might be possible to avoid some of that copying, but probably not all and probably not in a trivial way.
The old code had just a single copy: from the VNC buffer directly to the window. This allowed no changes to the data to add anything else, so another model was needed.
The new code has three copies: first from the VNC buffer to a pixmap, second when assembling the complete picture in a back buffer, and third when copying the back buffer to the window.
On most systems those two copies are done by the graphics card, so they cause very little load. In pure software systems they are getting noticeable though.
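To put the copy counts in perspective, here is a rough back-of-the-envelope sketch. The workload is an assumption for illustration only (a 1920x1080 session at 4 bytes per pixel with 30 full-frame updates per second) and is not taken from the measurements in this thread. It also ignores partial updates and any extra copying a driver such as fbdev does internally.

```python
def copy_traffic_bytes_per_second(width, height, bytes_per_pixel,
                                  updates_per_second, copies):
    """Bytes the CPU has to move per second if every copy is done in software."""
    frame_bytes = width * height * bytes_per_pixel
    return frame_bytes * updates_per_second * copies


# Hypothetical workload, chosen only for illustration.
WIDTH, HEIGHT, BYTES_PER_PIXEL, UPDATES_PER_SECOND = 1920, 1080, 4, 30

# Old code path: VNC buffer -> window.
old_path = copy_traffic_bytes_per_second(
    WIDTH, HEIGHT, BYTES_PER_PIXEL, UPDATES_PER_SECOND, copies=1)

# New code path: VNC buffer -> pixmap -> back buffer -> window.
new_path = copy_traffic_bytes_per_second(
    WIDTH, HEIGHT, BYTES_PER_PIXEL, UPDATES_PER_SECOND, copies=3)

print(f"old path (1 copy):   {old_path / 1e6:.1f} MB/s")
print(f"new path (3 copies): {new_path / 1e6:.1f} MB/s")
```

Under these assumptions the software-only path moves three times as much data per second as before, which is consistent with the extra load being invisible on GPU-accelerated systems and noticeable on fbdev.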
We might get rid of one copy by doing the compositing in vncviewer. We would lose hardware acceleration, but we don't do anything fancy right now so it might be acceptable. Getting it back to one copy looks impossible in the general case, unless someone comes up with something clever.
It might be worth doing a special case when there are no extra things to composite. But it might be confusing and annoying for the user if things just go to a crawl once anything extra shows up.
> We might get rid of one copy by doing the compositing in vncviewer. We would lose hardware acceleration, but we don't do anything fancy right now so it might be acceptable.
Note that this might be short lived. One popular feature request is scaling, which we most likely want to have hardware acceleration for.
> Note that this might be short lived. One popular feature request is scaling, which we most likely want to have hardware acceleration for.
What kind of scaling are you talking about? Is it window size scaling or something else? If it is window size scaling, then it's a one time (or so) per session operation, so let it be hardware accelerated -)
If the session is smaller than the client window, then some users would like the session to be "zoomed in" until it fills the client window. This tends to be CPU intensive and needs to be done on everything being redrawn.
> If the session is smaller than the client window, then some users would like the session to be "zoomed in" until it fills the client window. This tends to be CPU intensive and needs to be done on everything being redrawn.
OK, I see. I agree this is important, but it is a special case for some users, while a non-accelerated X server is a very common problem on modern ARM boards, even ones with proprietary video drivers, which don't exist for the Xorg server (only for Wayland, for example). Look at the boards based on NXP i.MX8 series SoCs.
May I hope that a new version for non-accelerated X servers will appear soon? -)
Unfortunately that is not likely right now. It is not something we're prioritising right now. Anyone else is free to look at it though and suggest patches. It's also possible to add a bounty to this issue to attract developers.
It would probably be best to make any fancy GPU features optional via a client option, so anyone can choose whether they are needed (at the cost of CPU usage when GPU acceleration is missing) or not. However, I guess maintaining two ways of talking to X is too much coding effort now.
But just as a general idea for future changes on top of the basics, it's nice to keep things modular and optional, especially when it implies new dependencies and increased hardware demand.
The "Present" extension might be something to look at as it might avoid one of the copies.
Describe the bug
vncviewer ver.1.9.0, vncviewer from "ThinLinc 4.10.0" and vncviewer from the "ThinLinc nightly build" load the X server to almost 100% on an ARM client.
To Reproduce
1. Run a VNC session or ThinLinc session.
2. In the session, run the screensaver "Pop art squares" in fullscreen mode, or any movie playback.
3. You will see that the X server loads the CPU up to 100%.
Expected behavior
With vncviewer ver.1.7.8 (a very old version) the X server loads the CPU to 40-50%. I expect new vncviewer versions to stay within these bounds.
P.S. With an x86 client everything is OK!