TigerVNC / tigervnc

High performance, multi-platform VNC client and server
https://tigervnc.org
GNU General Public License v2.0

vncviewer slow with unaccelerated X server (e.g. fbdev) #839

Open AAAPops opened 5 years ago

AAAPops commented 5 years ago

Describe the bug: vncviewer 1.9.0, the vncviewer from "ThinLinc 4.10.0", and the vncviewer from the ThinLinc nightly build each load the X server to almost 100% CPU on an ARM client.

To Reproduce:
1. Run a VNC session or a ThinLinc session.
2. In the session, run the "Pop art squares" screensaver in fullscreen mode, or play any movie.
3. Observe that the X server's CPU load rises to 100%.

Expected behavior: With vncviewer 1.7.8 (a very old version), the X server uses 40-50% CPU. I expected the new vncviewer versions to stay within those bounds.

Client (please complete the following information):

Server (please complete the following information):

P.S. With an x86 client everything is OK!

CendioOssman commented 5 years ago

I have not had time to test this unfortunately. Do you have access to other ARM devices to see if this is a general issue?

Other distributions might be worth testing as well. Perhaps a proprietary driver is needed for decent performance?

My first guess would be that it is because we use XRENDER for the graphics now, and that is really slow on that board. That would unfortunately mean that we probably can't do anything about it. But let's try to confirm that is the case...

AAAPops commented 5 years ago

I got the same X server behaviour (~100% CPU) on an x86 platform. All you need to do is change the standard X server video driver (RADEON on my notebook) to the FBDEV driver, which is a common solution for ARM boards.

As I wrote earlier, vncviewer 1.7.8 works perfectly with the FBDEV driver. Is it possible to bring back the old performance for weak thin clients?

CendioOssman commented 5 years ago

Not really, no. The XRENDER operations are needed to do the layout we want. There might be some tweaks to reduce the load a bit, but nothing major. I'm afraid such weak devices are not something we focus on. :/

I tried quickly here using Xephyr, but couldn't get any real load on it. Could it be a combination with your desktop environment? Have you tried any other one? E.g. compositing or OpenGL might be problematic for such a device.

AAAPops commented 5 years ago

In both cases (ARM and x86) I use a different DE (MATE and Xfce), neither of which requires any 3D acceleration, because the FBDEV driver does not support it.

My ARM and x86 devices are not really "weak"; the only disadvantage is the absence of 3D driver support! I guess that's quite a common situation -)

CendioOssman commented 5 years ago

Both of those use compositing though, which tends to push the X server quite hard. Could you check that it is disabled? It is usually under some advanced settings for the window manager.

AAAPops commented 5 years ago

First of all, I have to clarify my installations: ARM + Fluxbox and x86 + MATE. Fluxbox doesn't use compositing at all, and I disabled compositing in MATE manually.

In both cases the X server is still overloaded during a VNC session.

CendioOssman commented 5 years ago

Not sure what's going on then. Do you think you could do some tracing of the X server using perf?

AAAPops commented 5 years ago

Sure I can. Are there any specific options I should use, or shall I make a first attempt myself?

CendioOssman commented 5 years ago

Experiment and see if you can find something. You may need to install debug packages to get decent symbol names though.

yekm commented 4 years ago

I've gathered some perf data.

The effect is clearly visible on an average VM with qxl graphics. Xorg's CPU usage is several times higher with tigervnc from master than from branch-1.7.

[root@archdev ~]# perf report -i perf.data_qxl_17 | grep -v ^# | head -n5
    46.62%  Xorg     [kernel.kallsyms]  [k] qxl_image_init
    32.80%  Xorg     libc-2.29.so       [.] __memmove_avx_unaligned_erms
     3.86%  Xorg     [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     1.29%  Xorg     [kernel.kallsyms]  [k] find_next_iomem_res
     1.29%  Xorg     [kernel.kallsyms]  [k] preempt_count_add
[root@archdev ~]# perf report -i perf.data_qxl_master | grep -v ^# | head -n5
    47.22%  Xorg     libpixman-1.so.0.38.4  [.] sse2_blt.part.0.lto_priv.0
    25.91%  Xorg     libc-2.29.so           [.] __memmove_avx_unaligned_erms
    14.53%  Xorg     [kernel.kallsyms]      [k] qxl_image_init
     2.30%  Xorg     [kernel.kallsyms]      [k] __softirqentry_text_start
     1.45%  Xorg     [kernel.kallsyms]      [k] _raw_spin_unlock_irqrestore

With the fbdev driver the effect is very small, but the call into libpixman is still present. Perf data for the record:

[root@archdev ~]# perf report -i perf.data_master | grep -v ^# | head -n5
    42.74%  Xorg     libpixman-1.so.0.38.4  [.] sse2_blt.part.0.lto_priv.0
    32.93%  Xorg     libc-2.29.so           [.] __memmove_avx_unaligned_erms
    10.29%  Xorg     [kernel.kallsyms]      [k] fb_deferred_io_mkwrite
     2.91%  Xorg     [kernel.kallsyms]      [k] __softirqentry_text_start
     2.42%  Xorg     [kernel.kallsyms]      [k] __do_page_fault
[root@archdev ~]# perf report -i perf.data_17 | grep -v ^# | head -n5
    57.68%  Xorg     libc-2.29.so           [.] __memmove_avx_unaligned_erms
    27.49%  Xorg     [kernel.kallsyms]      [k] fb_deferred_io_mkwrite
     6.20%  Xorg     [kernel.kallsyms]      [k] __do_page_fault
     0.54%  Xorg     [kernel.kallsyms]      [k] __handle_mm_fault
     0.54%  Xorg     [kernel.kallsyms]      [k] _raw_spin_unlock_irq

On an AMD G-T40N 800 MHz CPU with turbofb it is noticeable.

root@tonk-1502:~/src/tigervnc# perf report -i perf.data_master | grep -v ^# | head -n5
    48.87%  Xorg     libpixman-1.so.0.36.0  [.] sse2_blt.part.0
    46.03%  Xorg     libc-2.28.so           [.] __memcpy_ssse3
     0.53%  Xorg     libpixman-1.so.0.36.0  [.] sse2_composite_over_8888_8888
     0.13%  Xorg     [kernel.kallsyms]      [k] __hrtimer_run_queues
     0.10%  Xorg     [kernel.kallsyms]      [k] try_to_wake_up
root@tonk-1502:~/src/tigervnc# perf report -i perf.data_17 | grep -v ^# | head -n5
    95.32%  Xorg     libc-2.28.so           [.] __memcpy_ssse3
     0.29%  Xorg     libshadow.so           [.] shadowUpdatePacked
     0.17%  Xorg     [kernel.kallsyms]      [k] ktime_get_update_offsets_now
     0.14%  Xorg     [kernel.kallsyms]      [k] __update_load_avg_se
     0.14%  Xorg     [kernel.kallsyms]      [k] __indirect_thunk_start
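To compare listings like these more easily, the overhead can be summed per shared object. The following is only a minimal sketch: the helper name `overhead_by_object` is hypothetical, and it assumes the default `perf report` text layout shown above (overhead, command, shared object, type, symbol). The sample rows are copied from the AMD listings above.

```python
from collections import defaultdict


def overhead_by_object(report_text):
    """Sum the overhead percentages of `perf report` rows per shared object."""
    totals = defaultdict(float)
    for line in report_text.splitlines():
        fields = line.split()
        # Assumed layout: overhead%, command, shared object, [type], symbol
        if len(fields) < 5 or not fields[0].endswith("%"):
            continue  # skip comments, blank lines and shell prompts
        totals[fields[2]] += float(fields[0].rstrip("%"))
    return dict(totals)


# Rows copied from the AMD G-T40N listings above.
MASTER = """\
    48.87%  Xorg     libpixman-1.so.0.36.0  [.] sse2_blt.part.0
    46.03%  Xorg     libc-2.28.so           [.] __memcpy_ssse3
     0.53%  Xorg     libpixman-1.so.0.36.0  [.] sse2_composite_over_8888_8888
"""
BRANCH_17 = """\
    95.32%  Xorg     libc-2.28.so           [.] __memcpy_ssse3
     0.29%  Xorg     libshadow.so           [.] shadowUpdatePacked
"""

for name, text in (("master", MASTER), ("1.7", BRANCH_17)):
    ranked = sorted(overhead_by_object(text).items(), key=lambda kv: -kv[1])
    for obj, pct in ranked:
        print(f"{name:>6}  {pct:6.2f}%  {obj}")
```

Note that these percentages are shares of the samples within each profile, so they show where Xorg spends its time, not how much total CPU it uses.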

On the AMD machine running Debian, the call into pixman comes from here:

-   43.36%    43.36%  Xorg     libpixman-1.so.0.36.0  [.] sse2_blt.part.0
     _start
     __libc_start_main
     dix_main
     Dispatch
     ProcRenderComposite
     damageComposite
     fbComposite
     pixman_image_composite32
     sse2_composite_copy_area
     sse2_blt (inlined)
   + sse2_blt (inlined)

Bisecting the sources points to this commit: 403ac27d, Abstract platform rendering to "surfaces".

I understand that this is there to support blending, alpha and other fancy things that are useless in VNC, but I cannot find a sane way to disable or replace it in the master branch (I do not know xorg-fu yet). @CendioOssman, can you elaborate on the possibility of fixing this, please?

CendioOssman commented 4 years ago

So that does confirm it is the XRENDER stuff. It is not something that we can replace though, and there might not be much we can do.

If you play a video of a similar size in Firefox, do you get similar performance issues? It should use the same API.

yekm commented 4 years ago

I assumed that this effect would be visible with any quickly changing picture. In my test I was using the Xfce screensaver, with about 8 rectangles visible on screen and changing brightness (picture from the internet).

Is it possible for a different program (Firefox) to behave differently?

CendioOssman commented 4 years ago

Yes, X11 has multiple ways of getting the graphics on screen. Some are simple (but generally fast), and some have more features but have more requirements on the hardware.

yekm commented 4 years ago

Perf data for Firefox and mpv playing the same video on a VM with Xorg (where vncviewer was tested).

[root@archdev ~]# perf report -i perf.data_ff | grep -v ^# | head -n5
    27.61%  Xorg     libpixman-1.so.0.38.4  [.] sse2_blt.part.0.lto_priv.0
    21.47%  Xorg     [kernel.kallsyms]      [k] qxl_image_init
     9.20%  Xorg     [kernel.kallsyms]      [k] alloc_vmap_area
     6.75%  Xorg     [kernel.kallsyms]      [k] _raw_spin_unlock_irqrestore
     6.75%  Xorg     [kernel.kallsyms]      [k] qxl_bo_kmap_atomic_page
[root@archdev ~]# perf report -i perf.data_mpv | grep -v ^# | head -n5
    29.38%  Xorg     libc-2.29.so       [.] __memmove_avx_unaligned_erms
    26.80%  Xorg     [kernel.kallsyms]  [k] qxl_image_init
    11.34%  Xorg     [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     5.15%  Xorg     [kernel.kallsyms]  [k] alloc_vmap_area
     4.64%  Xorg     [kernel.kallsyms]  [k] qxl_bo_kmap_atomic_page

While playing with mpv, Xorg consumes about 3% of CPU; with Firefox it consumes about 1% (but overall Firefox itself consumes a lot more CPU).

In the vncviewer case I see both memmove_avx_unaligned_erms and sse2_blt.part.0.lto_priv.0 working at the same time.

CendioOssman commented 4 years ago

Hmmm.... Perhaps Firefox does a lot of the compositing manually in that case.

I did manage to provoke the issue better here now by putting vncviewer inside another Xvnc. I'm not seeing any memcpy/memmove, but I am seeing two distinct paths to sse2_blt. So I think it's the same thing.

The underlying issue seems to be the simple fact that we are now copying the data a few times more than before. It might be possible to avoid some of that copying, but probably not all and probably not in a trivial way.

The old code had just a single copy: from the VNC buffer directly to the window. That left no room to modify the data or add anything else on top of it, so another model was needed.

The new code has three copies: once from the VNC buffer to a pixmap. Secondly when assembling the complete picture in a back buffer. Thirdly when copying the back buffer to the window.

On most systems those two extra copies are done by the graphics card, so they cause very little load. On pure software systems they become noticeable though.
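As a rough illustration of why the copy count matters on an unaccelerated server, here is a back-of-the-envelope sketch. The resolution, frame rate and pixel size are assumed values, not measurements from this thread, and the function name `copy_bandwidth_mb_s` is hypothetical. It counts only the bytes written when the whole frame is redrawn every time; real updates are often partial.

```python
def copy_bandwidth_mb_s(width, height, copies, fps=30, bytes_per_pixel=4):
    """Bytes written per second, in MB/s (10**6 bytes), for full-frame updates.

    `copies` is how many times each frame is copied on its way to the window.
    """
    return width * height * bytes_per_pixel * copies * fps / 1e6


# Copy counts as described above: one copy in the old code, three in the new.
for label, copies in (("old code, 1 copy", 1), ("new code, 3 copies", 3)):
    rate = copy_bandwidth_mb_s(1920, 1080, copies)
    print(f"{label}: about {rate:.0f} MB/s")
```

Whatever the assumed numbers are, the work scales linearly with the number of copies, so a software-only X server does roughly three times the copying it did before.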

We might get rid of one copy by doing the compositing in vncviewer. We would lose hardware acceleration, but we don't do anything fancy right now so it might be acceptable. Getting it back to one copy looks impossible in the general case, unless someone comes up with something clever.

It might be worth doing a special case when there are no extra things to composite. But it might be confusing and annoying for the user if things just go to a crawl once anything extra shows up.

CendioOssman commented 4 years ago

> We might get rid of one copy by doing the compositing in vncviewer. We would lose hardware acceleration, but we don't do anything fancy right now so it might be acceptable.

Note that this might be short lived. One popular feature request is scaling, which we most likely want to have hardware acceleration for.

AAAPops commented 4 years ago

> Note that this might be short lived. One popular feature request is scaling, which we most likely want to have hardware acceleration for.

What kind of scaling are you talking about? Is it window size scaling or something else? If it is window size scaling, then it's a one-time (or so) operation per session, so let it be hardware accelerated -)

CendioOssman commented 4 years ago

If the session is smaller than the client window, then some users would like the session to be "zoomed in" until it fills the client window. This tends to be CPU intensive and needs to be done on everything being redrawn.
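To illustrate why this kind of scaling is costly in software, here is a minimal nearest-neighbour sketch. The filter choice and the function name `scale_nearest` are assumptions made for illustration only; this thread does not say which scaling method vncviewer would use. The point is that every output pixel has to be produced again for each redrawn area.

```python
def scale_nearest(src, src_w, src_h, dst_w, dst_h):
    """Nearest-neighbour scale of a row-major list of pixels."""
    dst = []
    for y in range(dst_h):
        # Map the destination row back to the closest source row.
        row_start = (y * src_h // dst_h) * src_w
        for x in range(dst_w):
            # One lookup and one store per destination pixel.
            dst.append(src[row_start + x * src_w // dst_w])
    return dst


# A 2x2 "session" zoomed in to fill a 4x4 "window".
session = [1, 2,
           3, 4]
zoomed = scale_nearest(session, 2, 2, 4, 4)
for row in range(4):
    print(zoomed[row * 4:(row + 1) * 4])
```

The work grows with the number of destination pixels, and smoother filters cost more per pixel than this one.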

AAAPops commented 4 years ago

> If the session is smaller than the client window, then some users would like the session to be "zoomed in" until it fills the client window. This tends to be CPU intensive and needs to be done on everything being redrawn.

OK, I see. I agree this is important, but it is a special case for some users. An unaccelerated X server, on the other hand, is a very common problem on modern ARM boards, even ones with proprietary video drivers, because those drivers don't exist for the Xorg server (only for Wayland, for example). Look at the boards based on the NXP i.MX8 series of SoCs.

May I hope that a new version for unaccelerated X servers will appear soon? -)

CendioOssman commented 4 years ago

Unfortunately that is not likely right now. It is not something we're prioritising right now. Anyone else is free to look at it though and suggest patches. It's also possible to add a bounty to this issue to attract developers.

MichaIng commented 4 years ago

It would probably be best to make any fancy GPU features optional via a client option, so anyone can choose whether those are needed (at the cost of CPU usage when GPU acceleration is missing) or not. However, I guess maintaining two ways of talking to X is too much coding effort now.

But just as a general idea for future changes on top of the basics, it's nice to keep things modular and optional, especially when it implies new dependencies and increased hardware demand.

CendioOssman commented 2 years ago

The "Present" extension might be something to look at as it might avoid one of the copies.