Xpra-org / xpra

Persistent remote applications for X11; screen sharing for X11, MacOS and MSWindows.
https://xpra.org/

sometimes Xpra keeps on starting Xvfb-for-Xpra processes endlessly #4250

Closed: jhgoebbert closed this issue 2 months ago

jhgoebbert commented 3 months ago

We see that sometimes Xpra keeps starting Xorg-for-Xpra processes with constantly rising display numbers. After 6 hours, 1800 processes are running (a new one pops up every ~10 seconds) and the system is blocked.

Xorg-for-Xpra-S24055 -noreset -novtswitch -nolisten tcp +extension GLX +extension RANDR +extension RENDER -auth /p/home/jusers/.Xauthority -logfile /tmp/xpra_sockets_19601_8651gz3u/xpra/S24055/Xorg.log -configdir /tmp/xpra_sockets_19601_8651gz3u/xpra/S24055/xorg.conf.d/24055 -config /p/software/juwels/stages/2024/software/xpra/5.0.8-GCCcore-12.3.0/etc/xpra/xorg.conf -depth 24 -displayfd 12

I wonder if the reason for this is the timeout in write_displayfd: https://github.com/Xpra-org/xpra/blob/v5.0.8/xpra/platform/displayfd.py#L16 - instead of stopping Xpra when the timeout occurs, it seems to keep starting new "Xorg-for-Xpra" processes endlessly.

It could also come from read_displayfd, but that would not match the 10-second respawn frequency: https://github.com/Xpra-org/xpra/blob/v5.0.8/xpra/platform/displayfd.py#L50
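For illustration, here is a minimal sketch of the displayfd mechanism under discussion (my own assumption of how it works, not xpra's actual implementation): the X server writes its display number, followed by a newline, to the file descriptor passed via -displayfd, and the parent reads it back with a timeout:

    import os
    import select

    def read_display_number(fd: int, timeout: float = 20.0) -> int:
        # The X server writes its display number plus a newline to the
        # fd given via "-displayfd"; wait for it, up to `timeout` seconds.
        buf = b""
        while b"\n" not in buf:
            readable, _, _ = select.select([fd], [], [], timeout)
            if not readable:
                raise TimeoutError(f"no display number received within {timeout}s")
            chunk = os.read(fd, 128)
            if not chunk:
                raise EOFError("displayfd closed before a display number was sent")
            buf += chunk
        return int(buf.split(b"\n", 1)[0])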


We are using Xpra 5.0.8 on Linux (Red Hat 8), and xpra is started by https://github.com/FZJ-JSC/jupyter-xprahtml5-proxy with the following command:

xpra start --html=on --bind=/tmp/xpra_sockets_19601_8651gz3u/xpra-server,auth=none --socket-dir=/tmp/xpra_sockets_19601_8651gz3u --start=xterm -fa "DejaVu Sans Mono" -fs 14 --clipboard-direction=both --no-keyboard-sync --no-mdns --no-bell --no-speaker --no-printing --no-microphone --no-notifications --no-systemd-run --sharing --no-daemon
totaam commented 3 months ago

I'm confused: xpra only starts the vfb once, on startup. Why would you get 1800 processes?

jhgoebbert commented 3 months ago

Thank you for your reply.

If xpra does not retry starting the vfb but simply exits with an error, then it might be jupyter-server-proxy that restarts the process over and over again. 🤔

But could it be that xpra does not clean up the Xvfb-for-Xpra process if it fails to start? Perhaps here: https://github.com/Xpra-org/xpra/blob/v5.0.8/xpra/platform/displayfd.py#L37

I have to investigate this further. So far I have only observed the behavior and am trying to find out why it sometimes happens. Unfortunately, it is quite rare and difficult to reproduce - at least until I know how to force it.

totaam commented 3 months ago

@jhgoebbert the https://github.com/Xpra-org/xpra/blob/7c3ed93a11817f23af809c4ea82b680c33ef6160/xpra/platform/displayfd.py#L16 function is used when xpra's own --displayfd command line option is given, typically so that a process wrapping xpra can be told which display was allocated.

The one used for parsing the display number sent back by the vfb subprocess is: https://github.com/Xpra-org/xpra/blob/7c3ed93a11817f23af809c4ea82b680c33ef6160/xpra/platform/displayfd.py#L50

I assume that this is the one you are referring to. It is called from here: https://github.com/Xpra-org/xpra/blob/7c3ed93a11817f23af809c4ea82b680c33ef6160/xpra/x11/vfb_util.py#L283-L313 - I think you're right: the exception handler should probably kill the vfb at that point.
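A hedged sketch of what such a fix could look like (start_vfb_and_get_display and read_display_number are hypothetical helpers for illustration, not xpra's API): if reading the display number fails, terminate the vfb instead of leaving it behind:

    import subprocess

    def start_vfb_and_get_display(cmd: list, fd: int):
        # launch the vfb, letting it inherit the write end of the displayfd pipe
        proc = subprocess.Popen(cmd, pass_fds=(fd,))
        try:
            # read_display_number: the hypothetical helper sketched earlier
            display = read_display_number(fd)
        except (TimeoutError, EOFError):
            # without this cleanup, a vfb that never reports its display
            # would be left running as an orphaned "Xorg-for-Xpra" process
            proc.terminate()
            proc.wait()
            raise
        return proc, display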

This problem should be pretty rare; after all, the default timeout is 20 seconds! https://github.com/Xpra-org/xpra/blob/7c3ed93a11817f23af809c4ea82b680c33ef6160/xpra/platform/displayfd.py#L13

totaam commented 3 months ago

Please try 8a4a964268eef58ad9e3007831772a14cf261ae7 (untested). You should be able to reproduce it more readily using XPRA_DISPLAY_FD_TIMEOUT=1.
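For example, combining this with the start command shown later in this thread (assuming the XPRA_DISPLAY_FD_TIMEOUT variable is simply read from the environment of the xpra process, as is usual for XPRA_* tunables):

    XPRA_DISPLAY_FD_TIMEOUT=1 xpra start --start=xterm --no-daemon

This lowers the displayfd timeout to 1 second, so the failure and cleanup path should be hit far more easily.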

jhgoebbert commented 3 months ago

Hi @totaam, thank you very much for your patch! It looks good: it will ensure that no "zombie" X servers are left behind. Perfect. I have not been able to test it yet though, because our systems are down at the moment.

In the meantime I am still trying to understand why Xpra gave up and exited in some situations in the first place. At the moment we have the impression that the following is happening:

1) Xpra starts an X server. This X server searches for a free display by itself, presumably via the entries in /tmp/.X11-unix. If it finds, for example, that the next free display is :4, it starts on :4.
2) Xpra then asks the X server for the display number it has chosen and checks /tmp/<HOSTNAME>-<DISPLAYNUM> to see whether an Xpra socket file for display :4 already exists (sketched below).
3) If it does - perhaps because another Xpra session of some other user did not clean up correctly and left the socket file behind (?) - Xpra concludes that the display number is already in use by another Xpra server running on :4, and so it exits.
4) In our case, jupyter-server-proxy then detects that the Xpra process died and restarts it.
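For illustration, roughly the check we suspect in step 3 (our assumption, not xpra's actual code), using the socket path layout described in step 2:

    import os
    import socket

    def xpra_socket_exists(display_num: int, socket_dir: str = "/tmp") -> bool:
        # is there already an xpra socket file /tmp/<HOSTNAME>-<DISPLAYNUM>?
        path = os.path.join(socket_dir, f"{socket.gethostname()}-{display_num}")
        return os.path.exists(path)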

We cannot yet understand how step 3 could lead to a loop in which 1800 X servers got started. The only idea I have is that Xpra started a new X server on a new display each time, but always connected to the first X server it started, so every iteration of the loop ended in the same exit situation. Your patch might solve this as well.

I will come back to you as soon as I have tested your patch.

jhgoebbert commented 3 months ago

btw: we always saw this behavior when xpra list showed an INACCESSIBLE session. I assume the reason for an INACCESSIBLE session is that another user is running Xpra on that display, or that a socket file has not been cleaned up.

[]$ xpra list
Found the following xpra sessions:
/tmp:
    INACCESSIBLE session at :1
    INACCESSIBLE session at :2
    LIVE session at :33
    INACCESSIBLE session at :4
    INACCESSIBLE session at :5

This then leads to the following error when starting xpra:

xpra start --start=xterm --no-daemon
using systemd-run to wrap 'seamless' xpra server subcommand
Running scope as unit: run-r9384454326834c31a92b18ffc5554b26.scope
2024-06-06 15:23:48,043 Warning: cannot enable SSH socket upgrades
2024-06-06 15:23:48,043  No module named 'paramiko'
2024-06-06 15:23:48,046 no uinput module (not usually needed)
_XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_XSERVTransMakeAllCOTSServerListeners: server already running
_XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_XSERVTransMakeAllCOTSServerListeners: server already running
_XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_XSERVTransMakeAllCOTSServerListeners: server already running
_XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_XSERVTransMakeAllCOTSServerListeners: server already running
_XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_XSERVTransMakeAllCOTSServerListeners: server already running
[..]
2024-06-06 15:23:50,367 xpra server initialization error:
2024-06-06 15:23:50,367  An xpra server is already running at '/tmp/jwvis02.juwels-5'
totaam commented 3 months ago

> Xpra concludes that the display number is already in use by some other Xpra server running on :4 and so it then exits.

No, it will probe this socket to see whether it belongs to a dead server or not. If the socket's modified timestamp is recent, it will wait longer, potentially waiting for a server to complete its startup sequence. (Servers continuously update this timestamp to help other processes detect the socket as "alive".)
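A minimal sketch of that probing idea (an illustration of the approach, not xpra's actual code): check how recently the socket file was touched, and only try connecting once it no longer looks like a server that is still starting up:

    import os
    import socket
    import time

    def probe_socket(path: str, fresh_secs: float = 60.0) -> str:
        try:
            mtime = os.stat(path).st_mtime
        except FileNotFoundError:
            return "gone"
        if time.time() - mtime < fresh_secs:
            # recently touched: the server may still be starting up, wait longer
            return "recent"
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            s.connect(path)
            return "alive"
        except ConnectionRefusedError:
            # nothing listening: a leftover socket file from a dead server
            return "dead"
        finally:
            s.close()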

Servers can take a long time to start up because of pyxdg and the loading of all the commands' icons.

> I assume the reason for an INACCESSIBLE session is that another user is running Xpra on that display, or that a socket file has not been cleaned up.

The INACCESSIBLE state is defined here: https://github.com/Xpra-org/xpra/blob/e10a1e8b9a108da85a77dca333db756396062c70/xpra/common.py#L65

The only way for xpra to show this state is from here: https://github.com/Xpra-org/xpra/blob/e10a1e8b9a108da85a77dca333db756396062c70/xpra/platform/dotxpra.py#L118-L119

errno.EACCES typically means that the permissions on the socket do not allow the current user to connect to it. Unless someone manually does something very weird with chmod, this means that the socket belongs to a different user.

I see that you're using /tmp for your sockets. That's not a good location on a multi-user system, and even on a single-user system /tmp is special (sticky bit). If you are unable to use XDG_RUNTIME_DIR, you may want to add users to the xpra group and use /run/xpra instead.
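As an illustration of that classification (a sketch of the idea only; the real logic lives in dotxpra.py at the link above):

    import errno

    def classify_probe_error(err: OSError) -> str:
        if err.errno == errno.EACCES:
            # permissions forbid connecting: almost always another user's socket
            return "INACCESSIBLE"
        if err.errno == errno.ECONNREFUSED:
            # nothing listening: a stale socket file left behind by a dead server
            return "DEAD"
        return "UNKNOWN"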

jhgoebbert commented 2 months ago

Thank you for all these great hints.