Xpra-org / xpra-html5

HTML5 client for Xpra
Mozilla Public License 2.0
203 stars 55 forks source link

Websocket connection gets lost due to unknown reasons #317

Open trawn-88 opened 1 month ago

trawn-88 commented 1 month ago

Our customers are encountering problems with a connection loss. This can happen at the start of Xpra. The Server started, the web-connection was established, but then the connection get lost and you receive a 502 response. I sent you the client and server logs via mail.

On Prod, we’re using:

ubuntu:20.04 XPra:4.4.6 XPra HTML5 Client:8.1 Wine: 8.0.2

On Dev, we’re using:

ubuntu:22.04 XPra:6.1 XPra HTML5 Client:14 Wine: 8.0.2

totaam commented 1 month ago

@trawn-88 I don't see any emails. Are you using a proxy?

trawn-88 commented 1 month ago

I have send it on Monday to totaam@xpra.org Do you want me to send it again? We are using Ingress on an nginx http-proxy to route into our kubernetes cluster

totaam commented 1 month ago

@trawn-88 no emails, please send it to my gmail. (same username at gmail)

totaam commented 1 month ago

From the javascript client log:

07:05:04.175 client closed Utilities.js:832:28
07:05:04.176 closing:  server shutdown Utilities.js:832:28

The server log contains nothing from xpra between these key events:

created tcp socket '0.0.0.0:8085'
no uinput module (not usually needed)
killing xvfb with pid 86

But, I find it quite odd how some of these server log messages are being duplicated (ie: 31.3GB of system memory), with slightly different timestamps too. Is it possible that you have two servers starting concurrently? This could easily explain all sorts of problems.


The second log shows:

websocket closed:  'Abnormal Closure' (1006) reason:  null , reconnect:  true , reconnect attempt:  0
connection-lost

This sounds a lot like unexpected EOF - also with a proxy.. So another possibility is that the proxy is deciding to drop the connection.


To get more information out of the server, you could run it with -d websocket or even -d network. But this is going to be very verbose, and I'm not convinced that this will show anything relevant. My guess is that the connection is being dropped by the proxy.

trawn-88 commented 1 month ago

Just in case, i send you more logs. Can you check them as well? Just to be sure

totaam commented 1 month ago

I am seeing some duplicated messages in the log samples you have sent:

$ egrep -h "killing|X11 display"  ExpertMode\ 502\ Server\ Log\ 2.txt 
2024-07-30 08:20:06.557 2024-07-30 08:20:06,557  connected to X11 display :0 with 24 bit colors
2024-07-30 08:19:51.862 2024-07-30 08:19:51,862 killing xvfb with pid 151
2024-07-30 08:16:21.777 2024-07-30 08:16:21,777  connected to X11 display :0 with 24 bit colors
2024-07-30 08:15:03.386 2024-07-30 08:15:03,385 killing xvfb with pid 86
2024-07-30 08:14:48.671 2024-07-30 08:14:48,671 killing xvfb with pid 88

Is this a single session in a container? Does the xpra server get re-spawned somehow? (could that go wrong?)

Do you allow concurrent connections? (aka sharing) Or did you connect twice to the same session (client 2):

$ grep -h -A 1 "disconnected" *Server\ Log*
2024-07-30 08:19:51.841 2024-07-30 08:19:51,841 HTML5 client 1 disconnected.
2024-07-30 08:19:51.828 2024-07-30 08:19:51,828  server shutdown
--
2024-07-30 07:54:18.664 2024-07-30 07:54:18,664 HTML5 client 2 disconnected.
2024-07-30 07:54:18.662 2024-07-30 07:54:18,662 1K packets sent (7.3M bytes)
--
2024-07-29 07:05:04.233 2024-07-29 07:05:04,232 HTML5 client 1 disconnected.
2024-07-29 07:05:04.190 2024-07-29 07:05:04,186  server shutdown

In 2 of the events above, the html5 client is booted out because the server is shutting down. This shutdown is initiated by a SIGTERM signal:

$ grep -h -A 2 "Disconnect" *Server\ Log*
2024-07-30 08:19:51.828 2024-07-30 08:19:51,828 Disconnecting client 127.0.0.1:48084:
2024-07-30 08:19:51.826 2024-07-30 08:19:51,825 X11 seamless server is terminating
2024-07-30 08:19:51.826 2024-07-30 08:19:51,825 X11 seamless server got signal SIGTERM
--
2024-07-29 07:05:04.190 2024-07-29 07:05:04,186 Disconnecting client 127.0.0.1:39118:
2024-07-29 07:05:04.152 2024-07-29 07:05:04,142 X11 seamless server is terminating
2024-07-29 07:05:04.152 2024-07-29 07:05:04,142 X11 seamless server got signal SIGTERM

Do you send the SIGTERM? How?

BTW, I see a lot of:

_XSERVTransmkdir: ERROR: euid != 0,directory /tmp/.X11-unix will not be created.

You can get rid of these warnings by creating the /tmp/.X11-unix directory before launching xpra.

The client logs don't have any useful messages in them:

$ grep -hr "closing:" ./Expert*Client*
08:19:51.897 closing:  server shutdown Utilities.js:832:28
08:19:51.900 closing:  server shutdown Utilities.js:832:28
08:19:51.938 closing:  server shutdown Utilities.js:832:28
08:20:59.027 closing:  server shutdown Utilities.js:832:28
08:20:59.030 closing:  server shutdown Utilities.js:832:28
08:20:59.069 closing:  server shutdown Utilities.js:832:28
07:05:04.176 closing:  server shutdown Utilities.js:832:28
07:05:04.180 closing:  server shutdown Utilities.js:832:28

But this is different from the previous log sample, which was showing websocket closed: 'Abnormal Closure' (1006). Here, the client is being told to disconnect when the server shuts down.

totaam commented 3 weeks ago

Using tcpkill to try to reproduce problems with a local server running on port 10000:

sudo tcpkill -i lo  port 10000

The broken connection can show up in random places, depending on when it triggers. Here's one for http asset download:

Traceback (most recent call last):
  File "/usr/lib64/python3.10/site-packages/xpra/net/http/handler.py", line 406, in send_head
    self.end_headers()
  File "/usr/lib64/python3.10/site-packages/xpra/net/http/handler.py", line 261, in end_headers
    super().end_headers()
  File "/usr/lib64/python3.10/http/server.py", line 535, in end_headers
    self.flush_headers()
  File "/usr/lib64/python3.10/http/server.py", line 539, in flush_headers
    self.wfile.write(b"".join(self._headers_buffer))
  File "/usr/lib64/python3.10/socketserver.py", line 826, in write
    self._sock.sendall(b)
ConnectionResetError: [Errno 104] Connection reset by peer

The server will no longer report those as scary errors: https://github.com/Xpra-org/xpra/commit/bad691a5554c989402ef4e799b11b210a18d4dc3

As for the client, this shows up as:

websocket closed:  'Abnormal Closure' (1006) reason:  null , reconnect:  true , reconnect attempt:  0
Utilities.js:832 connection-lost

Exactly like the one in this bug report.

The reconnect should work, unfortunately tcpkill also stops the client from loading the Javascript. I need to find a better tool.

totaam commented 3 weeks ago

Similar behaviour using an ssh tunnel that can be killed more easily with an exit (or control-d) + control-c:

xpra start --bind-tcp=0.0.0.0:10000 --start=xterm
for i in `seq 1 5`; do
    ssh -v localhost -L 10001:127.0.0.1:10000
done

Open a browser pointing to 127.0.0.1:10001. When killing the tunnel, the Javascript console shows:

websocket closed:  'Abnormal Closure' (1006) reason:  null , reconnect:  true , reconnect attempt:  0
Utilities.js:832 connection-lost
Utilities.js:832 audio: stopping stream
Utilities.js:832 close_audio_mediasource: audio_source_buffer=null, media_source=null, audio=null
Protocol.js:479 WebSocket is already in CLOSING or CLOSED state.
process_send_queue @ Protocol.js:479
(anonymous) @ Protocol.js:504
setTimeout
send @ Protocol.js:504
(anonymous) @ Protocol.js:630Understand this error
Utilities.js:832 cancel_all_files( closing ) will cancel: []
Client.js:337 [Violation] Added non-passive event listener to a scroll-blocking 'wheel' event. Consider marking event handler as 'passive' to make the page more responsive. See https://www.chromestatus.com/feature/5745543795965952
init_state @ Client.js:337
(anonymous) @ Client.js:2358
Utilities.js:832 connection-lost
Utilities.js:832 connection_progress( Connecting to server ,  localhost:10001/ ,  40 )
OffscreenDecodeWorkerHelper.js:53

The script has to re-start the tunnel quickly otherwise the re-connection hangs. Maybe we should handle this better - as the client doesn't seem to timeout in this case. Perhaps wait a little bit first too.

That said, the original query in this ticket is about identifying the cause of the browser's websocket connection Abnormal Closure (1006) and it seems that the only way that I can reproduce this is by forcibly closing the connection in the middle - be it a proxy, netcat, ssh tunnel, etc

Information not included in this ticket that is relevant:

totaam commented 3 weeks ago

Worth adding a link to the error handler for the 1006 websocket error code: https://github.com/Xpra-org/xpra-html5/blob/4fc0feadd95cfc01b2eafd7ecd751ebeda9dc17c/html5/js/Client.js#L2313-L2320

totaam commented 3 weeks ago

With the commit above, the html5 client handles unavailable server endpoints much more gracefully: it will retry to connect more quickly and does timeout if the server really has gone away.

trawn-88 commented 3 weeks ago

Thanks very much! We'll try it. But, as it seems, there are three Proxy-Steps between the Client and the Server. Following your original statement, that a Proxy should be responsible for the connection closure, we are also investigating this possibility

totaam commented 3 weeks ago

This looks relevant: https://stackoverflow.com/questions/49408031/websockets-in-chrome-and-firefox-disconnecting-after-one-minute-of-inactivity We have an application level ping in xpra, but perhaps chrome doesn't implement the protocol level ping - or, it may stop sending it when tabs are inactive?

totaam commented 3 weeks ago

I have also found a bug in the html5 client which prevented application layer ping packets from being sent to the server - all except for the very first one. (fix in 4f148c36a363b4cb2b0fe3fb2daa59ebe8568b7d) Although this sounds like something that could cause some middleware to drop the connection due to inactivity - in practice this should make no difference as the server was still sending its own pings and the client was still responding to them. And as far as middleware is concerned, ping-echo responses are indistinguishable from the pings themselves. (only the application layer knows the difference) The only effect of this html5 client bug would be on the server, which would not have access to the "latency measured from the client" values - this is only used by the server, it is only one of many data samples used to drive the heuristics - so not a major problem if it goes missing. The fact that we didn't notice for so long goes to show that perhaps it isn't very useful - beyond reporting that latency data via xpra info. But, still worth a try? Maybe it will help somehow?

totaam commented 3 weeks ago

To make it easier for you to test new versions, I have uploaded xpra-html5-v16 pre-release RPMs in the beta area: https://xpra.org/beta/

totaam commented 2 weeks ago

@trawn-88 if you are able to run xpra 6.2 beta, you could also enable -d ping to verify that the connection is being kept alive. (debug added in https://github.com/Xpra-org/xpra/commit/ab0a20622a67d36583a15d50ff2ed45229a670e8)

trawn-88 commented 2 weeks ago

I'll give that a try and send you the logs, ok? But it might take a while

totaam commented 2 weeks ago

OK, I'm still leaning towards an nginx proxy issue rather than a protocol layer issue, but this may help narrow it down.