cylc / cylc-ui

Web app for monitoring and controlling Cylc workflows
https://cylc.github.io
GNU General Public License v3.0
37 stars 27 forks source link

slow initial connection #1200

Open oliver-sanders opened 1 year ago

oliver-sanders commented 1 year ago

Updates

OP

Sometimes it takes a long time for the UI to connect with the backend.

It seems to get there in the end but it often takes in excess of 45 seconds to establish the subscription connection. The issue occurs randomly when the page is loaded and causes the no-connection error box hangs around.

Whilst we are waiting for the connection this message appears multiple times (Firefox):

 Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.

Which suggests multiple failed attempts at connection, however, after jamming some logging into the UIS (backend server) I can see no evidence of connection attempts (at least none which actually get into the websocket code). The UIS responds quickly once it receives the request.

It's not clear what's causing this, however, it is worth noting that we are using a deprecated framework for subscriptions, we should try moving to something more modern and see if the problem goes away by itself - #1028.

MetRonnie commented 1 year ago

I don't think I have ever encountered this on Chrome

oliver-sanders commented 1 year ago

Try FF/VDI, the lag can be up to one minute.

oliver-sanders commented 1 year ago

I've replicated this with offline mode on the Vue 3 branch :(

MetRonnie commented 1 year ago

Assuming this is just an issue on VDI, what could be the cause?

oliver-sanders commented 1 year ago

I'm assuming this is a general issue exacerbated by a laggy platform, you can see multiple connection attempts in the console.

hjoliver commented 1 year ago

(What's VDI - virtual desktop?)

oliver-sanders commented 1 year ago

I've done some investigation into this but have come out empty handed.

I've also tried migrating from graphql-transport-ws to graphql-ws, however, this did not fix the issue.

So far the error has only been reported in RedHat builds of Firefox where it presents like so:

 Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.

This error will persist across multiple retries, it may last for over one minute before clearing.

This happens randomly on page load, sometimes it connects straight away.

kinow commented 11 months ago

P.s. I used to see this when developing the UI. Never investigated the cause though. Sometimes ffox would keep showing the red icon at the top, showing that I was offline and there would be those js messages.

Other times it was after I opened the browser the first time. It happened with just that five workflow, or with multiple workflows (i.e probably not the amount of data).

However, when I used that fake uiserver (offline mode or random tests in python + pre made request/response) I would not get that.

So my guess would be something between browser, uis, and first xhr and/or wss request. Maybe setting a debugger in some parts of uis graphene or web request handling code that could be busy (handling an old request, waiting for some background task?) would replicate the issue...

oliver-sanders commented 10 months ago

This report on Bugzilla looks very similar:

Websocket ssl connection can be very slow to connect in Firefox

https://bugzilla.mozilla.org/show_bug.cgi?id=1478314

As does this issue:

Firefox takes multiple attempts to connect to websocket resulting in slow UI loads

https://github.com/fluidd-core/fluidd/issues/931

There a bunch of other similar sounding reports out there with varying levels of details. Here are some summarised statements pulled out of these reports:

I've replicated the issue with:

oliver-sanders commented 10 months ago

Issue now replicated on:

Chromium error:

WebSocket connection to '<URL>' failed: WebSocket is closed before the connection is established.

Gecko error:

 Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.
oliver-sanders commented 10 months ago

Connection Timeout

After some experimenting, I've seen a couple cases where connections failed on connection timeout. The timeout starts at 1s and increases with each failure. This diff bumps the default 1s connection timeout up to 10s:

diff --git a/src/graphql/index.js b/src/graphql/index.js
index cc252c96..141ee0ef 100644
--- a/src/graphql/index.js
+++ b/src/graphql/index.js
@@ -74,7 +74,9 @@ export function getCylcHeaders () {
 export function createSubscriptionClient (wsUrl, options = {}, wsImpl = null) {
   const opts = Object.assign({
     reconnect: true,
-    lazy: false
+    lazy: false,
+    minTimeout: 10000, // default 1000
+    // timeout: 300000 // default 30000
   }, options)
   const subscriptionClient = new SubscriptionClient(wsUrl, opts, wsImpl)
   // these are the available hooks in the subscription client lifecycle

Which appears to clear connection timeout issues, however, this would appear to be a distraction which does not solve the main issue.

Connection closed before it's established

Call stack (note line numbers shifted slightly by diff):

Screenshot from 2023-11-20 11-04-27

onclose first argument ``` close { target: WebSocket, isTrusted: true, wasClean: false, code: 1006, reason: "", srcElement: WebSocket, currentTarget: WebSocket, eventPhase: 2, bubbles: false, cancelable: false, … } ​ bubbles: false ​ cancelBubble: false ​ cancelable: false ​ code: 1006 ​ composed: false ​ currentTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … } ​​ binaryType: "blob" ​​ bufferedAmount: 0 ​​ extensions: "" ​​ onclose: function onclose(r, i, o)​​ onerror: function onerror(r)​​ onmessage: function onmessage(r)​​ onopen: function onopen() ​​ protocol: "" ​​ readyState: 3 ​​ url: "wss://localhost:8888/cylc/subscriptions" ​​ : WebSocketPrototype { close: close(), send: send(), url: Getter, … } ​ defaultPrevented: false ​ eventPhase: 2 ​ explicitOriginalTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … } ​ isTrusted: true ​ originalTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … } ​ reason: "" ​ returnValue: true ​ srcElement: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … } ​ target: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … } ​ timeStamp: 6989 ​ type: "close" ​ wasClean: false ​ : function isTrusted() ​ : CloseEventPrototype { wasClean: Getter, code: Getter, reason: Getter, … } ```
MetRonnie commented 10 months ago

Looks like I can reproduce the problem by using a custom network profile in Chrome with latency of ≥ 3 secs! This might explain why you in particular keep seeing this, as you are plagued by bad ping times?

oliver-sanders commented 10 months ago

I think that would cause the connection timeout issue I mentioned above which I think presents in the same way, though should go away after enough repeats (timeouts willing). (note ping time issues shouldn't affect connections to local servers).

To tell the difference, I've been jamming debugging into subscriptions-transport-ws/dist/client.js.

This bit handles connection timeout issues:

438     SubscriptionClient.prototype.checkMaxConnectTimeout = function () {
439         var _this = this;
440         this.clearMaxConnectTimeout();
441         var duration = this.maxConnectTimeGenerator.duration()
442         this.maxConnectTimeoutId = setTimeout(function () {
443             if (_this.status !== _this.wsImpl.OPEN) {
444                 _this.reconnecting = true;
445                 console.warn(`Closed by max connect timeout: ${duration}`)                                                     
446                 _this.close(false, true);
447             }    
448         }, duration);
449     };         

But this one is the real issue:

483         this.client.onclose = function (code, reason, wasClean) {
484             console.log('# onclose', code, reason, wasClean)
485             if (!_this.closedByUser) {
486                 // TODO: this would appear to be the source of the problem!
487                 console.warn('Closed by onclose')
488                 debugger
489                 _this.close(false, false);
490             }    
491         };    
oliver-sanders commented 10 months ago

From testing with Dave on Friday (who seldom notices the issue), we found evidence of this issue in the console more times than not, i.e. the issue hist most of the time, but isn't noticed unless it persists for long enough. Worth checking the console for these messages.

MetRonnie commented 10 months ago

I pretty much always have the devtools open when I access the UI. If the "you are offline" snackbar goes away within ~2 secs I never see the

WebSocket connection to '' failed: WebSocket is closed before the connection is established.

warning

oliver-sanders commented 10 months ago

And when it doesn't you do?

MetRonnie commented 10 months ago

With my 3s latency profile enabled, yes I get those warnings. I actually get a few before the main content is displayed on the page

oliver-sanders commented 10 months ago

That approach is forcing a timeout error which is a different problem (the socket gets closed in a different part of the code, see above).

Note, this issue is extremely intermittent, I've just had 30 reloads in a row with no failed connections whatsoever, it's not necessarily easy for me to reproduce, I just have to brute force it.

oliver-sanders commented 10 months ago

From manual inspection, the websocket error code is 1006 which is:

Reserved for situations where the connection was closed abnormally without sending a close frame when a status code was expected.

This appears to mean that the connection was closed unexpectedly (no close message), but I'm not sure how it's determining that, is ping-pong established this early on.

Note, the reason field is blank. This exactly matches the response you get when you stop a running server.

kinow commented 5 months ago

@oliver-sanders ' in this issue show what I can now easily reproduce locally :tada:

I was reading an article on JS and performance for Autosubmit UI, when I noticed the author commented something interesting about browser settings that control the maximum number of connections to a server. The author also mentioned about this being a problem to WebSockets, which is why some people used a service workers to have a single client connecting to the server.

Anyhoo, I opened Firefox, then about:config, and searched for websocket. It returned only 12, with a couple of entries for devtools (these can be discarded).

Everything was working fine in the UI, and network.websocket.max-connections was set to 200 by default. Then I opened a few tabs, reduced it to 30, opened a few more tabs with the Cylc GUI, and voilà

image

I started closing my tabs and... eventually it connected.

image

I can't spend much more time on this issue, but I think

Open questions (that I wish I had time to keep digging :nerd_face:)

So if anyone is experiencing this a lot, and if you have a lot of tabs, go to about:config and increase that number (assuming you consider it safe as the server & client memory would be affected).

Possible solution:

I don't have this issue as we are not using WebSockets in the Autosubmit UI, so this is as far as my curiosity took me :slightly_smiling_face: , hope that helps.

Bruno

ColemanTom commented 1 month ago

Just commenting (and following) that I see this a lot too. But, I haven't seen this error message mentioned, so I'll add it (it always happens after the "can't establish a connection to the server" line.

The connection to wss://..../cylc/subscriptions was interrupted while the page was loading.

ColemanTom commented 1 month ago

So if anyone is experiencing this a lot, and if you have a lot of tabs, go to about:config and increase that number (assuming you consider it safe as the server & client memory would be affected).

It isn't just if you have many tabs. I have a single tab open and this happens. The page eventually loads, but it is slow.