foundryvtt / foundryvtt

Public issue tracking and documentation for Foundry Virtual Tabletop - software connecting RPG gamers in a shared multiplayer environment with an intuitive interface and powerful API.
https://foundryvtt.com/

Socket.io closes prematurely due to timeout before connection is established #3047

Closed anathemamask closed 3 years ago

anathemamask commented 4 years ago

Pursuing an issue that comes up in #Troubleshooting and #Installation-support on Discord with some frequency, but which has been hard to nail down a cause for, I went down the socket.io rabbit hole.

Problem: Players attempting to connect experience connection timeouts and stalls when trying to join a game on slow connections. I suspect this may be one of the more common causes for "two of my players can't log in, but others are fine."

Players who experience this seem to hit one of two errors: either the websocket stalls, sitting idle and doing nothing, or they receive a console-logged error: "WebSocket is closed before the connection is established."

Suspected issue: I haven't been able to replicate the exact conditions under which this happens, but it does seem to have two particular common factors in situations I've encountered it:

  1. The user or host has an internet connection somewhat less than recommended
  2. The hosting server almost always has a large front-load of data that must be passed to the user: usually a large volume of active modules that import localization files, or other sizeable downloads that trigger on user connection.

A dig into that error message suggests it occurs when socket.io fires a request to close prematurely, and the usual recommendation for resolving this exact error is to increase the pingTimeout and pingInterval values to prevent it.

Some relevant details here:

- https://stackoverflow.com/questions/30991326/socket-io-server-side-timeout-for-inactive-client-connection
- https://stackoverflow.com/questions/56954901/transport-timeout-too-short-causes-angular-socket-io-web-socket-to-close-before

~Bug ~Client
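For reference, the fix suggested in those threads looks roughly like this on the server side (a sketch only; the option names are socket.io engine settings, and the values here are illustrative, not Foundry's actual configuration):

```javascript
// Sketch of the fix suggested in the linked threads (values illustrative).
// pingInterval: how often the server pings the client.
// pingTimeout: how long to wait for a response before closing the socket.
const socketOptions = {
  pingInterval: 25000, // keep the usual 25 s cadence
  pingTimeout: 120000  // raise the timeout well above the small default
};

// The options would be passed when constructing the socket.io server, e.g.:
// const io = require('socket.io')(httpServer, socketOptions);
```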

anathemamask commented 2 years ago

Originally in GitLab by @aaclayton

I think it's unlikely that you are experiencing this exact issue, as we did resolve this problem quite some time ago (last year). Could you share details about what issue you are having that leads you to believe it is this problem?

anathemamask commented 2 years ago

Originally in GitLab by @atouzard

Hello everyone,

Is this issue fixed? I'm having the same problem in 0.8.9.

@kakaroto Where do I need to change the pingTimeout? I found one variable set to 6e4 in express.mjs. Is this the place?

anathemamask commented 3 years ago

Originally in GitLab by @kakaroto

I also gave that advice once, but it didn't seem to have any effect for some reason (that happened in one of my own games). It also wouldn't work if the one locked out of their account is the GM themselves. I usually just tell the user to open the dev console, inspect the user selection combobox, and remove the "disabled" attribute on their user's option.

anathemamask commented 3 years ago

@kakaroto -- In situations like that, my usual instruction to a GM is to right click the ghosted user and kick them which seems to resolve it in most cases (and is fairly common behaviour in most video game lobbies).

Is this not the case?

anathemamask commented 3 years ago

Originally in GitLab by @kakaroto

Just a follow up to this by the way.

There does seem to be a consequence to this change: if the user loses their internet connection abruptly, the server doesn't recognize the disconnection for the full 10 minutes, meaning the user will not be able to log back into the game because their user will be greyed out (I'm still wondering why that's enforced).

In light of the discussion in #3038 and the request in #3063, I think a simple fix here would just be to remove the disabled state on logged-in users and perhaps show them as logged in via a different mechanism (a :heavy_check_mark: next to the name instead of disabling it).

Otherwise, the alternative might be to add a timeout on the socket itself. I think this is failing because of this: https://nodejs.org/api/net.html#net_socket_settimeout_timeout_callback

By default, a net.Socket does not have a timeout.

I don't know if that's just a Node socket timeout, but I assume it affects the underlying kernel-side socket timeout, which is why it doesn't detect the disconnection automatically. I assume setting the timeout to 30 seconds (socket.io's pingTimeout is still at 25 seconds, right?) will be enough to fix that behavior.
I think the only way to reproduce it would be to unplug the ethernet cable, then close the Foundry window.

anathemamask commented 3 years ago

Originally in GitLab by @aaclayton

Thanks for weighing in and explaining. I'll make that change and we can see how it goes in 0.7.1. I'm not inclined to push this change to the release branch without testing it a bit first in case there is any other socket weirdness that comes up due to having the long timeout.

anathemamask commented 3 years ago

Originally in GitLab by @kakaroto

The change from websocket to polling is only so you can reproduce it by using chrome dev tools' throttle function, that change should not get committed.

The TL;DR is to keep the pingInterval at 25 seconds but change the pingTimeout to 10 minutes. It won't have any adverse impact on detection of disconnections: if there is an actual disconnection, the OS kernel detects it via the TCP stack's own keepalive probes, and you would know that there was a disconnection. I believe the websocket pings are only there to prevent reverse proxies from considering the connection stale (i.e., connected but with no data flowing through it); I don't think they are there to actually detect disconnections, because the OS can do a much better job at that.

With that change (increasing pingTimeout to 10 minutes), it will keep sending pings every 25 seconds, but if a ping can't be received (because the socket is busy receiving a single huge chunk of data), it won't consider the connection closed for 10 minutes. If the connection is actually closed, as reported by the OS/Chrome, the disconnection is detected normally.
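In code, the recommendation amounts to something like this when constructing the socket.io server (a sketch; where exactly Foundry's server constructs its socket.io instance is not shown in this thread):

```javascript
// Sketch of the recommendation: keep the ping cadence, lengthen the
// tolerance before declaring the connection dead.
const options = {
  pingInterval: 25 * 1000,    // unchanged: ping every 25 seconds
  pingTimeout: 10 * 60 * 1000 // changed: wait 10 minutes before giving up
};

// const io = require('socket.io')(httpServer, options);
```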

anathemamask commented 3 years ago

Originally in GitLab by @aaclayton

@kakaroto can I get the TL;DR? The change of transport to "websocket" instead of "polling" was an intentional one designed to force use of websockets for users who were (incorrectly) getting stuck using XHR.

The other thing you mention is increasing the pingInterval, which would be easy to change, but I am concerned that a high value for this timeout will have an adverse impact on the server's ability to detect that someone has disconnected and make their login available once again.

Can I get a quick summary of your opinion?

anathemamask commented 3 years ago

Originally in GitLab by @kakaroto

I know this is slated for 0.7.1 (soon to be released, yay!), but I would appreciate having this also released on a 0.6.6 beta channel, as it's affecting a lot of users and it's a critical bug IMHO, so I'm hoping it lands in the more stable branches too.

anathemamask commented 3 years ago

Originally in GitLab by @kakaroto

it's not like its pings will detect a loss of connection that the OS itself isn't able to detect.

I think the pings may actually be there so that proxies and the browser itself detect activity on the connection and don't terminate it. So the pings would still make sense to have: while a 10-minute ping interval would work fine for receiving the world (as there is continuous data being received there), if there were no socket activity for 10 minutes, the connection might get closed by Chrome. Therefore, I'd suggest instead keeping the pingInterval at 25 seconds, but making the pingTimeout value 10 minutes, so it keeps sending pings every 25 seconds but only considers the connection timed out if it can't receive one for 10 minutes.

Note: I'm suggesting 10 minutes because the world I tested with took 8.9 minutes when throttled to "slow 3G". 10 minutes is still a very long time for someone to wait, but it was a 66MB world (25.9MB compressed) and slow 3G is, well, slow, so it's kind of expected anyway. I don't know if a lower timeout is worth it, and I don't know if a higher timeout would make sense either.

anathemamask commented 3 years ago

Originally in GitLab by @kakaroto

oh, euhh... TLDR: Server side, increase the pingInterval on the socket.io engine to 10 minutes please.

anathemamask commented 3 years ago

Originally in GitLab by @kakaroto

Note: Wrote this while debugging. Found the cause/solution at the end.

So, I found out that Chrome dev tools has a "throttle" option in the network tab! I tried it, and... well, nope, it doesn't help, because websockets ignore the throttle option.

That's a 5+ year old bug that still hasn't been fixed. See: https://bugs.chromium.org/p/chromium/issues/detail?id=423246

Also, it turns out firefox has the same problem...

So what I did was edit foundry.js to change the socket.io transport from 'websocket' to 'polling', and tried again. Using the "slow 3G" preset throttling option, here's a little screenshot of the issue within the Electron app:

[screenshot omitted]
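The reproduction-only client tweak described above looks roughly like this (a sketch; 'transports' is socket.io's client connection option, and the variable names here are illustrative, not the actual code in foundry.js):

```javascript
// Reproduction-only tweak: force HTTP long-polling so that dev tools'
// network throttling actually applies (throttling ignores websockets).
const connectOptions = {
  transports: ['polling'] // instead of the usual ['websocket']
};

// const socket = io(serverUrl, connectOptions);
```

As the comment below notes, this change is only for reproducing the bug under throttling and should never be committed.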

Note: I'm testing this on a world a user sent me. The world is 25.9MB gzipped as sent over the network, 66MB raw. I noticed that their world had a 28MB chat.db where one of the chat messages had a single 28MB base64-embedded image in its HTML content. It seems to have been caused by the chat-images button module: <img class="chat-images-image" src="data:image/webp;base64,UklGRkhl etc....

I took this screenshot after it finished progressing. It took 8.9 minutes and 25.9MB downloaded before the bytes-downloaded and time portions of that websocket polling stopped changing. So there doesn't seem to be a timeout on the connection itself, and it doesn't seem related to the websocket either (this was polling, so maybe it's a socket.io thing).

I checked the Content-Length in the response and it was 27159661, which matches the 25.9MB, so it looks like the world data did get entirely downloaded, but Foundry doesn't act on it.

I then tried the "Fast 3G" throttling option. This one is much faster and finished loading in 2.5 minutes instead. But that still didn't make it work, so whatever timeout is happening, it's closer to 2 minutes than 10...

I had set breakpoints after the socket.emit, but Foundry doesn't get the response, so I decided to try something else. I went into node_modules, copied socket.io-client/dist/socket.io.slim.dev.js into public/scripts, and renamed it to socket.io.slim.js so it gets loaded instead of the one Foundry serves.

Putting breakpoints in socket.io's Socket.onPacket, I could see that the socket's state had become "closed" by the time it received the data. When a socket gets closed, all of its callbacks are reset, so even though it internally calls this.emit('packet', data), it doesn't actually emit it anywhere, since there are no more callbacks registered; the packet is simply lost.

I added breakpoints at the places where this.readyState = "closed" is set, and found that the socket is closed due to a ping timeout.

I'll spare you the internal socket.io architecture, but suffice it to say that when a single Socket (whether a single HTTP polling connection or a single websocket connection) associated with the client-side socket Manager times out, that socket gets closed and all of its callbacks are discarded. The ping timeout is a server-side setting, and the values we receive are the defaults: ping every 25 seconds, timeout after 5 seconds. In other words, if a socket doesn't receive a ping within 30 seconds, the socket is closed (see the API).

[screenshot omitted]
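With the defaults quoted above, the effective deadline works out as follows (a worked sketch; the 25 s / 5 s figures are the defaults reported in this comment):

```javascript
// Defaults as reported above: the client must see a ping within
// pingInterval + pingTimeout, or it closes the socket.
const pingInterval = 25000; // server pings every 25 s
const pingTimeout = 5000;   // client waits 5 s past that for the next ping
const closeAfter = pingInterval + pingTimeout;

console.log(closeAfter); // 30000 ms: one missed ping window closes the socket
```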

In the polling case, that means that if it takes more than 30 seconds to get the world data, that specific polling connection becomes 'closed' and its data is lost, even if we do successfully receive it after 31 seconds. In the websocket case, there's a single connection, so while we are receiving the world data, the socket is blocked and cannot receive pongs, since they cannot be mixed with the other data. (Note that if it does send the ping, as shown in the screenshot of my previous comment, that must be because it created a separate Socket object, associated with the same client-side socket Manager object, which is independent and has its own time to live. So we receive a ping and the Manager stays alive, but the Polling object receiving the actual world response will itself time out.)

So, basically, the simple solution would be to increase the ping timeout server-side so that it doesn't time out within 30 seconds, which is an incredibly small amount of time, especially when some people face 50MB+ downloads on slow connections.

One side effect is that a loss of connectivity will 'theoretically' be harder for Foundry to detect: if you increase the timeout to 10 minutes, for example, then someone may have lost internet for 10 minutes without Foundry noticing. I think, though, that the pings within socket.io are pointless, because the operating system's TCP/IP stack will itself send keepalive probes to keep the connection alive and will notify socket.io of an actually closed socket (FIN) if the connection has been lost. socket.io implementing a ping/pong on top of the TCP stack seems redundant and completely useless in my opinion; it's not like its pings will detect a loss of connection that the OS itself isn't able to detect.

Also: with the ease of reproducing this behavior now, it would be really good to have the actual UI appear in advance rather than the grey outline, and to have a progress bar or spinning circle or something indicating the world is being loaded. That would improve the UX immensely.

Long term, it might be good to implement a partial load of the data, i.e., send the user list and current scene for immediate rendering, then slowly load the rest of the data (and since it wouldn't be a single-shot download, it could have a progress bar too). Chat could be loaded progressively instead of sending it all to the client.

I hope the ping timeout increase can land in a 0.6.6 if there's one planned soon (I'd expect the UX improvement to be a 0.7.x thing).

Thanks!

anathemamask commented 3 years ago

Originally in GitLab by @eadorin

I encountered this issue today on localhost. A bug in a module I was developing ran setFlag() in updateToken(), triggering an infinite loop. Upon refresh, I could no longer connect. None of the files in my data directory appeared large (fog was the largest at 3.7 MB). Restarting Foundry resolved the issue. I believe the issue is with some other database caching on the part of Electron/Node, at least in my instance.

anathemamask commented 3 years ago

Originally in GitLab by @ryan1pepin

I believe constrained resources, especially CPU, may be part of the issue. I set up a Foundry server on a Raspberry Pi 4 in the past two weeks. In most game systems it worked great, but I noticed that in PF1e 0.63 or so I was able to get it to reliably crash by opening the rather large bestiary compendium or requiring the preloading of compendiums (about 50-100 MB). Using PM2 Monit, memory didn't seem to be an issue, but rather the CPU: usage would spike to 150-200% when making a request for the bestiary, and it was then that the connection would time out. I increased the CPU speed from 1.5 to 2.1 and this helped the issue. A later PF1e update broke the bestiary into two parts, which reduced the issue further. Hope this helps!

Edit, extra notes: the Pi 4 is on gigabit ethernet connected to gigabit fiber up/down.

Based on the memory use of the process, preloading compendiums for PF1e was useless with default Node.js garbage collection. The server would preload the compendium in memory on connection, but garbage collection would clean it up within 30 seconds, and the clients still had to get the data from the server when opening the compendium.

anathemamask commented 3 years ago

Originally in GitLab by @kakaroto

That was also my original assumption, though I'm not sure about the timing out, because a websocket shouldn't time out, and because I had similar behavior (though temporary, and it has since disappeared) on localhost with a small world, so I don't know.

anathemamask commented 3 years ago

Reporting back on this issue.

Following up with the user whose players were affected by this last night: they were able to confirm that shifting all actors into compendia resolved the connection issues for their players.

The actor database at the time of world export contained nearly 2000 actors and was 45 MB.

My present theory on how this causes disconnections (pure speculation):

- Foundry initiates a websocket connection and starts the data transfer process for the database files that need to be loaded, transferring them out to players.

- In the time it takes to transfer a sizeable database of information to the player, the websocket connection times out.

- When the transfer finishes, Foundry attempts to complete the connection, but the websocket has already closed due to the timeout, so it simply sits there, doing nothing.

It would be very very easy to miss this in general use for most users, especially ones hosting on faster connections.

anathemamask commented 3 years ago

Originally in GitLab by @kakaroto

Yep, usually those with the issue have had thousands of actors or items, and clearing those out has fixed the problem for most of them, though it doesn't always fix it (and they usually can't enter their world to fix it).

anathemamask commented 3 years ago

Just conferred with another user affected by this and was able to confirm that although this affected all incoming connections in a world with a lot of imports from D&D Beyond, a fresh world does not experience the issue. The present operating theory is that a large number of entities not stored in compendia can cause this. I have the user attempting to migrate their world's data into compendia to confirm this is a viable workaround.

anathemamask commented 3 years ago

Originally in GitLab by @kakaroto

I've had a few reports of this issue as well, and most of the time it turns out the person had a large number of actors. I believe that's what "sizeable downloads that trigger on user connection" refers to: it's not about the scene background image, but rather about the world database itself.

In the user reports I've seen, most of the time the GM has used modules to import or generate a high number of actors (2000+), which all get downloaded when the user connects. Some of the players get the problem, some don't, and slow internet speed does seem to be a factor.

On Firefox, the error can show as: "The connection to wss://domain/socket.io/etc... was interrupted while the page was loading", and the only thing I found about this is a 5-year-old bug in Firefox (which was fixed 5 years ago).

[screenshot omitted]

I had been thinking it must be because the data to be downloaded (usually 50MB+) is slow to arrive, and that by the time it would be received, the websocket connection times out. But that doesn't seem to be the case, because I realized that:

I was also able to reproduce the bug locally earlier today on my local install (localhost, so the best connection speed you could hope for) on a small world (3MB of data). It was happening completely randomly, persisted over multiple restarts of the server (Electron app), then it would suddenly work, then break again, then it stopped happening.

I also believe it is unrelated to the number of installed modules, because this affects the websocket connection itself and the world data not being sent by the server.

[screenshot omitted]

Above is a screenshot of the websocket messages: you can see that the 3.8MB of world data took over 39 seconds to be returned by the server (it's currently taking only 500ms now that the bug isn't happening anymore). You can also see the automatic ping/pong socket.io probing 15 seconds in, which shows that the connection itself wasn't stalled on either client or server.

If it can stall for 39 seconds on my local install, then it can probably stall forever at some other point for someone else (it did refuse to load a bunch of times; I only got the screenshot for this one, where it ended up loading after 39 seconds). So I think this infamous issue is unrelated to database size, number of actors, or even slow connections. I think there's something else in the 'get world' handler on the server that can make it lock up.

Note: I had also gone to setup and checked the 'getSetupData' websocket request, which took 25 seconds to return instead of the usual 4 to 5 seconds. I don't know if it's related in any way.

I also received a world export from someone who had this problem a couple of days ago; I plan to import it on my Asian servers and test whether I can reproduce the problem with that world over a slow connection. Will let you know if I find anything new.

Hopefully this helps!

anathemamask commented 4 years ago

Originally in GitLab by @ariakas81

I am particularly interested in this one:

"The hosting server almost always has a large front-load of data that has to be passed to the user, usually a large volume of modules active that are involved in imports of localization files or other sizeable downloads that trigger on user connection"

Can anyone explain in more detail what "sizeable downloads that trigger on user connection" means? I thought they were just the active scene background image and its tokens...