ping requests are sent even when traffic is ongoing

vasild commented 5 years ago

socketcluster would send ping requests (and expect responses in time) even when traffic is ongoing on a connection. In general, ping keep-alive packets have two purposes:

to check whether the peer is still there, e.g. to check that he has not dropped / crashed / disconnected from the network
to keep the state in stateful firewalls

so, it does not make sense to send keep-alive packets when other traffic is being transmitted over the connection. It only makes sense when the connection is otherwise idle. From https://en.wikipedia.org/wiki/Keepalive:

Typically TCP Keepalives are sent every 45 or 60 seconds on an idle TCP connection

On top of being useless, the behavior can cause disconnects due to bogus ping timeouts when heavy traffic is being transmitted - the ping packets could get delayed beyond pingTimeout if lots of data is being sent/received.

jondubois commented 5 years ago

The reason why it always sends ping/pong (instead of waiting for the connection to go idle first) is because it makes the code simpler and more consistent and sending pings/pongs a few times a minute is cheap.

Heavy traffic should not cause ping/pongs to be missed if you have a safe margin between pingTimeout and pingInterval. It's possible that a very large message from a client (e.g. several tens or hundreds of megabytes in a single message) could cause a pingTimeout but in that case, it indicates that the connection was saturated so a timeout is the appropriate response to protect the server from having to spend more time servicing that (potentially malicious) socket.

Also, as mentioned in the other issue, the client relies on the absence of a ping to detect a lost connection; otherwise to detect disconnection on both sides, we'd have to have pings and pongs going in both directions which is more expensive and creates more edge cases (server ping <-> client pong, client ping <-> server pong).

vasild commented 5 years ago

I agree that sending ping regularly is easier to implement and the code is simpler, but I think that in this particular case the effects of it go beyond "design decision" and into the "bug" category. Just consider this - the connection would be dropped due to a ping timeout (aka "the other side has gone away") while traffic is ongoing.

It's possible that a very large message from a client (e.g. several tens or hundreds of megabytes in a single message) could cause a pingTimeout but in that case, it indicates that the connection was saturated so a timeout is the appropriate response to protect the server from having to spend more time servicing that (potentially malicious) socket.

A 20 MiB message is sufficient to trigger the timeout with the default pingTimeout of 20 seconds if the transfer speed is 1 MiB/sec. In other words, it is not possible to transfer messages larger than 20 MiB. Just consider any other library, protocol or application that transfers data over a network - ftp, http, apache, openssl, filezilla, curl, wget, younameit - none of those exhibit such a behavior.

The correct behavior is to just keep going if data is being sent or received through the connection.

vasild commented 5 years ago

In the above case the connection is dropped with an obscure message: Error [BadConnectionError]: Event '...' was aborted due to a bad connection

iMoses commented 5 years ago

I agree with @vasild that it seems more like a bug than a design decision.

We could keep sending pings all the time, because it's easy and cheap, and simply add more validations before declaring a "bad connection", verifying that we are not in a middle of a transfer, and if we are simply wait for the next ping (or something like that)..

jondubois commented 5 years ago

@iMoses You can allow multiple failed attempts without disconnecting. For example, if you set pingTimeout to 30000 (30 seconds) and pingInterval to 9000 then you can miss 2 pings without disconnection. It will only disconnect if you miss 3 full consecutive pings.

jondubois commented 5 years ago

it is not possible to transfer messages larger than 20 MiB

It is possible but not recommended due to reasons beyond the pingTimeout. SocketCluster is optimized for handling JSON; objects larger than 20MB can take a while to parse and because JSON parsing is typically a synchronous operation in Node.js, it's often not a good idea to send such large objects in a single message.

That said, if necessary, it's possible to increase the pingTimeout to allow messages bigger than 20MB so it's not a hard limit.

Synchronous operations which freeze the event loop for too long should also cause pingTimeout. When it comes to healthchecks; if you need to check that a process is healthy, it's not enough to detect that it's running, you also need to check that it's responsive (e.g. not stuck in an infinite loop). The ping/pong should be thought of as a healthcheck for the connection; it's not only about connectivity.

vasild commented 5 years ago

@jondubois,

Network throughput can be highly volatile. The above example of 20MiB message is if the transfer speed is 1 MiB/sec - I am sure you know that this can vary a lot, either it could be a slower connection (e.g. on mobile or whatever, just slower), or it could in general be faster but occasionally have reduced throughput, maybe because being overloaded or for other reasons. For example one 20MiB message would transfer for the same time as one 2MiB, if 10 such smaller messages are transferred at the same time. If the app transfers 100 messages "concurrently", then we go to 200KiB limit per message on a stable 1MiB/sec connection.

If we have to set hard limits with pingTimeout and pingInterval, then we practically declare "it will not be possible to transfer a message if it takes longer than X seconds" whereas X will not adapt to network conditions. Safely choosing such a X to account for all throughput fluctuations means to set it so high, that the ping timeout is practically disabled.

jondubois commented 5 years ago

For front end use cases, it's not unreasonable to timeout clients if a single overly large packet is taking too long (e.g. > 20 seconds) to transfer across the network because that would result in a UX issue anyway.

For backend usage of SC, the case is stronger. The problem there in terms of implementation is that most popular client WebSocket APIs (which SC depends on) do not have an event for when individual TCP packets are received; you either get the whole message (based on the WebSocket frame) or you get nothing. This makes implementation difficult. Alternative implementations are more error prone.

It may be possible to find a middle ground and allow the user of SC to set an option to extend ping whenever a whole message is received (that way either pings or messages will prevent timeout) - This is what socket.io does. There is a small performance penalty for doing this but it's not too bad.

vermagaurav8 commented 1 month ago

I'm encountering persistent ping-pong timeouts in SocketCluster. My setup involves a basic server that sends 10,000 requests every 30 seconds to 5 connected clients (each consuming the published mutation). Each invokePublish request includes a 1024-character string.

I've set pingInterval to 20,000 ms and pingTimeout to 120,000 ms, which should be large enough, but I still experience intermittent ping-pong timeouts. This causes disruptions in the otherwise smooth connection and ultimately breaks the application's execution.

Any insights on how to resolve this issue would be greatly appreciated!

vasild commented 1 month ago

Disable the ping functionality by setting pingInterval and pingTimeout to a large value, e.g. 1 month?

jondubois commented 1 month ago

@vermagaurav8 the fact that connections otherwise operate smoothly could suggest that the ping timeouts aren't caused by the server struggling to keep up, but could be because of a different reason.

Is it possible that something related to your infrastructure (or network) is causing your sockets to become disconnected from time to time? (not related to server stress). Unexpected dropped connections can lead to ping timeouts.

In the past, I've encountered situations where proxies (e.g. nginx, HaProxy, apache) would quietly disconnect long-lived connections which lasted more than a few seconds. Last time I encountered this was when using Google's GKE; the Kubernetes Ingress load balancer would drop the WebSocket connections automatically after 30 seconds by default. I had to adjust the setting on the GKE control panel to allow long-lived connections. Before that adjustment, each socket would lose connection and auto-reconnect every 30 seconds and I'd get ping timeouts.

SC automatically recovers from lost connections by default so such 'quiet disconnections' might not be obvious at first.

Also overzealous firewalls (on the client or server) could potentially cause such issues.

If this is the cause of your issue and you cannot change the proxy or firewall settings, you will need to account for this disconnection scenario in your application logic. E.g. retry/resend missed messages (if relevant). You probably don't want to increase the pingTimeout too much as that would delay sockets which were quietly disconnected from being cleaned up. If that spams your logs too much, you could handle the error and calculate the rate of timeouts and log it less frequently.

SocketCluster / socketcluster

ping requests are sent even when traffic is ongoing #506