AllenInstitute / datacube

configure websocket client autoping to serve as tcp keep alive #89

Closed chrisbarber closed 6 years ago

chrisbarber commented 6 years ago

The LTMs kill idle TCP sessions after 5 minutes, which is a sensible configuration. WebSocket auto ping is an obvious way to keep live sessions from appearing idle. Also, timeouts should probably be in place so that the crossbar router does not rely on the LTM to prevent a buildup of stale connections.

https://github.com/AllenInstitute/datacube/blob/dev/.crossbar/config-prod.json#L80-L81

Note that conn_bridge uses the anonymous/public /ws transport, so it will be subject to this configuration as well. The timeout should probably just be set high enough that even if conn_bridge is under load it won't have a problem responding in time. If this configuration is not suitable for conn_bridge at any point, it could be moved to using the authenticated transport (/auth_ws).

This might be a good opportunity to review the auto ping configuration on the service (/auth_ws) side as well. Auto ping was disabled there after extreme load testing showed that the services could struggle to respond in time, and under high load we would rather get an eventual response than have the service disconnected. Stale connections (arising from services going down multiple times without being able to cleanly close their connection) might still be a concern, in which case a large timeout may be more appropriate than leaving the feature totally disabled.

https://github.com/AllenInstitute/datacube/blob/dev/.crossbar/config-prod.json#L112-L113

Once auto ping is re-enabled on the client side, a test can be performed with an LTM in the loop to verify that the pings are enough to prevent the TCP sessions from appearing idle and being disconnected.
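As a rough sketch of the settings discussed above, the snippet below builds the WebSocket `options` fragment for the two transports and rejects values whose ping cycle would not fit inside the 5-minute LTM cutoff. The option names `auto_ping_interval` and `auto_ping_timeout` (in milliseconds) are the Crossbar WebSocket options as I understand them; the helper name and all numeric values are assumptions for illustration, not what config-prod.json contains.

```python
import json

# Idle cutoff quoted in this issue (5 minutes), in milliseconds.
LTM_IDLE_TIMEOUT_MS = 5 * 60 * 1000


def websocket_ping_options(interval_ms, timeout_ms, idle_limit_ms=LTM_IDLE_TIMEOUT_MS):
    """Build a Crossbar-style WebSocket "options" fragment (hypothetical helper)."""
    if interval_ms <= 0 or timeout_ms < 0:
        raise ValueError("interval must be positive and timeout non-negative")
    # Conservative bound: allow a full interval plus a full timeout of silence.
    if interval_ms + timeout_ms >= idle_limit_ms:
        raise ValueError("ping cycle is not shorter than the idle limit")
    return {"auto_ping_interval": interval_ms, "auto_ping_timeout": timeout_ms}


# Illustrative values only: short cycle for /ws, generous timeout for /auth_ws.
public_ws = websocket_ping_options(interval_ms=60_000, timeout_ms=30_000)
auth_ws = websocket_ping_options(interval_ms=60_000, timeout_ms=180_000)
print(json.dumps({"ws": public_ws, "auth_ws": auth_ws}, indent=2))
```

The printed fragments would go under each transport's `options` key; the actual numbers should come out of the load testing and the LTM test mentioned above.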

chrisbarber commented 6 years ago

Autobahn just keeps one pending ping at a time:

https://github.com/crossbario/autobahn-python/blob/7f0d1a5e678870f239e0364cd30e8e7afbabbd12/autobahn/websocket/protocol.py#L1713-L1732

https://github.com/crossbario/autobahn-python/blob/7f0d1a5e678870f239e0364cd30e8e7afbabbd12/autobahn/websocket/protocol.py#L1858-L1876

So for example, with an interval of 1s and a timeout of 5s, a session could sit idle for up to 6s before the router either receives a pong or drops the connection.

Even with autoPingTimeout disabled, autobahn will not send the next ping until a pong is received, so a disconnected client will still cause the session to become idle.
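To make the timing concrete, here is a small toy model of the one-pending-ping behaviour described above. It is not autobahn code: the function name, the assumption that the first ping goes out one interval after the connection opens, and the use of `None` for "pong never arrives" or "timeout disabled" are all mine.

```python
def ping_timeline(interval, timeout, pong_delays):
    """Toy model of one-pending-ping scheduling (not autobahn's implementation).

    pong_delays holds, per ping, the seconds until its pong arrives, or None
    if it never arrives. timeout=None models a disabled autoPingTimeout.
    Returns (ping_send_times, drop_time); drop_time is None if never dropped.
    """
    now = interval  # assumption: first ping is sent one interval after open
    sends = []
    for delay in pong_delays:
        sends.append(now)
        pong_missing = delay is None or (timeout is not None and delay > timeout)
        if pong_missing:
            if timeout is None:
                # No timeout: nothing is dropped, but no further ping is sent.
                return sends, None
            return sends, now + timeout
        # The next ping is scheduled only once the pong has been received.
        now = now + delay + interval
    return sends, None


# Interval 1s, timeout 5s: first pong is immediate, second never arrives.
print(ping_timeline(1, 5, [0, None]))
# Same traffic with the timeout disabled: the session just goes quiet.
print(ping_timeline(1, None, [0, None]))
```

In the first call the last pong is seen at t=1 and the drop happens at t=7, which is the 6s gap from the example above.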

chrisbarber commented 6 years ago

Regarding stale connections building up on the crossbar router, websocket-level timeouts are not necessary, as we can just rely on the TCP keepalive timeout (typically 2 hours on Linux). The only reason to configure autoPingTimeout, therefore, is to get quicker notification of a lost connection. We currently have no handling for that, and to my knowledge there are no expected use-cases for it. See here.
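For reference, a minimal sketch of what relying on TCP-level detection looks like on a plain socket. The 2-hour figure matches the Linux default for `tcp_keepalive_time` (7200s), which only applies to sockets that have `SO_KEEPALIVE` enabled; whether the router's sockets enable it is not something this thread confirms. The helper name is hypothetical and the `TCP_KEEP*` options are platform-specific, so they are applied only where the `socket` module exposes them.

```python
import socket


def enable_tcp_keepalive(sock, idle_s=7200, interval_s=75, probes=9):
    """Turn on TCP keepalive for sock (hypothetical helper).

    Defaults mirror the usual Linux sysctl defaults: first probe after
    idle_s seconds of silence, then one probe every interval_s seconds,
    giving up after `probes` unanswered probes.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    )
    for name, value in tuning:
        option = getattr(socket, name, None)  # absent on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


sock = enable_tcp_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
sock.close()
```

With these defaults a dead peer is noticed only after roughly 2 hours plus the probe phase, which is fine for cleaning up stale connections but far too slow to act as a keep-alive against a 5-minute idle cutoff.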

In general, we are not going to be able to preserve the connection for very long if the client becomes disconnected for a period of time. The correct thing to do is for the endpoints to reconnect gracefully.
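A graceful reconnect on the endpoints could look roughly like the following: retry the connection with capped exponential backoff instead of trying to keep a broken session alive. This is a generic sketch, not tied to autobahn or to the datacube services; `connect` is a placeholder for whatever callable opens the session, and the delay values are arbitrary.

```python
import time


def backoff_delays(initial_s=1.0, factor=2.0, max_s=60.0):
    """Yield an endless series of retry delays: 1, 2, 4, ... capped at max_s."""
    delay = initial_s
    while True:
        yield delay
        delay = min(delay * factor, max_s)


def connect_with_retry(connect, max_attempts=5, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is exhausted.

    connect is a placeholder callable that raises OSError (for example
    ConnectionRefusedError) on failure and returns a session on success.
    """
    last_error = None
    for attempt, delay in zip(range(max_attempts), backoff_delays()):
        try:
            return connect()
        except OSError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(delay)  # wait before the next attempt
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real waiting.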