Background
It probably doesn't take 60s to send or receive a NEXT or PROCEED control message on a working connection. Heartbeat messages are used to quickly detect whether the connection is valid; if it is not, a new TCP connection is created. Very high timeouts such as 60s increase the time it takes to determine that a connection is bad. Halving the timeout should be safe and should detect dropped TCP connections faster.
Why do this
This mostly benefits listening tentacles, since their "pooled connections" are tested with heartbeat messages (NEXT/PROCEED) when they are retrieved from the pool. Reducing the timeout means a dead connection is detected in 30s instead of 60s.
The polling service side also does these control messages, although not on connections that have been sitting idle. Lower timeouts on these messages mean that if the connection fails during the heartbeat, the polling service can more readily terminate the TCP connection and reconnect.
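The pooled-connection check described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `\r\n` framing, the function names `connection_is_alive` and `take_from_pool`, and the `connect` callback are all hypothetical and chosen only for the example.

```python
import socket
from collections import deque
from typing import Callable, Deque

HEARTBEAT_TIMEOUT_SECONDS = 30.0  # the value this PR proposes (previously 60s)


def connection_is_alive(sock: socket.socket,
                        timeout: float = HEARTBEAT_TIMEOUT_SECONDS) -> bool:
    """Send NEXT and wait for PROCEED; False on timeout, close, or error."""
    expected = b"PROCEED\r\n"  # assumed framing, for illustration only
    try:
        # Note: the timeout applies to each socket operation, not the whole exchange.
        sock.settimeout(timeout)
        sock.sendall(b"NEXT\r\n")
        received = b""
        while len(received) < len(expected):
            chunk = sock.recv(len(expected) - len(received))
            if not chunk:  # peer closed the connection
                return False
            received += chunk
        return received == expected
    except OSError:  # socket.timeout is a subclass of OSError
        return False


def take_from_pool(pool: Deque[socket.socket],
                   connect: Callable[[], socket.socket],
                   timeout: float = HEARTBEAT_TIMEOUT_SECONDS) -> socket.socket:
    """Return the first pooled connection that passes the heartbeat, else a new one."""
    while pool:
        sock = pool.popleft()
        if connection_is_alive(sock, timeout):
            return sock
        sock.close()  # dead connection: discard it and try the next one
    return connect()
```

In this sketch each dead pooled connection costs up to one full timeout before the caller falls back to a new connection, which is why halving the timeout shortens the wait directly.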
Impact on TCP retransmissions:
Currently, with the 60s timeout, a connection that enters TCP retransmission results in:
Windows doing 5 retransmissions (Windows is configured for 5) with backoffs of 0.2 + 0.6 + 1.4 + 3 + 6.2, for a total of 11.4s. (This assumes Windows follows the Linux backoffs; the Elastic docs suggest the total time is 6s.)
Linux doing 7 retransmissions with backoffs of 0.2 + 0.6 + 1.4 + 3 + 6.2 + 12.6 + 25.4, for a total of 49.4s in retransmission.
Switching this to 30s would not change the number of TCP retransmissions on Windows. On Linux it would reduce the count to 6 retransmissions (0.2 + 0.6 + 1.4 + 3 + 6.2 + 12.6), for a total of about 24s. In other words, we would skip the one retransmission that would otherwise happen about 25s later.
However, because this timeout applies to the heartbeat used to detect whether the TCP connection is still valid, a failure causes us to create a new TCP connection, which results in further retransmissions.
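The retransmission arithmetic above can be checked with a short script. It is only a sketch: it sums the listed backoff values the same way the text does, and the helper name `retransmissions_within` and its parameters are hypothetical. Real TCP stacks derive their retransmission timers from measured round-trip times, so actual figures will vary.

```python
from itertools import accumulate
from typing import Optional, Sequence, Tuple

# Backoff values quoted in this description, in seconds.
BACKOFFS_SECONDS = [0.2, 0.6, 1.4, 3, 6.2, 12.6, 25.4]


def retransmissions_within(timeout_seconds: float,
                           max_attempts: Optional[int] = None,
                           backoffs: Sequence[float] = BACKOFFS_SECONDS
                           ) -> Tuple[int, float]:
    """Return (attempts, total_seconds) for retransmissions that fit in the timeout."""
    attempts, total = 0, 0.0
    # max_attempts=None means "use every listed backoff".
    for running_total in accumulate(backoffs[:max_attempts]):
        if running_total > timeout_seconds:
            break
        attempts, total = attempts + 1, running_total
    return attempts, round(total, 1)


print(retransmissions_within(60, max_attempts=5))  # Windows, 60s: (5, 11.4)
print(retransmissions_within(30, max_attempts=5))  # Windows, 30s: (5, 11.4)
print(retransmissions_within(60))                  # Linux, 60s:   (7, 49.4)
print(retransmissions_within(30))                  # Linux, 30s:   (6, 24.0)
```

Under this accounting only the Linux figure changes when the timeout is halved, since the Windows retransmissions all complete well inside 30s.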
Total latency
This change would reduce the maximum latency of transferring ~25 bytes from around 60s to around 30s.
How to review this PR
Quality :heavy_check_mark:
Pre-requisites
[ ] I have read How we use GitHub Issues for help deciding when and where it's appropriate to make an issue.
[ ] I have considered informing or consulting the right people, according to the ownership map.
[ ] I have considered appropriate testing for my change.