We use Fabio as a load balancer in a Nomad setup with Consul and some backend services (Node.js http servers). With millions of requests coming in each month, we see that about 1750 requests get a 502 response. That is roughly 0.01%, so a very low number. Nevertheless, we'd like to solve this issue.
Reading through the issues, we found a couple of similar-looking ones, for example https://github.com/fabiolb/fabio/issues/721 and https://github.com/fabiolb/fabio/issues/716.
We typically find these log lines in Fabio at the time of 502 errors:
These indicate a TCP RST packet, meaning the backend had already closed the connection. We figured that it has something to do with mismatched keep-alive timeouts on the two sides. See these two articles for a better explanation: https://shuheikagawa.com/blog/2019/04/25/keep-alive-timeout/ and https://docs.apigee.com/api-platform/troubleshoot/runtime/502-bad-gateway. Even though the load balancers are different, they describe the same type of problem.
We tried a setup where we set `proxy.keepalivetimeout` to 20s for Fabio and `server.keepAliveTimeout` to 30s for our backend service. This way the load balancer should always be the one to close the connection, not the backend service. However, we found that all TCP connections were still ending up in a `TIME-WAIT` state on the backend service, indicating that the backend service initiated the closing of the socket. No matter what configuration we tried for Fabio, it didn't work. The backend service was always initiating closure.

Upon further investigation, it seems that the Node.js http server does some extra work when the `keepAliveTimeout` is hit: it also destroys the socket (https://github.com/nodejs/node/blob/45b5ca810a16074e639157825c1aa2e90d60e9f6/lib/_http_server.js#L587). This behaviour is not found in Fabio when you set `proxy.keepalivetimeout`. Fabio just keeps the socket around, and eventually the backend service closes the connection because it hits its own `keepAliveTimeout`.

Additional testing did give us some good results though. We found that we have to set `IdleConnTimeout`* of the `http.Transport` to signal closure on the Fabio side after the timeout. When configured this way (plus the timeouts mentioned above), we noticed that the TCP connections on the backend services were no longer ending up in a `TIME-WAIT` state but in a `CLOSE-WAIT` state, indicating that Fabio initiated closure. However, `IdleConnTimeout` is not configurable in Fabio.

I am looking for additional insights and feedback. Are we even on the right path? I have prepared a PR that makes setting `IdleConnTimeout` possible through `proxy.idleconntimeout`.

Thank you.
*See https://go.dev/src/net/http/transport.go, line 994 (`closeConnIfStillIdle`).