fabiolb / fabio

Consul Load-Balancing made simple
https://fabiolb.net
MIT License

Experiencing 502's #862

Closed aal89 closed 2 years ago

aal89 commented 2 years ago

We use Fabio as a load balancer in a Nomad setup with Consul and some backend services (Node.js `http` servers). With millions of requests coming in each month, we see about 1,750 requests per month getting 502'ed. Percentage-wise this is roughly 0.01%, so a very low number. Nevertheless we'd like to solve this issue.

Reading through the issues we found a couple of similar-looking ones, for example:

- https://github.com/fabiolb/fabio/issues/721
- https://github.com/fabiolb/fabio/issues/716

We typically find these log lines in Fabio at the time of 502 errors:

```
http: proxy error: read tcp IP:33374->IP:24218: read: connection reset by peer
http: proxy error: EOF
```

These are indicators of a TCP RST packet, meaning the backend had already killed the connection. We figured it had something to do with incorrectly configured keep-alives on both sides. See these two articles for a better explanation: https://shuheikagawa.com/blog/2019/04/25/keep-alive-timeout/ and https://docs.apigee.com/api-platform/troubleshoot/runtime/502-bad-gateway. Even though the load balancers there are different, they describe the same type of problem: the backend closes an idle keep-alive connection at the same moment the load balancer reuses it, and the in-flight request fails.
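
To make the race concrete, here is a minimal, self-contained Go sketch (not Fabio's code; the timeout, loop count, and request shape are made up for illustration). It reuses pooled keep-alive connections against a backend with a deliberately tight idle timeout, so some requests race the server's close:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

func main() {
	// Backend that closes idle keep-alive connections quickly, similar to a
	// Node server with a short server.keepAliveTimeout.
	backend := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			io.Copy(io.Discard, r.Body)
			io.WriteString(w, "ok")
		}))
	backend.Config.IdleTimeout = 100 * time.Millisecond
	backend.Start()
	defer backend.Close()

	client := backend.Client()
	errs := 0
	for i := 0; i < 200; i++ {
		// A POST body is not replayable, so Go's transport surfaces the
		// error instead of silently retrying on a fresh connection -- the
		// same position a proxy forwarding a request body is in.
		resp, err := client.Post(backend.URL, "text/plain", strings.NewReader("x"))
		if err != nil {
			errs++
		} else {
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close() // return the connection to the pool for reuse
		}
		// Sleep right around the backend's idle timeout, so the next request
		// races the server closing the pooled connection.
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Printf("%d/200 requests failed with proxy-style errors\n", errs)
}
```

The failure count is timing-dependent and can be zero, because Go's transport also notices peer closes asynchronously and evicts dead connections from the pool; that is exactly why the problem only shows up at very low rates in production.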

We tried a setup where we set proxy.keepalivetimeout to 20s for Fabio and server.keepAliveTimeout to 30s on our backend service. This way the load balancer should always be the one to kill the connection, not the backend service. However, we found that all TCP connections on the backend service were still ending up in the TIME-WAIT state, indicating that the backend service initiated the closing of the socket (TIME-WAIT lands on whichever side sends the first FIN). No matter what configuration we tried for Fabio, the backend service was always initiating closure.
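
For reference, the Fabio side of that attempt would look something like this in fabio.properties (using the property name mentioned above; the Node value in the comment is the same 30s expressed in milliseconds):

```
# fabio.properties: make Fabio's keep-alive timeout shorter than the backend's
proxy.keepalivetimeout = 20s

# Node.js backend equivalent, for comparison (value in milliseconds):
#   server.keepAliveTimeout = 30000
```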

Upon further investigation it seems that the Node.js http server actually does some extra work when keepAliveTimeout is hit: it also destroys the socket (https://github.com/nodejs/node/blob/45b5ca810a16074e639157825c1aa2e90d60e9f6/lib/_http_server.js#L587). This behaviour is not found in Fabio when you set proxy.keepalivetimeout. Fabio just keeps the idle socket around, and eventually the backend service kills the connection because it hits its own keepAliveTimeout.

Additional testing did give us some good results though. We found that we have to set IdleConnTimeout* on the http.Transport so that Fabio signals closure after the timeout. When configured this way (plus the timeouts mentioned above), we noticed that the TCP connections on the backend services no longer ended up in the TIME-WAIT state but rather in CLOSE-WAIT, indicating that Fabio initiated closure. However, IdleConnTimeout is not configurable in Fabio.
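
A minimal sketch of the Transport-level setting in question (a plain net/http reverse proxy, not Fabio's actual wiring; the listen address and backend URL are placeholders):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Placeholder backend address; in Fabio this would come from Consul.
	upstream, err := url.Parse("http://127.0.0.1:24218")
	if err != nil {
		log.Fatal(err)
	}

	proxy := httputil.NewSingleHostReverseProxy(upstream)
	proxy.Transport = &http.Transport{
		// Close pooled upstream connections from the proxy side after 20s
		// of idleness, comfortably before a backend with a 30s keep-alive
		// timeout would destroy them.
		IdleConnTimeout: 20 * time.Second,
	}

	log.Fatal(http.ListenAndServe(":9999", proxy))
}
```

With this setting the proxy sends the first FIN, which matches the CLOSE-WAIT we observed on the backend side.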

I'm looking for additional insights and feedback. Are we even on the right path? I have prepared a PR that makes IdleConnTimeout settable through proxy.idleconntimeout.
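
Assuming the PR lands as described, configuration would presumably look like this (the 20s value is just an example):

```
proxy.idleconntimeout = 20s
```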

Thank you.

*See https://go.dev/src/net/http/transport.go L994 (closeConnIfStillIdle)

nathanejohnson commented 2 years ago

@aal89 can we close this one?

aal89 commented 2 years ago

Yes, we configured our keep-alives properly (on both the LB and the upstream services) and the 502s disappeared.