docker-archive / dockercloud-haproxy

HAproxy image that autoreconfigures itself when used in Docker Cloud
https://cloud.docker.com/

Old process hanging around after reloads #172

Closed: markvr closed this issue 7 years ago

markvr commented 7 years ago

I've been doing some testing around the reliability of reconfiguration when services change. From reading the docs, when haproxy reloads, a new process is created with the new config and attaches to the socket. When all TCP connections to the old process have finished, the old process also dies.
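As far as I can tell, the reload happening under the hood amounts to something like this (the paths and exact invocation are my guess, not the actual code):

```python
import subprocess

# Illustrative only: a "-sf" soft reload starts a new haproxy that takes over
# the listening sockets and asks the old PIDs to finish their existing
# connections before exiting.
old_pids = open("/var/run/haproxy.pid").read().split()
subprocess.Popen(["haproxy", "-f", "/etc/haproxy/haproxy.cfg",
                  "-p", "/var/run/haproxy.pid", "-sf"] + old_pids)
```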

I've seen situations where the old haproxy process hangs around longer than expected (at least several minutes), with browsers still connected to it. This results in a 500 page because, after Docker redeploys the service, the original backend containers no longer exist.

This is presumably because HTTP keepalives are keeping the TCP connections open?

I can see that we could give existing connections a grace period in which to finish, but at some point anything still hanging will have broken anyway, and so needs to be killed.

Basically, how about adding a configurable timeout to the reload that kills the old process off eventually?

I'm not really a python developer, but I've had a dig around in the code, and it doesn't look like it would be too tricky - the main file appears to be https://github.com/docker/dockercloud-haproxy/blob/e52edde30e03403834f909ca0db7368af8cecffd/haproxy/helper/update_helper.py

I'll have a go at implementing this and send a PR if it sounds like a sensible idea?

tifayuki commented 7 years ago

@markvr

The stopping and restarting of HAProxy is described here: https://cbonte.github.io/haproxy-dconv/1.6/management.html#4

We are currently using -sf for a soft reload. It can lead to the 5xx issue if the application container is destroyed and recreated with a new IP address. However, if the reload is triggered by a scale up, for example, the old connections may not be affected. If we switch to a hard reload, all the old connections will be dropped without exception.

It is more of a trade-off deciding which type of reload to use, but I think we could add an environment variable to let the user switch between soft and hard reloads.
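Something roughly like this (the variable name is just an example, not final):

```python
import os
import subprocess

# Sketch only: choose between a soft reload ("-sf", old process finishes its
# existing connections) and a hard reload ("-st", old process is terminated
# and all old connections are dropped). The variable name is a placeholder.
stop_flag = "-st" if os.getenv("RELOAD_MODE", "soft") == "hard" else "-sf"

def reload_haproxy(old_pids):
    return subprocess.Popen(["haproxy", "-f", "/etc/haproxy/haproxy.cfg",
                             "-p", "/var/run/haproxy.pid", stop_flag] + old_pids)
```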

markvr commented 7 years ago

I've implemented this with a RELOAD_TIMEOUT environment variable; I'll tidy it up and send you a PR tomorrow. My logic is:

- -1: use "-st"
- 0: use "-sf" (default, and current behaviour)
- >0: use "-sf", but with a timeout in the python that kills the old process if it still exists when the timeout fires. This could be set reasonably high, but at least means the old process will definitely get removed eventually.
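Roughly what I have in mind (just a sketch with made-up helper names, not the actual PR code):

```python
import os
import subprocess
import threading

RELOAD_TIMEOUT = int(os.getenv("RELOAD_TIMEOUT", "0"))

def _kill_if_still_running(process):
    # Only does anything if the old haproxy is still alive when the timer fires.
    if process.poll() is None:
        process.kill()

def reload_haproxy(old_process, old_pids):
    # -1 drops old connections immediately; 0 and above finish them gracefully.
    stop_flag = "-st" if RELOAD_TIMEOUT == -1 else "-sf"
    new_process = subprocess.Popen(["haproxy", "-f", "/etc/haproxy/haproxy.cfg",
                                    "-p", "/var/run/haproxy.pid", stop_flag] + old_pids)
    if RELOAD_TIMEOUT > 0 and old_process is not None:
        # Give existing connections RELOAD_TIMEOUT seconds, then kill the old
        # process if it still hasn't exited on its own.
        timer = threading.Timer(RELOAD_TIMEOUT, _kill_if_still_running, args=(old_process,))
        timer.daemon = True
        timer.start()
    return new_process
```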

I've also got a discussion going on about this at http://discourse.haproxy.org/t/new-connection-handling-when-reloading/1049/4

I think this issue is also related to HTTP keepalives, and the default timeout client 50000, which keeps TCP connections open (and connected to out-of-date processes) longer than needed. I haven't tested lowering timeout http-keep-alive yet, but from the haproxy docs I think that will help prevent clients sticking to out-of-date instances.

tifayuki commented 7 years ago

RELOAD_TIMEOUT looks good to me.

Reading your post again, I think that when your new connection encountered a 500 error, it doesn't necessarily mean the connection was handled by the old haproxy process. Instead, it could be that the connection was handled by the new haproxy, but in the meantime, since your application was still bootstrapping, you received a 500 error from your new container.

Haproxy is reloaded on the event that your app container has started, but that doesn't necessarily mean your application is ready to serve requests, and that could be where the problem comes from.

markvr commented 7 years ago

I understand what you mean, but in my case it is definitely being handled by the old process, because I was also checking the haproxy "stats" page, which shows the PIDs of the haproxy processes and the backend tasks. Also, I'm doing rolling updates on a service with 3 replicas and healthchecks, so it should always have at least two working backends.

markvr commented 7 years ago

See PR at #174