QubitProducts / bamboo

HAProxy auto configuration and auto service discovery for Mesos Marathon
Apache License 2.0

Support graceful shutdown of haproxy #156

Open drewrobb opened 9 years ago

drewrobb commented 9 years ago

The purpose of this feature is to allow bamboo to shut down haproxy gracefully in response to a SIGTERM. In my particular use case we have bamboo running in docker behind an AWS ELB. The goal is to provide a health check that lets the ELB remove bamboo from rotation before bamboo actually exits, so that we can redeploy bamboo without losing any requests. My particular way of shutting down bamboo is to run docker stop on the container. By default this sends a SIGTERM followed by a SIGKILL 10 seconds later, so a value of GraceSeconds < 10s is reasonable, but the value should be large enough for an upstream balancer to detect that bamboo is unhealthy. Some changes to the dockerization were necessary so that bamboo would actually get the signal: child processes of bash or sh need to be run with 'exec'.

I've been testing this by running the container, then running something like:

while true; do curl --connect-timeout 2 --max-time 2 localhost:2000/health  -sL -w "%{http_code} %{time_total}  " -o /dev/null; echo $(($(date +%s%N)/1000000)); sleep 0.2; done

Then I run docker stop $(docker ps | grep bamboo | awk '{print $1}'); the HTTP status should change from 200 to 503 for 5 seconds.

I'm not sure if people would want this on by default, but GraceSeconds is configurable, and setting it to 0 allows immediate exit. Also, port 2000 is used for health checking. This could be problematic if not running in docker, so maybe my changes to the haproxy_template should be commented out by default.

drewrobb commented 9 years ago

There is a tiny issue here: when using GraceSeconds, the old haproxy process will continue to bind on port 80 after a restart. In this case the kernel distributes requests between the processes rather than sending them only to the newest process, as we would want. If servers change sufficiently quickly, you might get 503s. I'm looking at a workaround: sending SIGTTOU and SIGUSR1 to the old haproxy PID to force it to unbind after restarting with the -sf option. The haproxy docs say that this should be necessary, but I'm seeing otherwise.

j1n6 commented 9 years ago

This is an interesting and valid use case. My only concern is that I would prefer Bamboo not to shut down HAProxy; keeping HAProxy running would help with upgrading and maintenance.

timoreimann commented 9 years ago

@drewrobb: IIUC, your intention is to provide a way to disable Bamboo smoothly for maintenance without any downtime. I'm just wondering whether you could tell the ELB to take whichever Bamboo/HAProxy combo you want to run maintenance on out of balancing, thus avoiding any control flow in which Bamboo stops HAProxy.

I am in no way familiar with ELB, so let me know if there's a blocker on the AWS end that I am missing.

drewrobb commented 9 years ago

@timoreimann, yes, that is my intention. Your idea would work as well; I wanted to implement it this way so that I didn't have to worry about that process. In fact, I'm running bamboo on marathon as well (on a subset of mesos slaves), so I don't have any special procedure to decommission a mesos slave.

@activars it would be possible to have the signal handler shut down haproxy only on a SIGTERM, and shut down just bamboo on a SIGINT (although that convention would be a bit weird?). Another idea: have grace seconds = -1 by default, and in that case don't shut down haproxy, just shut down bamboo?

timoreimann commented 9 years ago

@drewrobb: How do you make sure that you do not lose any requests when Bamboo shuts down HAProxy (presumably gracefully) on the load balancer end? Does ELB come with some kind of mechanism to retransmit packets to other hosts if one is deemed unavailable?

drewrobb commented 9 years ago

@timoreimann I use the /health endpoint as defined in this PR as a health check for the ELB, with settings such that bamboo will be marked unhealthy in less than GraceSeconds as defined here. I also made sure that the mesos setting docker_stop_timeout is large enough. Thus, the ELB will stop sending requests to bamboo well before it has shut down. It is important to note that during the shutdown process the bamboo instance keeps handling requests as usual; it just stops getting new requests from the ELB once it is marked unhealthy. This approach wouldn't work for long-running connections such as websockets, but it works for any request that takes less than a certain amount of time (GraceSeconds minus the time it takes for bamboo to be marked unhealthy).

mlerner commented 8 years ago

This would be great to have, @drewrobb!

KidkArolis commented 8 years ago

Cleaning up old PRs, feel free to reopen if still relevant.