True zero downtime haproxy reloads

bsideup commented 9 years ago

We can improve how haproxy configuration is reloaded: http://engineeringblog.yelp.com/2015/04/true-zero-downtime-haproxy-reloads.html

(Please note that -sf is mentioned, but it didn't always help)

The same method was applied in AWS OpsWorks: https://github.com/aws/opsworks-cookbooks/pull/40

timoreimann commented 9 years ago

I don't think there's much Bamboo can or should do to the extent that the blog post describes. The solution Yelp outlines requires OS-specific, low-level queuing techniques to prevent SYN packets from getting rejected. I wouldn't want Bamboo to fiddle with my OS tools directly -- that's something a system administrator (or configuration management tools driven by one) should do.

The method applied to the AWS cookbook is specifically not the one Yelp chose: Dropping SYN packets deliberately can cause delays of up to 3 seconds, which is the very reason Yelp decided against it and went with queuing instead.

Finally, the Yelp approach only works for outgoing traffic, which doesn't cover the standard Marathon/Bamboo/HAProxy scenario where we deal with incoming traffic, so it's actually not particularly helpful here.

From my perspective, the only thing the Bamboo project could do is hint at the problem in the documentation and describe ways to deal with it. Off the top of my head, I can think of two:

- Ignore the problem. If you don't operate at Yelp's scale, you may be able to simply not care about occasional errors on the clients' end. Or the set of clients is limited and they retry on connection errors anyway.
- Shape traffic from an intermediary. As sketched out in the blog post, one could put another intermediary node between the clients and HAProxy and have it queue up outgoing, HAProxy-bound SYN packets during reloads.
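
To make the first option concrete, here is a minimal sketch of a client that retries on connection errors, so that a connection refused or reset during a reload window is simply tried again. The function name `connect_with_retry`, its parameters, and the backoff values are hypothetical and chosen for illustration only; nothing here comes from Bamboo or HAProxy.

```python
import socket
import time


def connect_with_retry(host, port, attempts=3, delay=0.1,
                       connect=socket.create_connection, sleep=time.sleep):
    """Open a TCP connection, retrying a few times on connection errors.

    Hypothetical helper: `connect` and `sleep` are injectable so the retry
    logic can be exercised without a real network.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return connect((host, port))
        except OSError as exc:
            # ConnectionRefusedError and ConnectionResetError are OSError
            # subclasses; these are the errors a client may see mid-reload.
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: delay, 2*delay, 4*delay, ...
                sleep(delay * (2 ** attempt))
    raise last_error
```

Whether retrying is safe depends on the client: a failed TCP connect can be retried freely, but retrying a request that was already partially sent is only safe if the request is idempotent.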

lclarkmichalek commented 9 years ago

I'm going to agree with @timoreimann on this. As you can see in the Yelp blog post, they ship the reloading out to a separate script, which is what I would recommend a Bamboo user do; setting your ReloadCommand to qudisk_protect ..... wouldn't cause any problems.
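
As a rough sketch of what such a separate script could look like: hold new connections, reload HAProxy, then release them, making sure the release runs even if the reload fails. The function `protected_reload` and the `EXAMPLE_COMMANDS` values are assumptions for illustration. The `nl-qdisc-add` invocations are modeled on the ones in the Yelp post and presuppose the qdisc setup described there, so verify them against that post and your own system before relying on them.

```python
import subprocess


def protected_reload(plug_cmd, reload_cmd, release_cmd, run=subprocess.run):
    """Run plug -> reload -> release, always releasing at the end.

    Hypothetical wrapper: each command is an argv list supplied by the
    caller; `run` is injectable so the ordering can be tested without root.
    """
    run(plug_cmd, check=True)
    try:
        run(reload_cmd, check=True)
    finally:
        # Release even when the reload fails, so held SYN packets
        # do not stay queued indefinitely.
        run(release_cmd, check=True)


# Assumed example values, modeled on the Yelp blog post (not verified here).
# They target the loopback device, which matches Yelp's setup of clients
# talking to an HAProxy on the same host.
_QDISC = ["nl-qdisc-add", "--dev=lo", "--parent=1:4", "--id=40:",
          "--update", "plug"]
EXAMPLE_COMMANDS = {
    "plug_cmd": _QDISC + ["--buffer"],
    "reload_cmd": ["service", "haproxy", "reload"],
    "release_cmd": _QDISC + ["--release-indefinite"],
}
```

A script built around this would call `protected_reload(**EXAMPLE_COMMANDS)` and be referenced from Bamboo's ReloadCommand setting. As noted earlier in the thread, this only protects traffic that passes through the plugged queue, so it does not by itself cover incoming traffic from other hosts.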

I'm going to close this issue, for the sole reason of reducing the number of open issues that aren't particularly actionable. If/When I get around to rolling out qdisc protected reloads at Qubit I might write some documentation on it, but until then, pull requests welcome :)

QubitProducts / bamboo

True zero downtime haproxy reloads #152