fabiolb / fabio

Consul Load-Balancing made simple
https://fabiolb.net
MIT License

Graceful shutdown starting with the health check #815

Open dnrce opened 3 years ago

dnrce commented 3 years ago

I'm very impressed by how easy Fabio is to set up, but I'm unsure of the best way to run it in an HA configuration.

I'm currently running Fabio as a Docker container with one instance per Docker host. Then upstream of that I have a dumber load balancer (an AWS ALB) that routes traffic to the available Fabio instances.

The problem I have is that if a Fabio container needs to be rescheduled, there's no good way to gracefully deregister it from the upstream load balancer while that happens. It'll simply disappear and then reappear, and during the interim the upstream LB hits a dead end with any traffic sent this container's way.

One solution would be to have only the health check start responding with non-200 during the shutdown delay, rather than have Fabio as a whole stop accepting new requests. This would let Fabio continue to handle requests sent its way while the upstream LB health check's state caught up. As long as the shutdown delay were longer than the LB health check timeout, it would be deregistered before Fabio actually stopped processing requests.
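To make the proposal concrete, here is a minimal sketch in Go of a health endpoint that flips to 503 once draining starts while the traffic handler keeps answering. It is not Fabio's implementation: the `drainer` type, the `/health` path, and the helper names are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// drainer records whether shutdown has begun.
// Hypothetical type for illustration; not part of Fabio.
type drainer struct{ draining atomic.Bool }

// healthStatus is the status code the health endpoint should report.
func healthStatus(draining bool) int {
	if draining {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// newMux wires a health endpoint that reflects the drain state and a
// catch-all handler, standing in for proxied traffic, that ignores it.
func newMux(d *drainer) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(healthStatus(d.draining.Load()))
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	return mux
}

// statusOf sends an in-memory GET to the handler and returns the status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	d := &drainer{}
	mux := newMux(d)

	// In a real process this would happen first thing in the SIGTERM handler.
	d.draining.Store(true)

	fmt.Println("health:", statusOf(mux, "/health"), "traffic:", statusOf(mux, "/app"))
}
```

In a real process, the signal handler would set the flag, wait out the shutdown delay, and only then stop the listener (for example with `http.Server.Shutdown`). With an ALB, the delay likely needs to cover the health check interval multiplied by the unhealthy threshold count, not just a single check.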

Would this make sense? Or am I simply doing it wrong? Is there an existing feature or an alternative arrangement I'm missing?

Thanks!

far-blue commented 3 years ago

Sounds like a good idea to me. I have a similar setup: traditional load balancers on the edge of my network, managed by our DC provider (HAProxy or Nginx or something), and all they can 'see' is the Fabio health status. They check every 30s, I think. It would be good if the health status could change more than 30s before Fabio stops accepting new connections, so the LBs have a chance to notice.

We use Nomad, and I know it can deregister services and then wait before actually migrating, updating, or terminating them, so it would also be nice if Fabio could monitor its own service record in Consul (which we have Nomad managing, not Fabio) and tie into that functionality.

ketzacoatl commented 2 years ago

@dnrce,

> The problem I have is that if a Fabio container needs to be rescheduled, there's no good way to gracefully deregister it from the upstream load balancer while that happens. It'll simply disappear and then reappear, and during the interim the upstream LB hits a dead end with any traffic sent this container's way.

If you need to keep as many requests as possible, a reasonable strategy is to start at the top of the chain and work your way down.

E.g., if you have ALB --> Target Group --> Fabio --> Other stuff

And if Fabio is a cluster of EC2 instances, you roll each Fabio instance by first de-registering the EC2 instance from the target group, then updating the Fabio container/instance/whatever, and then putting the EC2 instance back into the target group. That way, you disconnect the "soon to be down" instance before it goes down, and you don't have a lag between when the service goes down and when the upstream figures that out via the health check poll.
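The ordering described above can be sketched in Go as follows. The `TargetGroup` interface and `Updater` hook are hypothetical stand-ins for whatever tooling talks to the load balancer and restarts Fabio; no real AWS API calls are shown, and the in-memory `logGroup` exists only to make the call order visible.

```go
package main

import "fmt"

// TargetGroup is a hypothetical abstraction over the upstream LB's registry.
type TargetGroup interface {
	Deregister(instance string) error
	Register(instance string) error
}

// Updater is a hypothetical hook that updates Fabio on one instance.
type Updater func(instance string) error

// rollInstances handles one instance at a time: take it out of rotation,
// update it, then put it back. It stops at the first error, leaving the
// failed instance out of rotation.
func rollInstances(tg TargetGroup, update Updater, instances []string) error {
	for _, inst := range instances {
		if err := tg.Deregister(inst); err != nil {
			return fmt.Errorf("deregister %s: %w", inst, err)
		}
		if err := update(inst); err != nil {
			return fmt.Errorf("update %s: %w", inst, err)
		}
		if err := tg.Register(inst); err != nil {
			return fmt.Errorf("register %s: %w", inst, err)
		}
	}
	return nil
}

// logGroup is an in-memory TargetGroup that records the order of calls.
type logGroup struct{ calls []string }

func (g *logGroup) Deregister(i string) error {
	g.calls = append(g.calls, "deregister "+i)
	return nil
}

func (g *logGroup) Register(i string) error {
	g.calls = append(g.calls, "register "+i)
	return nil
}

func main() {
	g := &logGroup{}
	update := func(i string) error {
		g.calls = append(g.calls, "update "+i)
		return nil
	}
	if err := rollInstances(g, update, []string{"i-a", "i-b"}); err != nil {
		panic(err)
	}
	for _, c := range g.calls {
		fmt.Println(c)
	}
}
```

A real `Deregister` would probably also need to wait until the load balancer has finished draining in-flight connections (an ALB target group has a configurable deregistration delay) before the update step begins.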