I've never been able to design a zero downtime solution unfortunately. The Faktory protocol is stateful so we can't just swap out backends using a reverse proxy. Existing client connections need to re-authenticate with the new server. Essentially you're right with the steps. On a good day, you can probably get those steps to take no more than 30 seconds; bringing down everything is the safest option.
@mperham Does using an external persistent REDIS_URL via Faktory Enterprise change anything around being able to have a brief overlap as ECS containers are drained/swapped? Right now I have ECS deploys for Faktory configured with `min: 0%` and `max: 100%` to ensure only one instance is ever running when we deploy, but I'd love to make that `min: 100%` and `max: 200%` like all of our other services with zero-downtime deployments. If it helps at all, we could also pause/unpause all queues around deployments - I would just love for services to be able to continue enqueuing jobs during the process.
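For reference, here is a minimal sketch of how those percentages map onto the ECS deployment configuration, using the AWS SDK for Go v2. The cluster and service names are hypothetical, and flipping the values to 100%/200% is exactly what would briefly run two Faktory servers side by side during a deploy:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ecs"
	"github.com/aws/aws-sdk-go-v2/service/ecs/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	svc := ecs.NewFromConfig(cfg)

	// min 0% / max 100% tells ECS to stop the old Faktory task before it
	// starts the new one, so only a single server ever runs, at the cost
	// of a brief window with no server at all.
	_, err = svc.UpdateService(context.Background(), &ecs.UpdateServiceInput{
		Cluster: aws.String("my-cluster"), // hypothetical cluster name
		Service: aws.String("faktory"),    // hypothetical service name
		DeploymentConfiguration: &types.DeploymentConfiguration{
			MinimumHealthyPercent: aws.Int32(0),
			MaximumPercent:        aws.Int32(100),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```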
I've noticed the same client reconnection issues as @ibrahima: the Ruby clients reconnect fine, but our Node ones do not (we'll be testing the Go client for the same issue soon), and we plan on forking/patching the clients to get that working. TLS support in the Node/Go Faktory clients is something else I'd like to get working as well (it works fine for Ruby).
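Until the non-Ruby clients reconnect on their own, producers can paper over a brief server restart by retrying a push on a fresh connection. A minimal sketch with the Go client is below; the retry count, backoff, and the `SendOrder` job type/argument are made up for illustration:

```go
package main

import (
	"log"
	"time"

	faktory "github.com/contribsys/faktory/client"
)

// pushWithRetry opens a fresh connection per attempt, so a Faktory server
// restart (e.g. during an ECS deploy) costs a few retries instead of a
// lost job. The attempt count and linear backoff are arbitrary choices.
func pushWithRetry(job *faktory.Job, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		var cl *faktory.Client
		cl, err = faktory.Open() // dials and re-authenticates using FAKTORY_URL
		if err == nil {
			err = cl.Push(job)
			cl.Close()
			if err == nil {
				return nil
			}
		}
		time.Sleep(time.Duration(i+1) * time.Second)
	}
	return err
}

func main() {
	job := faktory.NewJob("SendOrder", 12345) // hypothetical job type and argument
	if err := pushWithRetry(job, 5); err != nil {
		log.Fatalf("could not enqueue job after retries: %v", err)
	}
}
```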
> Does using an external persistent REDIS_URL via Faktory Enterprise change anything
Nope. Faktory connections don't automatically migrate from old to new so you'd need to reestablish all connections.
Closing as dupe of #372
I noticed that https://github.com/contribsys/faktory/issues/372 is an open issue. In our Faktory installation we're deploying to AWS ECS, and when we do upgrades it seems like, depending on how you handle the deployment, there is a chance of jobs getting lost - e.g. if you temporarily have two servers running at the same time, some jobs might go to the "older" one and then get lost when the newer one takes over, if they aren't persisted to disk in time. (I'm also not exactly sure how it behaves if two servers mount the same persistent volume.) Right now we're pretty early in development, so we've just been doing upgrades live, but I'm guessing that's not the best approach.
Is the current right way to do a Faktory upgrade to shut your site down temporarily - that is, bring everything down, upgrade the Faktory server, and then bring it all back up?
It might be nice to document this somewhere, but I'm not sure where yet. It might depend on how the server is deployed, though the above overall procedure is probably general to most deployment types.
I'm realizing that #372 probably doesn't help in a containerized setup, because the new server runs in a new container and so isn't spawned by a parent process that could share its port or socket. With a load balancer in front you get behavior similar to the "reused socket" situation, but that still feels non-ideal because some jobs might go to the "old" server instead of the new one. And since Faktory isn't designed to have multiple servers running at once (e.g. #447), there's probably no way around that.
Thinking out loud... if you could tell a server to stop accepting jobs once the replacement is online, and have the clients retry operations a few times on failure, then you might be able to achieve something like 0-downtime deploys. But that certainly complicates things further, and it kinda feels like it's better to just minimize the downtime rather than try to handle correctness in these scenarios.