contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

Handling server upgrades? #461

Closed · ibrahima closed this issue 2 months ago

ibrahima commented 6 months ago

I noticed that https://github.com/contribsys/faktory/issues/372 is an open issue. In our Faktory installation, we're deploying to AWS ECS. When we do upgrades, it seems like there is a chance of jobs getting lost depending on how you handle the deployment: if two servers are temporarily running at the same time, some jobs might be pushed to the "older" one and then lost when it's replaced, if they weren't persisted to disk in time. (I'm also not exactly sure how it behaves if two servers mount the same persistent volume.) We're pretty early in development, so right now we've just been doing upgrades live, but I'm guessing that's not the best approach.

Is the current recommended way to do a Faktory upgrade to shut your site down temporarily? To be more explicit:

  1. Put your main application into a maintenance mode so that it can't queue new Faktory jobs
  2. Quiet your workers so that they don't pick up new tasks
  3. Wait for workers to finish any in-progress work (see the sketch after this list)
  4. Do the upgrade
  5. (potentially) restart your clients/workers so that they can reconnect to the new server (I found that Ruby clients/workers reconnect by themselves after a while, but the Python ones seem to crash and burn)
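
A minimal sketch of the drain step (3), using the Faktory Go client (github.com/contribsys/faktory/client). `faktory.Open`, `Client.Info`, and `Client.Close` are the client's public API, but the `faktory` → `queues` key path inside the INFO payload is an assumption about the payload layout; dump it once on your server version before relying on it. Note also that queue sizes don't count jobs already checked out by workers, so keep a grace period after the queues read empty.

```go
// Rough sketch of the "wait for workers to drain" step using the Faktory Go client.
package main

import (
	"fmt"
	"log"
	"time"

	faktory "github.com/contribsys/faktory/client"
)

// queuesDrained assumes INFO returns {"faktory": {"queues": {"default": 0, ...}}}.
// Verify this key path against your server version before relying on it.
func queuesDrained(info map[string]interface{}) bool {
	fak, ok := info["faktory"].(map[string]interface{})
	if !ok {
		return false
	}
	queues, ok := fak["queues"].(map[string]interface{})
	if !ok {
		return false
	}
	for _, size := range queues {
		// encoding/json decodes JSON numbers into float64
		if n, ok := size.(float64); ok && n > 0 {
			return false
		}
	}
	return true
}

func main() {
	cl, err := faktory.Open() // honors FAKTORY_URL / FAKTORY_PROVIDER
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	for {
		info, err := cl.Info()
		if err != nil {
			log.Fatal(err)
		}
		if queuesDrained(info) {
			fmt.Println("queues empty; safe to stop the old server")
			return
		}
		time.Sleep(2 * time.Second)
	}
}
```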

It might be nice to document this somewhere, but I'm not sure where yet. It might depend on how the server is deployed, though the overall procedure above is probably general to most deployment types.

I'm realizing that #372 probably doesn't help in a containerized setup, because the new server runs in a new container and so isn't spawned by a parent process that could share its port or socket. With a load balancer in front you get behavior similar to the "reused socket" situation, but that still feels non-ideal because jobs might go to the "old" server instead of the new one. And since Faktory isn't designed to have multiple servers running (e.g. #447), there's probably no way around that.

Thinking out loud... if you could tell a server to stop accepting jobs once the replacement is online, and have the clients retry operations a few times on failure, then you might be able to achieve something like zero-downtime deploys. But that certainly complicates things further, and it kinda feels like it's better to just minimize the downtime rather than try to guarantee correctness in these scenarios.
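
To make the client-side half of that idea concrete, here's a rough sketch in Go. `faktory.Open`, `faktory.NewJob`, and `Client.Push` are real calls from github.com/contribsys/faktory/client; the retry/backoff policy itself is purely illustrative, not something Faktory provides.

```go
// Rough sketch of client-side push retries around a deploy window.
package main

import (
	"log"
	"time"

	faktory "github.com/contribsys/faktory/client"
)

// pushWithRetry opens a fresh connection per attempt, so a push that fails
// against the dying server can succeed once the replacement is accepting
// connections. The linear backoff is purely illustrative.
func pushWithRetry(job *faktory.Job, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		var cl *faktory.Client
		if cl, err = faktory.Open(); err == nil {
			err = cl.Push(job)
			cl.Close()
			if err == nil {
				return nil
			}
		}
		time.Sleep(time.Duration(i+1) * time.Second)
	}
	return err
}

func main() {
	if err := pushWithRetry(faktory.NewJob("SomeJob", 1, 2, "hello"), 5); err != nil {
		log.Fatal(err)
	}
}
```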

mperham commented 6 months ago

I've never been able to design a zero-downtime solution, unfortunately. The Faktory protocol is stateful, so we can't just swap out backends behind a reverse proxy: existing client connections need to re-authenticate with the new server. Essentially you're right about the steps. On a good day you can probably get those steps down to no more than 30 seconds; bringing everything down is the safest option.

shuber commented 6 months ago

@mperham Does using an external persistent REDIS_URL via Faktory Enterprise change anything about being able to have a brief overlap as ECS containers are drained/swapped? Right now I have ECS deploys for Faktory configured with min: 0% and max: 100% to ensure only one instance is ever running when we deploy, but I'd love to make that min: 100% and max: 200% like all of our other services with zero-downtime deployments. If it helps at all, we could also pause/unpause all queues around deployments; I'd just love for services to be able to continue enqueuing jobs during the process.
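
For reference, the "min: 0% / max: 100%" setting described above maps to the ECS service's deploymentConfiguration. Below is a sketch of applying it with the AWS SDK for Go v2; the cluster and service names are placeholders, the rest is the standard ECS UpdateService API.

```go
// Sketch of the "min: 0% / max: 100%" deploy setting via the AWS SDK for Go v2.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ecs"
	"github.com/aws/aws-sdk-go-v2/service/ecs/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ecs.NewFromConfig(cfg)

	// minimumHealthyPercent 0 / maximumPercent 100 means ECS stops the old
	// Faktory task before starting the new one, so only one server ever runs
	// at a time (at the cost of a brief outage during each deploy).
	_, err = client.UpdateService(ctx, &ecs.UpdateServiceInput{
		Cluster: aws.String("my-cluster"), // placeholder
		Service: aws.String("faktory"),    // placeholder
		DeploymentConfiguration: &types.DeploymentConfiguration{
			MinimumHealthyPercent: aws.Int32(0),
			MaximumPercent:        aws.Int32(100),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```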

I've noticed the same client reconnection issues as @ibrahima: Ruby clients reconnect fine, but our Node ones do not (we'll be testing that with the Go client soon as well), and I plan on forking/patching the clients to get that working. TLS support in the Node/Go Faktory clients is something else I'd like to get working (it works fine for Ruby).

mperham commented 6 months ago

> Does using an external persistent REDIS_URL via Faktory Enterprise change anything

Nope. Faktory connections don't automatically migrate from old to new, so you'd need to re-establish all connections.

mperham commented 2 months ago

Closing as dupe of #372