alondahari closed 11 months ago
Faktory doesn't support multi-server clustering. You can set up Redis replicas for data redundancy but only one Faktory can be operational at a time.
Oh that's not great. It's a bit scary to have one point of failure like that. I believe we contacted you about this earlier and you said it would be possible to have two servers behind a load balancer, but I'll double check that.
You can, but they need to run completely independently. For instance, if you have two with the same configuration, they will both fire cron jobs. There's no concept of primary/standby or cluster leadership.
Hi again, we updated our configuration to have a "primary" and a "secondary" server, with only the primary scheduling cron jobs. This seems to work for scheduling jobs, but for some reason we're seeing jobs complete and then being enqueued again, about half an hour after the job completed. Any ideas about what would be the cause of that? The servers are sharing one Redis (ElastiCache) instance.
Any help would be most appreciated!
I think 30 minutes is the default job reservation time. If you fetch job A from server 1 but don't ack it within 30 minutes, Faktory will re-enqueue job A for re-execution under the assumption that the worker failed somehow.
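To make the timing concrete, here is a minimal sketch of the reservation check described above. It assumes the 30-minute default is expressed as 1800 seconds and that it can be overridden per job through a `reserve_for` field in the job payload; `build_job` and `reservation_expired` are hypothetical helpers for illustration, not part of any Faktory client.

```python
import uuid

# Assumed default reservation window: 30 minutes, in seconds.
DEFAULT_RESERVE_FOR = 1800


def build_job(jobtype, args, reserve_for=DEFAULT_RESERVE_FOR):
    """Build a job payload dict (hypothetical helper, illustrative only)."""
    return {
        "jid": uuid.uuid4().hex,     # unique job ID
        "jobtype": jobtype,
        "args": list(args),
        "reserve_for": reserve_for,  # seconds the worker has to ACK or FAIL
    }


def reservation_expired(fetched_at, now, reserve_for=DEFAULT_RESERVE_FOR):
    """True once a fetched job has gone un-ACKed for the whole window.

    At that point the server assumes the worker died and re-enqueues the job.
    """
    return now - fetched_at >= reserve_for
```

A job that really does need longer than 30 minutes would be pushed with a larger `reserve_for`; a job that finishes quickly but still reappears after 30 minutes points to an ACK that never reached the server that handed the job out.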
Interesting. I wonder why it's not sending ack then. Also doesn't seem like it should be caused by a multi-server setup.
Is it possible your worker is FETCHing from Server 1 but ACKing to Server 2?
It is possible, but wouldn't it update the same Redis record?
Faktory keeps a list of outstanding jobs in memory, keyed by JID. If it's not in its records, it'll return an error. You can argue that Faktory should go to Redis for this operation but as Faktory is not designed to run in parallel, that's the way it works today.
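The behavior described above can be reproduced with a toy model. This is a sketch under stated assumptions, not Faktory's actual implementation: `ToyFaktory` is a hypothetical class, a plain Python list stands in for the shared Redis queue, and each instance keeps its own in-memory table of outstanding jobs keyed by JID.

```python
class ToyFaktory:
    """Toy server: shared queue, but a private in-memory working set."""

    def __init__(self, shared_queue):
        self.queue = shared_queue  # stands in for the shared Redis instance
        self.working = {}          # per-process memory: JID -> deadline

    def fetch(self, now, reserve_for=1800):
        if not self.queue:
            return None
        jid = self.queue.pop(0)
        self.working[jid] = now + reserve_for
        return jid

    def ack(self, jid):
        # Only the server that handed the job out knows about this JID.
        if jid not in self.working:
            raise KeyError(f"unknown JID {jid}")
        del self.working[jid]

    def reap(self, now):
        """Re-enqueue every job whose reservation has expired."""
        expired = [j for j, deadline in self.working.items() if deadline <= now]
        for j in expired:
            del self.working[j]
            self.queue.append(j)
        return expired


def demo():
    shared = ["job-a"]
    server1, server2 = ToyFaktory(shared), ToyFaktory(shared)
    jid = server1.fetch(now=0)           # FETCH from server 1
    try:
        server2.ack(jid)                 # ACK sent to server 2
        acked = True
    except KeyError:
        acked = False                    # server 2 has no record of the job
    requeued = server1.reap(now=1800)    # 30 minutes later on server 1
    return acked, requeued, list(shared)
```

In this model `demo()` shows the ACK being rejected by the second server and the first server putting the job back on the shared queue once the reservation lapses, which matches the "completed, then enqueued again after half an hour" symptom.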
Hmmm, I'm not sure how to proceed here then... Would it be possible to change the implementation there? Do you agree that there should be a capability of having redundancy on the Faktory server?
Faktory does not support clustering. You can use Redis replicas to get a real-time backup. I'm not planning any changes here; redundancy is worthwhile but would add a tremendous amount of complexity to the system. Are you trying to solve a real problem or an imaginary one? Is Faktory reliability really an issue for you?
It's not an issue we faced, but I wouldn't call it imaginary. We have redundancy with almost every part of our infrastructure, being proactive about mitigating issues rather than waiting until something fails.
Faktory reliability is an issue because, if the server goes down, enqueueing jobs will fail, which will result in errors affecting our users directly.
Our other option is to have a fallback of saving jobs to another database if the enqueueing fails, but I would consider that a hacky workaround. Open to any suggestions you might have.
The one option you can roll yourself is two shards. You can run two independent Faktorys in two different AZs/DCs and have your clients use some algorithm to choose their nearest Faktory. If that Faktory is unavailable, they can push to the backup Faktory.
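The client-side half of the two-shard idea could look roughly like the sketch below. Everything here is an assumption for illustration: `push_with_failover` is a hypothetical helper, and each entry in `pushers` is any callable that sends a job to one Faktory server and raises `ConnectionError` when that server is unreachable (for example, a thin wrapper around your client library's push call).

```python
def push_with_failover(job, pushers):
    """Push to the first reachable shard, in preference order.

    Returns the index of the shard that accepted the job, so the caller can
    log or measure how often the backup is being used.
    """
    errors = []
    for index, push in enumerate(pushers):
        try:
            push(job)
            return index
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure, try the next shard
    raise ConnectionError(
        f"all {len(pushers)} Faktory servers unavailable: {errors}"
    )


# Usage with stand-in pushers: the nearest shard is down, the backup is up.
backup_jobs = []


def nearest_shard(job):
    raise ConnectionError("nearest Faktory is unreachable")


def backup_shard(job):
    backup_jobs.append(job)


used = push_with_failover({"jobtype": "SendEmail", "args": [42]},
                          [nearest_shard, backup_shard])
```

Because the two servers are fully independent, workers would need to fetch from both shards and send each ACK back to the shard the job came from, and each shard would need its own Redis rather than a shared one.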
But at the end of the day, Faktory is a primary datastore. Just as you don't really expect your app to be usable when postgres goes down, you should expect similar semantics here. I haven't had a report of Faktory crashing in over a year now? So I hope you can test and verify Faktory's reliability.
We have an enterprise account with a Faktory server deployed on our EKS cluster with 2 replicas behind a load balancer. It seems to work great, except that our scheduled jobs are being scheduled once per Faktory server.
How do we make sure they are only scheduled once? The duplication is causing some issues.
Thanks!