contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

Duplicated scheduled jobs with multiple faktory servers #447

Closed · alondahari closed this issue 7 months ago

alondahari commented 9 months ago

We have an enterprise account with a Faktory server deployed on our EKS cluster with 2 replicas behind a load balancer. It seems to work great, except that our scheduled jobs are being scheduled once per Faktory server.

How do we make sure they are only scheduled once? The duplication is causing some issues.

Thanks!

mperham commented 9 months ago

Faktory doesn't support multi-server clustering. You can set up Redis replicas for data redundancy but only one Faktory can be operational at a time.

alondahari commented 9 months ago

Oh, that's not great. It's a bit scary to have a single point of failure like that. I believe we contacted you about this earlier and you said it would be possible to have two servers behind a load balancer, but I'll double-check that.

mperham commented 9 months ago

You can, but they need to run completely independently. For instance, if you have two with the same configuration, they will both fire cron jobs. There's no concept of primary/standby or cluster leadership.
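(For reference: Faktory Enterprise cron entries are defined in the server's TOML config under conf.d, and every server that loads an entry fires it on schedule. A minimal sketch of such an entry, with a hypothetical NightlyReport job type; for it to fire once, the file has to be deployed to only one server:)

```toml
# conf.d/cron.toml -- ship to ONE server only; every replica that
# loads this file will enqueue the job on schedule.
[[cron]]
  schedule = "0 2 * * *"        # every night at 02:00
  [cron.job]
    type = "NightlyReport"      # hypothetical job type
    queue = "default"
```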

alondahari commented 7 months ago

Hi again, we updated our configuration to have a "primary" and a "secondary" server, with only the primary scheduling cron jobs. This seems to work for scheduling, but for some reason we're seeing jobs complete and then get enqueued again, about half an hour after the job completed. Any ideas about what could be causing that? The servers are sharing one Redis (ElastiCache) instance.

Any help would be most appreciated!

mperham commented 7 months ago

I think 30 minutes is the default job reservation time. If you fetch job A from server 1 but don't ack it within 30 minutes, Faktory will re-enqueue job A for re-execution under the assumption that the worker failed somehow.
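If a job legitimately runs longer than 30 minutes, the push can declare a longer reservation via the job's reserve_for field (an integer number of seconds). A minimal sketch with the official Go client, assuming a hypothetical LongReport job type:

```go
package main

import (
	"log"

	faktory "github.com/contribsys/faktory/client"
)

func main() {
	// Open() connects using the FAKTORY_URL environment variable.
	client, err := faktory.Open()
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// "LongReport" is a hypothetical job type for illustration.
	job := faktory.NewJob("LongReport", 42)
	job.Queue = "default"
	// Give the worker 2 hours before Faktory assumes it died and
	// re-enqueues the job (the default reservation is 1800s).
	job.ReserveFor = 2 * 60 * 60

	if err := client.Push(job); err != nil {
		log.Fatal(err)
	}
}
```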

alondahari commented 7 months ago

Interesting. I wonder why it's not sending the ack then. It also doesn't seem like something that would be caused by a multi-server setup.

mperham commented 7 months ago

Is it possible your worker is FETCHing from Server 1 but ACKing to Server 2?

alondahari commented 7 months ago

It is possible, but wouldn't it update the same Redis record?

mperham commented 7 months ago

Faktory keeps a list of outstanding jobs in memory, keyed by JID. If a job isn't in that server's records, the ACK returns an error. You can argue that Faktory should go to Redis for this operation, but as Faktory is not designed to run in parallel, that's the way it works today.
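In other words, the reservation lives in the memory of the server that handed out the job, so each worker process has to FETCH and ACK against the same server. A sketch of the implication with faktory_worker_go, assuming a hypothetical SomeJob type: point each worker's FAKTORY_URL at one concrete server address rather than the load balancer.

```go
package main

import (
	"context"

	worker "github.com/contribsys/faktory_worker_go"
)

func someJob(ctx context.Context, args ...interface{}) error {
	// Returning nil ACKs the job over the same connection pool the
	// manager FETCHed it from, so the reservation is found in that
	// server's in-memory table.
	return nil
}

func main() {
	mgr := worker.NewManager()
	mgr.Register("SomeJob", someJob) // hypothetical job type
	mgr.Concurrency = 10
	// The manager connects via FAKTORY_URL; set it to one concrete
	// server (e.g. tcp://faktory-primary:7419), NOT the load
	// balancer, or FETCH and ACK can land on different servers.
	mgr.Run()
}
```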

alondahari commented 7 months ago

Hmmm, I'm not sure how to proceed here then... would it be possible to change the implementation there? Do you agree that the Faktory server should be able to run with redundancy?

mperham commented 7 months ago

Faktory does not support clustering. You can use Redis replicas to get a real-time backup. I'm not planning any changes here; redundancy is worthwhile but would add a tremendous amount of complexity to the system. Are you trying to solve a real problem or an imaginary one? Is Faktory reliability really an issue for you?

alondahari commented 7 months ago

It's not an issue we've faced, but I wouldn't call it imaginary. We have redundancy in almost every part of our infrastructure; we'd rather be proactive about mitigating issues than wait until something fails.

Faktory reliability is an issue because if the server goes down, enqueueing jobs will fail, which will result in errors affecting our users directly.

Our other option is to fall back to saving jobs in another database if enqueueing fails, but I would consider that a hacky workaround. Open to any suggestions you might have.

mperham commented 7 months ago

The one option you can roll yourself is two shards. You can run two independent Faktorys in two different AZs/DCs and have your clients use some algorithm to choose their nearest Faktory. If that Faktory is unavailable, they can push to the backup Faktory.
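A hedged sketch of that client-side fallback in Go, assuming pre-opened connections to each shard (pushWithFallback and SomeJob are hypothetical names):

```go
package main

import (
	"fmt"
	"log"

	faktory "github.com/contribsys/faktory/client"
)

// pushWithFallback tries each shard in order of preference (nearest
// first) and returns after the first successful push.
func pushWithFallback(job *faktory.Job, shards []*faktory.Client) error {
	var lastErr error
	for _, cl := range shards {
		if err := cl.Push(job); err != nil {
			lastErr = err // shard down or unreachable; try the next
			continue
		}
		return nil
	}
	return fmt.Errorf("all Faktory shards failed: %w", lastErr)
}

func main() {
	// Open() reads FAKTORY_URL; a real two-shard setup would open
	// one connection per shard, nearest first.
	nearest, err := faktory.Open()
	if err != nil {
		log.Fatal(err)
	}
	defer nearest.Close()

	job := faktory.NewJob("SomeJob", 1) // hypothetical job type
	if err := pushWithFallback(job, []*faktory.Client{nearest}); err != nil {
		log.Fatal(err)
	}
}
```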

But at the end of the day, Faktory is a primary datastore. Just as you don't expect your app to be usable when Postgres goes down, you should expect similar semantics here. I haven't had a report of Faktory crashing in over a year now, so I hope you can test and verify Faktory's reliability.