contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

Best practices for Faktory worker library command retry behavior #431

Closed andycox closed 11 months ago

andycox commented 1 year ago

As I mentioned on the Faktory Gitter, I'm in the process of implementing a JVM worker library after finding the existing ones abandoned and/or insufficient for our needs. My hope is to eventually open source it, but it's not quite ready for prime time.

I have much of the basic functionality working and am starting to look at hardening the error handling for things like transient network or Faktory server outages. I wanted to get your thoughts on best practices for whether, and how, worker libraries should support retrying Faktory commands, such as in the following example:

  1. Worker pulls a job off a queue and starts working
  2. Either the Faktory server dies unexpectedly (unlikely, I know, but possible) or the route to the server becomes unavailable
  3. The worker finishes the job and tries to ACK it (or it fails and the worker tries to FAIL it)

With the worker unable to reach the Faktory server, obviously the ACK/FAIL command cannot be successfully sent. Should the worker library automatically retry the command up to N times before giving up? Or should that decision be left up to the application, assuming the library lets the application know the command has failed (e.g., via an exception or return code)? Or should we just leave it up to the Faktory server to eventually time out the job and release it for re-execution?

Assuming the job is idempotent, it seems to come down mostly to a question of how long we're willing to wait for the job to be retried. Is there anything I'm missing?

mperham commented 1 year ago

The worker does not need to make any effort to restart jobs or handle worker crashes. You are welcome to retry network operations in the case of a network blip, but Faktory will handle crashed workers automatically: jobs will time out after 30 minutes and be re-enqueued. Yes, if the job finished before the ACK failed, then the job will execute twice, as you correctly note.
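
For illustration, the "retry on a network blip" option could look roughly like the sketch below. It is only a minimal sketch under assumptions: `CommandRetry` and `withRetry` are hypothetical names, not part of any existing Faktory client library, and the ACK/FAIL command is modeled as a `Callable` that throws `IOException` when the server is unreachable. The helper makes a bounded number of attempts with exponential backoff and then rethrows, so the application still learns that the command failed.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper: bounded retry for a Faktory command such as ACK or FAIL. */
final class CommandRetry {
    private CommandRetry() {}

    /**
     * Runs the command, retrying only on IOException.
     * Makes at most maxAttempts attempts in total and sleeps
     * baseDelayMillis * 2^(attempt - 1) between consecutive attempts.
     * If every attempt fails, the last IOException is rethrown.
     */
    static <T> T withRetry(Callable<T> command, int maxAttempts, long baseDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return command.call();
            } catch (IOException e) {
                last = e; // assumed to signal a network-level failure
                if (attempt < maxAttempts) {
                    // Exponential backoff before the next attempt.
                    Thread.sleep(baseDelayMillis << (attempt - 1));
                }
            }
        }
        // Give up: surface the failure to the caller.
        throw last;
    }
}
```

A worker might wrap its ACK call as `CommandRetry.withRetry(() -> sendAck(jid), 3, 200)`, where `sendAck` is again a hypothetical method. If the retries are exhausted, nothing more is required for correctness: the reservation eventually times out on the server and the job is re-enqueued, which is why the job itself should be idempotent.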