andycox closed this issue 1 year ago
The worker does not need to make any effort to restart jobs or handle worker crashes. You are welcome to retry network operations in the case of a network blip, but Faktory handles crashed workers automatically: jobs time out after 30 minutes and are re-enqueued. Yes, if the job finished before the ACK failed, then the job will execute twice, as you correctly note.
As I mentioned on the Faktory Gitter, I'm in the process of implementing a JVM worker library after finding the existing ones abandoned and/or insufficient for our needs. My hope is to eventually open source it, but it's not quite ready for prime time.
I have much of the basic functionality working and am starting to look at robustifying the error handling for things like ephemeral network or Faktory server outages. I wanted to get your thoughts on best practices for how/whether worker libraries should support retrying Faktory commands, such as in the following example:
With the worker unable to reach the Faktory server, the ACK/FAIL command obviously cannot be sent successfully. Should the worker library automatically retry the command up to N times before giving up? Or should that decision be left to the application, assuming the library lets the application know the command has failed (e.g., via an exception or return code)? Or should we just leave it to the Faktory server to eventually time out the job and release it for re-execution?
Assuming the job is idempotent, it seems to come down mostly to a question of how long we're willing to wait for the job to be retried. Is there anything I'm missing?
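To make the first option concrete, here is a minimal sketch of a bounded retry with a doubling delay. It is only an illustration under stated assumptions: `AckRetry`, `FaktoryCommand`, and `sendWithRetry` are hypothetical names, not part of any existing Faktory client, and `send()` stands in for whatever actually writes the ACK or FAIL to the socket.

```java
import java.io.IOException;

final class AckRetry {
    /** Hypothetical stand-in for sending one command (e.g. ACK or FAIL) to the Faktory server. */
    @FunctionalInterface
    interface FaktoryCommand {
        void send() throws IOException;
    }

    /**
     * Tries the command up to maxAttempts times, sleeping between attempts
     * and doubling the delay after each failure.
     *
     * @return true if the command was sent, false if every attempt failed
     */
    static boolean sendWithRetry(FaktoryCommand command, int maxAttempts, long initialDelayMillis)
            throws InterruptedException {
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                command.send();
                return true;
            } catch (IOException e) {
                if (attempt == maxAttempts) {
                    // Out of attempts: report failure and let the caller decide what to do.
                    return false;
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
        return false; // Only reached when maxAttempts <= 0.
    }
}
```

When `sendWithRetry` returns `false`, the library could surface that to the application (the second option) or simply drop it and let the server's timeout re-enqueue the job (the third option). Either way, a job that finished before a lost ACK may run again.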