contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

if faktory work crashes job sits in `busy` until timeout and then never requeues #453

Closed Deekor closed 6 months ago

Deekor commented 1 year ago

I've noticed this behavior: if the Faktory worker process crashes (out of memory, for example), any jobs that worker was processing sit in busy until the reservation timeout and then never enter the retry queue. Is this intended behavior? I'm worried about jobs being lost if a crash happens. Is there a config I'm missing?

mperham commented 1 year ago

~I believe it doesn't go into retry, it should be directly re-enqueued.~

My mistake, it is handled as a job failure and should go into retries: https://github.com/contribsys/faktory/blob/b3e739a6c10164b3bdd3bf34dda9405964bd4137/manager/working.go#L223

Deekor commented 1 year ago

Interesting. I had a long-running job today that definitely didn't.

The reservation time was 3 hours. The job sat in busy for 2 hours after the process died (an hour in) and never made it to retry.

mperham commented 1 year ago

I'll look into it next week. If you can give me a simple reproduction, that would help.

mperham commented 1 year ago

I was able to reproduce a simple crashing scenario. The jobs moved from Busy to Retries and then to Enqueued as expected.

Deekor commented 1 year ago

You're right. Turns out the job had a short circuit in it (on retries) that I didn't notice. It finishes so fast that I didn't even see it in busy or in the queue.

Deekor commented 9 months ago

Ok, in a similar vein to this. The worker that was running my CampaignStartWorker job crashed. The job retry short circuited. This is all intended behavior as discussed above.

However, the job still sits in the busy state, running on a ghost process. The job had a custom reservation and unique_for, which is blocking me from re-queueing the same job, even though the job isn't actually in a queue or running on a real process.

  faktory_options reserve_for: 10800
  faktory_options custom: { unique_for: 3.hours.to_i }
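
For illustration, the timing implied by these two options can be worked out in plain Ruby. This is only a sketch: `crash_after` is an assumed value taken from the earlier comment (the process died about an hour in), and it assumes the uniqueness lock is taken when the job is pushed, the job is fetched immediately, and the lock is held until the job succeeds or `unique_for` elapses.

```ruby
# Timeline sketch using the values from this thread; all names are illustrative.
RESERVE_FOR = 10_800        # seconds, from `faktory_options reserve_for: 10800`
UNIQUE_FOR  = 3 * 60 * 60   # seconds, same value as `3.hours.to_i` without ActiveSupport

# Assumption: the worker process dies this many seconds after fetching the job.
crash_after = 60 * 60

# Time the job stays in Busy after the crash, until the reservation expires.
stuck_in_busy = RESERVE_FOR - crash_after

# Time the uniqueness lock can still block a duplicate push after the crash
# (assumes the lock started at roughly the same moment as the reservation).
lock_remaining = UNIQUE_FOR - crash_after

puts "stuck in Busy: #{stuck_in_busy / 3600.0} h"
puts "unique lock remaining: #{lock_remaining / 3600.0} h"
```

Under these assumptions both windows come out to two hours, which matches the two hours in busy reported earlier in the thread.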


mperham commented 9 months ago

If the entire worker process crashes, you'll see the job sit until the reservation timeout passes. This is because Faktory can't tell if the job is still executing (and the network is bad) or if the process died and its pending jobs can be retried soon.
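
The reasoning above can be modeled with a toy simulation. This is a sketch, not Faktory's actual implementation: `Reservation` and `reap` are hypothetical names, and the only thing it assumes is that the server decides purely from the clock, treating an expired reservation as a failure and moving the job to retries.

```ruby
# Toy model of a reservation; not Faktory's real data structures.
Reservation = Struct.new(:jid, :reserved_at, :reserve_for) do
  # The server's only signal is the clock: a reservation expires once
  # reserve_for seconds have elapsed since the job was fetched.
  def expired?(now)
    now >= reserved_at + reserve_for
  end
end

# Hypothetical reaper: moves expired reservations from busy to retries
# and returns the reservations that are still live.
def reap(busy, retries, now)
  expired, live = busy.partition { |r| r.expired?(now) }
  retries.concat(expired)
  live
end

busy = [Reservation.new("abc123", 0, 10_800)]
retries = []

# One hour in, the worker crashes. The server sees no signal, so nothing moves.
busy = reap(busy, retries, 3_600)
puts "t=3600:  busy=#{busy.size} retries=#{retries.size}"

# Once the reservation expires, the job is treated as failed and can be retried.
busy = reap(busy, retries, 10_800)
puts "t=10800: busy=#{busy.size} retries=#{retries.size}"
```

In this model a shorter `reserve_for` is the only thing that shortens the wait after a crash, at the cost of risking a duplicate execution if the original worker was merely slow or unreachable.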