contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/
Other
5.73k stars 227 forks source link

Add information about retries to Job Payload metadata #374

Closed pbrisbin closed 2 years ago

pbrisbin commented 3 years ago

Apologies if this exists and I just couldn't find it in the documentation...

Would it be possible to add something like retries_remaining: n to the Job Payload?

We notify Bugsnag of exceptions from Jobs Consumer, but if the Job retries (and succeeds) there's no reason for us to be notified of these, so it's contributing to alert fatigue. If we could see the retries-remaining in the Job Payload, we could trivially elect not to notify Bugsnag until/unless retries_remaining: 0.

mperham commented 3 years ago

You’d use job.failure.retry_count > 5 to only be notified after 5 retries.

https://github.com/contribsys/faktory/blob/master/client/job.go#L18

mperham commented 3 years ago

And BugSnag and other error libraries can often be configured to do this automatically.

pbrisbin commented 3 years ago

Hmm, knowing how many retries have occurred is not the same as knowing how many are remaining.

If I notified when job.failure.retry_count > 5 but then enqueued with retries 3 I'd never get notified, or with retries 6 and I'd notify too early. This also won't work for a Job with retries -1, since job.failure would (presumably) be null on the first (and only) time being consumed -- although I could use that fact itself to make my decision.

However, I also see a job.failure.next_at in the Go code; is it true that a Job that would retry will have a non-null there and a Job that has exhausted its retries will have null?

pbrisbin commented 3 years ago

BugSnag and other error libraries can often be configured to do this automatically

I author the Bugsnag library in Haskell, so I'll be working on this regardless of where :)

pbrisbin commented 3 years ago

Also, it seems you consider job.go documentation by example. Could it be worthwhile to link to that source from the wiki? I always go there to understand what data I have available in these payloads, but it seems it's not comprehensive.

mperham commented 3 years ago

Hmm, knowing how many retries have occurred is not the same as knowing how many are remaining.

I know. I was trying to get you some sort of solution that works today without any changes.

is it true that a Job that would retry will have a non-null there and a Job that has exhausted its retries will have null?

I don't know; you'd need to try it and see what happens. This code was written 3-4 years ago and I don't recall the semantics.

Could it be worthwhile to link to that source from the wiki?

Absolutely, I appreciate the feedback. It's easy to over-document things and scare off new users but it's just as easy to under-document and miss various edge cases, like yours. I always leave my wiki pages publicly editable so people can contribute improvements, that code would be an excellent thing to link on the Job Payload wiki page.

pbrisbin commented 3 years ago

OK, I think I have a solution by comparing job.failure.retry_count to job.retry.

shouldNotifyBugsnag :: Job -> Bool
shouldNotifyBugsnag = isFinalAttempt

isFinalAttempt :: Job -> Bool
isFinalAttempt job = jobFailureRetryCount job >= jobRetry job

jobFailureRetryCount :: Job -> Int
jobFailureRetryCount = maybe 0 failureRetryCount . jobFailure

The approach:

Can you confirm my understanding?

pbrisbin commented 3 years ago

I could probably wrap up that code into a jobRetriesRemaining :: Job -> Int, meaning exactly what I was looking for is derivable from extant data. So (as long as I'm not missing something), let's close this Issue.

pbrisbin commented 3 years ago

Alright, I've done some testing and here is the answer in case it's useful to others.

Consider a Job enqueued with retry: 2 that constantly errors. I see the following behavior as it's consumed and retried:

Iteration Description job.retry job.failure job.failure.retry_count desired retries remaining
1 First time consumed 2 null N/A 2
2 First retry 2 present 0 1
3 Second retry 2 present 1 0

You'll notice that retry - retry_count does not trivially give you a retries remaining. This is due to the fact that failure is null the first time through and has a retry_count 0 on the first retry.

I can (will) make this work now that I've seen the real behavior, but I think you should still consider centralizing a computation of retries_remaining in the server and including it in the job payload to clients. Reasons being:

  1. As I tried to describe, the above is nuanced and easy to get wrong, it seems bad to expect every client to re-implement it
  2. In addition to that, when a job has no retry, I have to assume the Faktory default of 25. This is brittle. If instead the server was handling it, it could reuse its own knowledge of that default.
mperham commented 3 years ago

Thanks, I really appreciate the research and writeup. Would you be willing to send a PR with this implemented? The retry logic is in the manager package. I can do it but it's lower priority than some other things happening now so I'm not sure when I can get to it.

pbrisbin commented 3 years ago

Would you be willing to send a PR with this implemented?

I'll certainly log it in my TODOs to look into. I worry I'd hit some roadblocks in figuring out how to work in a go project. And now that I'll have this in our client code already, I may not be motivated to persevere. But we'll see!

mperham commented 3 years ago

I'll keep this open and implement sometime in the next month or two.