Closed pbrisbin closed 2 years ago
You’d use job.failure.retry_count > 5
to only be notified after 5 retries.
https://github.com/contribsys/faktory/blob/master/client/job.go#L18
And BugSnag and other error libraries can often be configured to do this automatically.
Hmm, knowing how many retries have occurred is not the same as knowing how many are remaining.
If I notified when job.failure.retry_count > 5
but then enqueued with retries 3
I'd never get notified, or with retries 6
and I'd notify too early. This also won't work for a Job with retries -1
, since job.failure
would (presumably) be null
on the first (and only) time being consumed -- although I could use that fact itself to make my decision.
However, I also see a job.failure.next_at
in the Go code; is it true that a Job that would retry will have a non-null there and a Job that has exhausted its retries will have null?
BugSnag and other error libraries can often be configured to do this automatically
I author the Bugsnag library in Haskell, so I'll be working on this regardless of where :)
Also, it seems you consider job.go
documentation by example. Could it be worthwhile to link to that source from the wiki? I always go there to understand what data I have available in these payloads, but it seems it's not comprehensive.
Hmm, knowing how many retries have occurred is not the same as knowing how many are remaining.
I know. I was trying to get you some sort of solution that works today without any changes.
is it true that a Job that would retry will have a non-null there and a Job that has exhausted its retries will have null?
I don't know; you'd need to try it and see what happens. This code was written 3-4 years ago and I don't recall the semantics.
Could it be worthwhile to link to that source from the wiki?
Absolutely, I appreciate the feedback. It's easy to over-document things and scare off new users but it's just as easy to under-document and miss various edge cases, like yours. I always leave my wiki pages publicly editable so people can contribute improvements, that code would be an excellent thing to link on the Job Payload wiki page.
OK, I think I have a solution by comparing job.failure.retry_count
to job.retry
.
shouldNotifyBugsnag :: Job -> Bool
shouldNotifyBugsnag = isFinalAttempt
isFinalAttempt :: Job -> Bool
isFinalAttempt job = jobFailureRetryCount job >= jobRetry job
jobFailureRetryCount :: Job -> Int
jobFailureRetryCount = maybe 0 failureRetryCount . jobFailure
The approach:
job.failure == null
to job.failure.retry_count = 0
retry: 0
or retry: -1
, retry_count
will always be >=
that, and so we notify, as desiredretry_count >= retry
should become true only on the final attempt, which is when we want to notifyCan you confirm my understanding?
I could probably wrap up that code into a jobRetriesRemaining :: Job -> Int
, meaning exactly what I was looking for is derivable from extant data. So (as long as I'm not missing something), let's close this Issue.
Alright, I've done some testing and here is the answer in case it's useful to others.
Consider a Job enqueued with retry: 2
that constantly errors. I see the following behavior as it's consumed and retried:
Iteration | Description | job.retry |
job.failure |
job.failure.retry_count |
desired retries remaining |
---|---|---|---|---|---|
1 | First time consumed | 2 | null | N/A | 2 |
2 | First retry | 2 | present | 0 | 1 |
3 | Second retry | 2 | present | 1 | 0 |
You'll notice that retry - retry_count
does not trivially give you a retries remaining. This is due to the fact that failure
is null the first time through and has a retry_count
0 on the first retry.
I can (will) make this work now that I've seen the real behavior, but I think you should still consider centralizing a computation of retries_remaining
in the server and including it in the job payload to clients. Reasons being:
retry
, I have to assume the Faktory default of 25
. This is brittle. If instead the server was handling it, it could reuse its own knowledge of that default.Thanks, I really appreciate the research and writeup. Would you be willing to send a PR with this implemented? The retry logic is in the manager
package. I can do it but it's lower priority than some other things happening now so I'm not sure when I can get to it.
Would you be willing to send a PR with this implemented?
I'll certainly log it in my TODOs to look into. I worry I'd hit some roadblocks in figuring out how to work in a go project. And now that I'll have this in our client code already, I may not be motivated to persevere. But we'll see!
I'll keep this open and implement sometime in the next month or two.
Apologies if this exists and I just couldn't find it in the documentation...
Would it be possible to add something like
retries_remaining: n
to the Job Payload?We notify Bugsnag of exceptions from Jobs Consumer, but if the Job retries (and succeeds) there's no reason for us to be notified of these, so it's contributing to alert fatigue. If we could see the retries-remaining in the Job Payload, we could trivially elect not to notify Bugsnag until/unless
retries_remaining: 0
.