Open StoneFrog opened 1 year ago
@StoneFrog How about explicitly assigning "missed" label when rescued for the first time, so that we could use or
query, like jobs_database.where(completed_at: nil, dropped_at: nil, failed_at: nil).or(jobs_database.where(completed_at: nil, dropped_at: nil, missed_job: true))? looking for missed jobs is executed every few hours, so having some potential conflicts and risk of duplication and processing the same job twice etc. is very low
@Azdaroth what do you mean with "when rescued"? When failed_at
is set for the first time?
Assuming it was actually a legit non-recoverable failure won't that interfere with retry? i.e. even though regular sidekiq would push that to retry queue with exponential backoff, we will be executing that every time missed job scheduler is executed?
@StoneFrog By "rescued" I mean the one that were picked Missed Jobs Handler.
Maybe it would interfere, but what's the risk here? Some jobs might be executed a couple of times and fail each time I guess (otherwise, it would be executed just once) and it would happen only to the one that were labeled as such.
This of course depends on the frequency of the OOMs but they should not be that common to make more than few % of all jobs (which would already be massive but probably we could live with that).
@Azdaroth But that problem is that Missed Job Handler never picked them.
@StoneFrog that's why we would need to introduce some extra labelling and change the query. Wouldn't that be enough ?
@Azdaroth Maybe I don't fully get you proposal, but you have said that: you want to label it when "rescued" and by rescued you mean the one that were picked Missed Jobs Handler.
But in described case they will never be picked by handler, so they will never be labeled and this query won't change a thing 😛
According to docs:
That being said it works only for the case when OOM would happen on first attempt. If first attempt fails, then
failed_at
will be set. Then on retry OOM happens. Missed Jobs Scheduler uses:to find missed jobs, so as failed_at was assigned - it's never rescheduled.
As I understand the biggest challenge is to figure out which missed job qualifies due to backoff, so perhaps this scenario should be addressed on specific application knowing it's internals. On the other hand whenever it's supposed to be retried we could bump
execute_at
to match next attempt. That way we could easily find jobs that were supposed to be executed already? That being said - we should at least mention that in documentation as right now it can give false confidence in reliability of this gem.