Missed jobs scheduler not working for retries

StoneFrog commented 1 year ago

According to docs:

Missed Jobs Periodical Handler
Recommended especially when you don't use Sidekiq Pro's super_fetch. If you dequeue job from Redis and the process is killed (by OOM, for example) then good luck with having the job finished. However, if the job is stored in Postgres, this is not an issue. You can just look for the jobs that look as if they were missed and re-enqueue them. Periodically.

That being said it works only for the case when OOM would happen on first attempt. If first attempt fails, then failed_at will be set. Then on retry OOM happens. Missed Jobs Scheduler uses:

def missed_jobs(missed_job_policy:)
      jobs_database
        .where(completed_at: nil, dropped_at: nil, failed_at: nil)
        .select { |potentially_missed_job| missed_job_policy.call(potentially_missed_job) }
    end

to find missed jobs, so as failed_at was assigned - it's never rescheduled.

As I understand the biggest challenge is to figure out which missed job qualifies due to backoff, so perhaps this scenario should be addressed on specific application knowing it's internals. On the other hand whenever it's supposed to be retried we could bump execute_at to match next attempt. That way we could easily find jobs that were supposed to be executed already? That being said - we should at least mention that in documentation as right now it can give false confidence in reliability of this gem.

Azdaroth commented 1 year ago

@StoneFrog How about explicitly assigning "missed" label when rescued for the first time, so that we could use or query, like jobs_database.where(completed_at: nil, dropped_at: nil, failed_at: nil).or(jobs_database.where(completed_at: nil, dropped_at: nil, missed_job: true))? looking for missed jobs is executed every few hours, so having some potential conflicts and risk of duplication and processing the same job twice etc. is very low

StoneFrog commented 1 year ago

@Azdaroth what do you mean with "when rescued"? When failed_at is set for the first time?

Assuming it was actually a legit non-recoverable failure won't that interfere with retry? i.e. even though regular sidekiq would push that to retry queue with exponential backoff, we will be executing that every time missed job scheduler is executed?

Azdaroth commented 1 year ago

@StoneFrog By "rescued" I mean the one that were picked Missed Jobs Handler.

Maybe it would interfere, but what's the risk here? Some jobs might be executed a couple of times and fail each time I guess (otherwise, it would be executed just once) and it would happen only to the one that were labeled as such.

This of course depends on the frequency of the OOMs but they should not be that common to make more than few % of all jobs (which would already be massive but probably we could live with that).

StoneFrog commented 1 year ago

@Azdaroth But that problem is that Missed Job Handler never picked them.

Azdaroth commented 1 year ago

@StoneFrog that's why we would need to introduce some extra labelling and change the query. Wouldn't that be enough ?

StoneFrog commented 1 year ago

@Azdaroth Maybe I don't fully get you proposal, but you have said that: you want to label it when "rescued" and by rescued you mean the one that were picked Missed Jobs Handler.

But in described case they will never be picked by handler, so they will never be labeled and this query won't change a thing 😛

BookingSync / sidekiq-robust-job

Missed jobs scheduler not working for retries #21