VEuPathDB / lib-compute-platform

Async compute platform core.

Dead Letter Queue, Retries for Failed Async Jobs, & Dead Letter Admin Methods #52

Open Foxcapades opened 7 months ago

Foxcapades commented 7 months ago

Problem

Async jobs frequently fail due to intermittent network hiccups. Currently, a failed async job is never retried; the job is dead and must be deleted manually.

Proposal

To work around this issue and resolve most network-related job failures, this proposal takes a multi-part approach:

  1. The async platform should be configurable to re-queue and re-attempt an async job n times before giving up on that job.
  2. There should be a dead-letter queue where jobs that failed n times go.
  3. The job metadata should include a try-count value that starts at zero and is incremented with every attempt to execute that job.
  4. When a job fails n times its message will be pushed to the dead-letter queue.
  5. New admin methods will be added to view and pop messages from the dead-letter queue.

1 Configurable Retries

A new option will be added to the async platform that controls how many times a job may be retried before it is considered permanently failed. This option will be read by the job handler; on failure, the job handler will either retry the job or requeue it into the RabbitMQ queue from which it was popped.
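The decision the job handler makes on failure could be sketched roughly as below. This is only an illustration: `RetryConfig`, `maxRetries`, `FailureAction`, and `onJobFailure` are hypothetical names, not part of the existing platform API. The sketch also assumes one particular reading of the option, namely that it counts re-attempts after the first try, and that the try count already includes the attempt that just failed.

```kotlin
// Hypothetical configuration holder; the real option name and location are undecided.
data class RetryConfig(val maxRetries: Int) {
  init {
    require(maxRetries >= 0) { "maxRetries must not be negative" }
  }
}

// What the job handler should do with a job that has just failed.
enum class FailureAction { REQUEUE, DEAD_LETTER }

// tryCount is the number of attempts made so far, including the one that just failed.
// A job is requeued while it still has retries left, otherwise it is dead-lettered.
fun onJobFailure(tryCount: Int, config: RetryConfig): FailureAction =
  if (tryCount <= config.maxRetries) FailureAction.REQUEUE
  else FailureAction.DEAD_LETTER
```

Under these assumptions, a limit of 3 allows four attempts in total: the original try plus three retries. If the option is instead meant to cap total attempts, the comparison would change to `tryCount < config.maxRetries`.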

Decision Points

2 & 4 Dead Letter Queue

A new queue should be added to RabbitMQ for jobs that have failed the configured max number of times. Jobs that are now considered permanently failed will have their message pushed to the dead-letter queue.

Decision Points

3 Job Metadata

The job metadata/message JSON does not allow arbitrary fields, so it will need to be updated to include a try-count field. This field will default to 0 and will be incremented each time the job is attempted. If the job fails, the updated job metadata/message will be pushed to the back of the queue from which it was originally popped.
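A minimal sketch of the try-count bookkeeping, using an in-memory queue in place of RabbitMQ. `JobMessage`, `jobID`, and `attemptNext` are hypothetical names chosen for illustration; only the try-count field itself comes from this proposal. The retry limit is deliberately left out here so the sketch covers only the increment-and-requeue behaviour.

```kotlin
// Hypothetical message shape; jobID is a placeholder for the existing fields.
data class JobMessage(val jobID: String, val tryCount: Int = 0)

// Simulates one pass of the job handler: pop the next message, count the
// attempt, run the job, and push the updated message to the back of the
// queue if the job failed.
// Returns the updated message, or null if the queue was empty.
fun attemptNext(queue: ArrayDeque<JobMessage>, run: (JobMessage) -> Boolean): JobMessage? {
  val popped = queue.removeFirstOrNull() ?: return null
  val attempted = popped.copy(tryCount = popped.tryCount + 1)
  if (!run(attempted)) queue.addLast(attempted)
  return attempted
}
```

For example, with jobs `a` and `b` queued and a `run` callback that returns `false`, one call leaves the queue as `b` followed by `a` with a try count of 1.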

5 Dead-Letter Queue Operations

There should be at least two new methods added to the RabbitMQ wrapper and AsyncPlatform facade.

The first will list all messages on the dead letter queue and push them right back onto the queue after they have been read.

The second will peek the next message with a callback that will return a flag indicating whether the message should be popped from the queue.

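AsyncPlatform.nextDeadLetter { message ->
  // Do something with the message, then decide whether to pop it.
  // The lambda's last expression is its result: true pops the message,
  // false leaves it on the queue.
  shouldPop
}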
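To make the intended behaviour of both methods concrete, here is a rough in-memory sketch. `DeadLetterSketch` and `listDeadLetters` are hypothetical names, messages are plain strings for brevity, and a real implementation would go through the RabbitMQ wrapper rather than a local deque.

```kotlin
// In-memory stand-in for the dead-letter queue, for illustration only.
class DeadLetterSketch(initial: List<String> = emptyList()) {
  private val queue = ArrayDeque(initial)

  // Reads every message and pushes each one straight back, so the queue
  // ends up with the same contents in the same order.
  fun listDeadLetters(): List<String> {
    val seen = mutableListOf<String>()
    repeat(queue.size) {
      val message = queue.removeFirst()
      seen.add(message)
      queue.addLast(message)
    }
    return seen
  }

  // Peeks the next message and pops it only when the callback returns true.
  // Returns false when the queue is empty, otherwise the callback's answer.
  fun nextDeadLetter(shouldPop: (String) -> Boolean): Boolean {
    val message = queue.firstOrNull() ?: return false
    val pop = shouldPop(message)
    if (pop) queue.removeFirst()
    return pop
  }
}
```

Letting the callback decide keeps inspection and removal in a single call, so a caller cannot pop a different message from the one it just examined.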
Foxcapades commented 7 months ago

So apparently there is already a dead-letter queue in the async platform. This means we can skip point 2.

Point 5 will no longer be "dead-letter queue operations"; instead, it will cover adding any missing methods that give visibility into failed jobs in the Postgres database.