[Suggestion] Move jobs between queues, change the job body, custom metadata for a better failure handling

Revisor commented 8 years ago

Hi, this suggestion is connected to #170 in that both concern the handling of failed jobs.

I would like to handle failed jobs as follows:

1. NACK the failed job with an ever-growing delay (#170)
2. If the number of retries is higher than X, move the job to a failure queue (dead letter queue) with a new TTL, so that it can be inspected manually and acted upon
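
To make the intended behaviour concrete, here is a minimal sketch of the decision logic only, in plain Python. It does not talk to Disque; `decide`, `RetryDecision`, `max_retries` (the "X" above), `base_delay` and `max_delay` are names and values I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RetryDecision:
    action: str         # "nack" (retry later) or "dead_letter" (move to the failure queue)
    delay_seconds: int  # delay before redelivery; 0 when dead-lettering


def decide(retries: int, max_retries: int = 5,
           base_delay: int = 2, max_delay: int = 3600) -> RetryDecision:
    """Decide what to do with a job that has failed `retries` times so far.

    All thresholds are hypothetical defaults, not Disque settings.
    """
    if retries > max_retries:
        return RetryDecision("dead_letter", 0)
    # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
    delay = min(base_delay * 2 ** max(retries - 1, 0), max_delay)
    return RetryDecision("nack", delay)
```

With these defaults the delays grow as 2, 4, 8, 16, 32 seconds, and the sixth failure sends the job to the failure queue.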

Neither of these actions is possible in Disque right now, and the workaround of adding a new, copied job loses the job ID as well as the NACK and additional-deliveries counters.

That's why I would like to propose four enhancements (proposals 3 and 4 are alternative solutions to the same problem):

1. Allow NACKing a job with a delay (#170)
2. Allow moving a job to a different queue with a new TTL
3. Allow callers to change the job body
4. OR even better, if feasible: implement custom job metadata support, like the NACK and additional-deliveries counters, but user-defined and mutable

Regarding 3: We currently use the job body to store job metadata, as a workaround for the missing features 1 and 2. We store the original job ID as well as the total number of retries there. It could also be helpful to save, e.g., the exact time and reason the job failed. All of this requires changing the existing job body. Supporting custom, mutable job metadata as a first-class citizen in Disque would be even better.
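
Roughly, the body-as-metadata workaround looks like the sketch below. This is plain Python with no Disque client involved; the JSON envelope layout and the names `wrap` and `record_failure` are assumptions made for illustration, and the job IDs in any usage are placeholders.

```python
import json
import time


def wrap(payload):
    """Wrap the real payload in an envelope that also carries our own metadata."""
    return json.dumps({
        "meta": {"original_id": None, "retries": 0, "failures": []},
        "payload": payload,
    })


def record_failure(body, job_id, reason, now=None):
    """Return a new body with updated metadata, to be re-added as a copied job.

    `job_id` is the ID of the job that just failed. The first failure pins it
    as the original ID, because every copy gets a fresh ID from the server.
    """
    envelope = json.loads(body)
    meta = envelope["meta"]
    if meta["original_id"] is None:
        meta["original_id"] = job_id
    meta["retries"] += 1
    meta["failures"].append({
        "at": time.time() if now is None else now,
        "reason": reason,
    })
    return json.dumps(envelope)
```

Every call produces a new body, which is exactly why the job has to be re-added as a copy today instead of being updated in place.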

The point of all these suggestions is to keep the ID of a job intact throughout its lifetime while allowing for more complex handling (delayed NACKing, moving between queues, storing extra details).

What do you think? Are the suggestions too complex? Are they useful?

mathieulongtin commented 8 years ago

I kind of like the BURY and KICK commands of Beanstalkd for that. When a job is problematic, you bury it: it stays in the queue but is never distributed. Once you fix the problem, you can kick it and it will be distributed again.

https://github.com/kr/beanstalkd/blob/v1.3/doc/protocol.txt
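
As a rough illustration of the bury/kick idea, here is a toy in-memory model in Python. It is not how Beanstalkd is implemented and it ignores priorities, delays and timeouts; `ToyQueue` and its method names are made up for this sketch.

```python
from collections import deque


class ToyQueue:
    """Toy model: ready jobs are handed out, buried jobs wait for a kick."""

    def __init__(self):
        self.ready = deque()
        self.buried = deque()

    def put(self, job_id):
        self.ready.append(job_id)

    def reserve(self):
        """Hand out the next ready job, or None if nothing is ready."""
        return self.ready.popleft() if self.ready else None

    def bury(self, job_id):
        """Park a problematic job; it keeps its ID but is not distributed."""
        self.buried.append(job_id)

    def kick(self, bound):
        """Move up to `bound` buried jobs back to ready; return how many moved."""
        kicked = 0
        while self.buried and kicked < bound:
            self.ready.append(self.buried.popleft())
            kicked += 1
        return kicked
```

The relevant property is that a buried job keeps its identity, so nothing has to be copied to park it or to bring it back.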

Another option for Disque would be to stay pretty bare-bones but allow Lua functions to be loaded for customized behaviour like you're describing. For example, a queue might have a Lua callback on NACK that sets the retry time or, if too many retries have been done, pushes the job elsewhere.

misiek08 commented 8 years ago

Lua callbacks sound just sexy. They would allow endless features to be added. If the Lua callback implementation supported multiple callbacks or a callback chain (each callback calling the next one, given as an argument), it would be really great.
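
To show what I mean by a callback chain, here is a small sketch, with Python standing in for Lua. Disque has no such hook API today; `build_chain`, the two example callbacks, the job fields and the thresholds are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Job = Dict[str, Any]
Next = Callable[[Job], Job]
Callback = Callable[[Job, Next], Job]


def build_chain(callbacks: List[Callback]) -> Next:
    """Compose callbacks so each one receives the next one as an argument."""
    def terminal(job: Job) -> Job:
        return job

    handler: Next = terminal
    for cb in reversed(callbacks):
        # Bind cb and the current handler now, not when the lambda runs.
        handler = (lambda c, nxt: lambda job: c(job, nxt))(cb, handler)
    return handler


def dead_letter(job: Job, next_cb: Next) -> Job:
    """After too many NACKs, reroute the job and stop the chain."""
    if job["nacks"] > 3:
        job["queue"] = "failed"
        job["delay"] = 0
        return job
    return next_cb(job)


def backoff(job: Job, next_cb: Next) -> Job:
    """Set an exponentially growing retry delay, then continue the chain."""
    job["delay"] = 2 ** job["nacks"]
    return next_cb(job)


on_nack = build_chain([dead_letter, backoff])
```

Calling `on_nack` with a job that has 2 NACKs falls through to `backoff`, while a job with 5 NACKs is rerouted by `dead_letter` and never reaches it.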

antirez / disque

[Suggestion] Move jobs between queues, change the job body, custom metadata for a better failure handling #174