edraut / coney_island

An industrial-strength background worker system for Rails using RabbitMQ.

Add retry on exception as config option and per-job option #2

Open · edraut opened this issue 9 years ago

edraut commented 9 years ago

Some applications need to default to having jobs retry on exception. Let's add that feature as an option.

If retry_on_exception is true, then offer these, with sensible defaults (a rough sketch of how they might combine follows the list):

  1. retry count
  2. retry interval style (e.g. linear, exponential)
  3. retry interval start (e.g. 30 seconds between attempts)
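
For concreteness, here's a rough sketch (in plain Ruby) of how those three options might combine into a backoff schedule. None of these names exist in ConeyIsland today; they're just an illustration of the defaults described above.

```ruby
# Hypothetical illustration only -- retry_on_exception and these option names
# are not part of ConeyIsland's API yet. Shows how count, interval style, and
# start interval might produce a delay schedule.
def retry_delay(attempt, style: :exponential, start: 30)
  case style
  when :linear      then start * attempt            # 30s, 60s, 90s, ...
  when :exponential then start * (2**(attempt - 1)) # 30s, 60s, 120s, ...
  end
end

retry_count = 3
(1..retry_count).map { |n| retry_delay(n) } # => [30, 60, 120]
```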
edraut commented 9 years ago

@darkside, does this feature description make sense to you? Not sure when we'll prioritize the work, but I want to document it.

darkside commented 9 years ago

@edraut Yes. For future reference, this is an interesting implementation: https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/middleware/server/retry_jobs.rb

edraut commented 9 years ago

@darkside, I like it in general.

A few caveats:

1) It seems to me that a 'dead job queue' is really just a log of information about jobs that we're not going to try to execute again. As such, some sort of log service is the most appropriate way to handle dead jobs. It would be awkward and potentially misleading to store them in the same type of queue that is used to execute jobs, since we're not going to execute them again without manual intervention. What we need is an easy way to "read" about the job, not a way to retry execution. A message queue is a terrible tool for that; a logging/exception service is an excellent one.
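
As a small illustration of the 'log, don't queue' idea (my sketch, not existing ConeyIsland behavior), the failure handler could simply write the dead job's details somewhere readable and ack the message:

```ruby
# Sketch only: record a dead job in a readable place instead of parking it in
# another queue. Rails.logger stands in for whatever log/exception service the
# app uses; the method name and parameters here are illustrative.
def log_dead_job(job_class, method_name, args, error)
  Rails.logger.error(
    "[ConeyIsland] dead job: #{job_class}##{method_name} " \
    "args=#{args.inspect} error=#{error.class}: #{error.message}"
  )
end
```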

2) I would rarely configure a worker system that might execute a job days after I enqueued it. The potential for problems is high, and the likelihood that the user who needed the result still finds it relevant is low. In our app, emails executing days after they were enqueued would in many cases actively mislead the recipient, as would report results. Push notifications would be entirely meaningless. I propose the solution that the RabbitMQ authors put forward: end-to-end acknowledgements. Here are the two cases I'm talking about, and I think ConeyIsland should just behave simply in both cases, leaving end-to-end acknowledgement to the custom app.

Timely Jobs

If the timing of a job is more important than a 100% execution rate, the app should just submit it to ConeyIsland and forget it. ConeyIsland should either ack to RabbitMQ on failure (i.e. remove it from RabbitMQ) and write a message to the dead-job log, or retry with a time limit proportional to the 'timeliness' definition of the queue. This decision should be configurable in ConeyIsland. N.B. We group jobs into queues by average execution time, as I recommend everyone should, so 'timeliness' would be defined as some factor of the queue's average execution time, e.g. the fast queue has an average time of 250ms, so the total time spent retrying a job might not exceed 2.5 seconds.
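
Here's a rough sketch of the 'factor of the average execution time' idea; the factor of 10 is my illustrative number, not anything ConeyIsland defines:

```ruby
# Illustrative sketch: cap total retry time at a multiple of the queue's
# average execution time. The factor of 10 is just an example default.
RETRY_BUDGET_FACTOR = 10

def retry_budget_seconds(avg_execution_seconds)
  avg_execution_seconds * RETRY_BUDGET_FACTOR
end

retry_budget_seconds(0.25) # fast queue, 250ms average => 2.5 seconds of retries
```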

Jobs which must execute, regardless of when

The app should maintain a store of job statuses as it creates jobs. It submits them to ConeyIsland, and the custom app's worker code must send job-complete messages to the app's job status store. ConeyIsland will follow its normal procedure, acking and logging to the dead-job log or retrying first, based on configuration. The custom app can implement whatever feature it wants with regard to jobs that have not completed after some period of time. It can resubmit them to ConeyIsland, or perhaps transform them based on time and other data/state before resubmitting them. A sketch of this pattern follows.
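
To make the end-to-end acknowledgement pattern concrete, here's one way an app could track completion itself. JobStatus and resubmit_or_transform are hypothetical app-side names, not anything ConeyIsland provides:

```ruby
# Hypothetical app-side job status store (ActiveRecord); not part of ConeyIsland.
# Columns assumed: job_key, enqueued_at, completed_at.
class JobStatus < ActiveRecord::Base
  scope :stale, -> { where(completed_at: nil).where("enqueued_at < ?", 1.hour.ago) }

  def self.track!(job_key)
    create!(job_key: job_key, enqueued_at: Time.current)
  end

  # The custom worker code calls this as its "job-complete" message.
  def self.complete!(job_key)
    where(job_key: job_key).update_all(completed_at: Time.current)
  end
end

# A scheduled task in the app decides what to do with jobs that never completed:
# JobStatus.stale.find_each { |status| resubmit_or_transform(status) }
```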

Based on extensive experience with heavy-duty background systems, this implementation covers all the use cases with the most stable worker-system architecture. If you want eventual execution, you must implement it in your app. It's presumptuous for a generic background worker tool to assume that eventual execution is always more important than timeliness. Whenever an app actually requires eventual execution beyond some reasonable, small timeout (which is extremely rare), it is a complex enough app that it makes sense for it to decide how to implement such a feature.

One premise here is that one should never design a worker system that keeps jobs in the worker for longer than some reasonable factor of the average execution time. By adhering to this, the worker system remains stable and avoids the whole hung-worker problem.

Also, it's worth pointing out that grouping jobs onto workers by average execution time is a huge boost to utility for a business. A job that takes ten minutes on average will delay a shorter job whose expected completion time is closer to a few seconds. It would be lovely to decompose every job into tiny chunks, but simply grouping jobs by average execution time gets us everything we want without the significant complexity that decomposition involves. Less complexity means fewer bugs and happier users. Even if you decompose the 10-minute job into 600 one-second jobs, your queue will regularly be flooded with sub-jobs, quick jobs will still be slowed down by slow ones, and the result will still be undesirable to the user.
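
A trivial sketch of that grouping idea, with illustrative thresholds and queue names:

```ruby
# Illustrative only: route each job to a queue of similarly-sized jobs so
# 10-minute work never sits in front of sub-second work.
def queue_for(avg_runtime_seconds)
  case avg_runtime_seconds
  when 0...1   then :fast    # emails, push notifications
  when 1...60  then :medium  # reports, slower API calls
  else              :slow    # long imports, 10-minute batch jobs
  end
end
```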

Thoughts?