albrow / jobs

A persistent and flexible background jobs library for go.
MIT License

Implement spread out retries #4

Open albrow opened 9 years ago

albrow commented 9 years ago

Currently, if a job fails it will be immediately queued for retry. This is appropriate in some but not all circumstances. For example, if a third-party API is down for a few hours, retrying the job immediately would cause it to be retried many times before permanently failing. It would be better to spread out the retries over time. E.g. the first retry is immediate, the next one is 15 minutes later, the next one is 1 hour later, etc.
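
Here is a minimal, standalone Go sketch of the kind of spread-out schedule described above; the helper name and the specific delays are illustrative only, not part of the jobs API.

```go
package main

import (
	"fmt"
	"time"
)

// retryDelay illustrates the behavior described above: the first retry is
// immediate, and later retries wait progressively longer. Purely
// illustrative; this helper is not part of the jobs package.
func retryDelay(attempt int) time.Duration {
	schedule := []time.Duration{
		0,                // 1st retry: immediate
		15 * time.Minute, // 2nd retry
		1 * time.Hour,    // 3rd retry
		4 * time.Hour,    // later retries
	}
	if attempt < len(schedule) {
		return schedule[attempt]
	}
	return schedule[len(schedule)-1]
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("retry %d after %v\n", attempt+1, retryDelay(attempt))
	}
}
```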

epelc commented 9 years ago

I think you're looking for exponential backoff.

http://en.wikipedia.org/wiki/Exponential_backoff
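
As a concrete illustration (assuming `time` is imported; none of this is the jobs API), a minimal exponential backoff looks like:

```go
// backoff doubles the delay with each failed attempt, capped at max.
// Illustrative sketch only.
func backoff(attempt uint, base, max time.Duration) time.Duration {
	d := base << attempt // base * 2^attempt
	if d > max || d <= 0 { // past the cap, or the shift overflowed
		return max
	}
	return d
}
```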

epelc commented 9 years ago

@albrow would you accept a pr to fix this?

We're running into this in production because we use APIs with unreliable uptime, especially on weekends. A ton of jobs are failing and we have to manually go restart them.

I think it would require adding a parameter to the Schedule and ScheduleRecurring functions, which would be a breaking change. To avoid breaking changes in the future, we could switch them to accept a schedule struct instead; that way you could add options later without breaking anything. Something like the sketch below.
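
A rough sketch of the schedule-struct idea, with all field names hypothetical rather than anything in the current jobs API:

```go
// Hypothetical options struct: grouping the arguments to Schedule /
// ScheduleRecurring lets new fields (like retry backoff) be added later
// without breaking existing callers. None of these names exist in the
// library today.
type ScheduleOptions struct {
	Priority   int
	Time       time.Time
	Freq       time.Duration // only meaningful for recurring jobs
	Retries    uint
	RetryDelay func(attempt uint) time.Duration // e.g. exponential backoff
}
```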

albrow commented 9 years ago

@epelc I'm not going to have time to implement this anytime soon, but I would be happy to review a PR for it :) Couldn't we make this a field (or fields) of PoolConfig with some sensible default values to make it a non-breaking change?
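
One possible non-breaking shape for this, where zero values preserve today's immediate-retry behavior; the field names are hypothetical, not the library's current PoolConfig:

```go
// Hypothetical backoff settings that could live on PoolConfig. Zero values
// keep the current behavior (retry immediately), so existing configs keep
// working unchanged.
type retryBackoffConfig struct {
	RetryBaseDelay time.Duration // delay before the first retry (0 = immediate)
	RetryMaxDelay  time.Duration // upper bound on the delay between retries
}
```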

epelc commented 9 years ago

@albrow I think that'd work well if you have a single job type or they are all similar. But if you're hitting different APIs, it'd then require separate pools.

I think we could get away with the pool config in our app, but I'm not sure how others are using this. If you have a lot of different job types, it might be problematic. Let me know your thoughts though. I'll do either one.

albrow commented 9 years ago

Hmm... as I understand it, one of the great things about using exponential backoff is that it handles a variety of failure conditions pretty well. For example, it handles both a temporary, one-time failure and a case where, say, a service is down for the weekend. I think this is why delayed_job, a popular Ruby gem from which I drew some inspiration, doesn't let you tweak the exponential backoff parameters. My opinion is that we should add one or two parameters to PoolConfig for now. When I finally get the time to fix #14, it will be easier to express different options for individual job types, so we can revisit this then.
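
For reference, a fixed, non-configurable backoff in the spirit of delayed_job's default (roughly attempts^4 plus a small constant, in seconds) could be sketched as follows; the exact formula is an approximation for illustration, not taken from either library:

```go
// fixedBackoff mimics a delayed_job-style polynomial backoff: roughly
// attempts^4 plus a small constant, in seconds. Approximate and
// illustrative only.
func fixedBackoff(attempts int) time.Duration {
	n := int64(attempts)
	return time.Duration(n*n*n*n+5) * time.Second
}
```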

epelc commented 9 years ago

Sounds good. I'll add some sort of option to the PoolConfig like you said.