albrow / jobs

A persistent and flexible background jobs library for Go.
MIT License

Find a better way to purge stale pools and re-queue stale jobs #6

Status: Open. albrow opened this issue 9 years ago.

albrow commented 9 years ago

Currently, during initialization a pool pings all the other pools to determine whether any of them have gone down. If one has, any jobs that belonged to that pool and were still executing are considered stale and are re-queued. This keeps jobs from staying stale indefinitely, as long as every time a worker pool machine goes down it is either rebooted or replaced by another machine.

It would be better if this process occurred periodically instead of just on initialization. The frequency of the pings should be configurable.

albrow commented 9 years ago

On second thought, having each worker pool periodically ping every other worker pool might be okay for a small number of pools. But if the number of pools gets really large it could become impractical. Going to research other ideas. Would appreciate hearing any suggestions!
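A quick count shows why, assuming each of $n$ pools pings every other pool once per check interval. The total number of pings per interval is then

```latex
P(n) = n(n-1), \qquad P(10) = 90, \qquad P(100) = 9900, \qquad
\frac{P(2n)}{P(n)} = \frac{2(2n-1)}{n-1} \approx 4 \text{ for large } n
```

so doubling the number of pools roughly quadruples the ping traffic.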

utrack commented 9 years ago

Maybe create a sorted set (zset) whose members are worker pool ids and whose scores are the time of each pool's last refresh, and have every pool refresh its own entry every 10 seconds or so?

If a pool hasn't managed to refresh its entry within, say, 20 seconds, it would be considered down: its jobs would be re-queued and its entry removed from the zset. If two pools discover at the same time that a third pool has gone down, they will simply drain the third pool's job queue between them without any side effects, since each Redis command executes atomically.

However, each pool should also check that its own entry hasn't been removed from the database, at least if we want to guarantee that each job is executed only once.