drfeelngood / resque-batched-job

Resque plugin that understands individual jobs can belong to something bigger than themselves
https://rubygems.org/gems/resque-batched-job
MIT License
49 stars, 23 forks

after_batch callback not always firing #4

Closed (xentek closed this issue 12 years ago)

xentek commented 12 years ago

From time to time we'll have a batch where all the jobs complete, but the after_batch callback doesn't fire. When we look further, we see that the batch queue in Redis still has entries (which is what keeps the callback from firing), and we're not sure why. None of the jobs failed (and failed jobs are supposed to remove themselves from the batch queue anyway), and we can't see anything else that would cause this.

Any ideas?

Our stack:

Rails 3.0.10
Resque 1.19.0
Resque Batched Job 0.1.5

Let me know if you need more information. Thanks!

drfeelngood commented 12 years ago

Do you have multiple worker processes running on the same queue?

xentek commented 12 years ago

We do.

We have 3 queues: High, Medium, Low. We have two worker groups:

In production we have 5 servers running 60 workers each. Each box has 48 "Work One" workers and 12 "Work Two" workers, for a total of 300 workers.

drfeelngood commented 12 years ago

There might be a situation where two worker instances are altering the batched job list at the same time. I'm currently investigating key locking when updating the batched job list.
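One way such a race could leave stale entries behind, sketched under loud assumptions: this is not the gem's actual code. It assumes the batch list is updated with a read-modify-write cycle, and it uses a plain Ruby hash as a hypothetical stand-in for Redis. The key name `batch:42` and the job names are made up for illustration.

```ruby
# Hypothetical in-memory stand-in for a Redis key holding the batch list.
store = { "batch:42" => ["job_a", "job_b"] }

# Both workers finish at about the same time and read the list
# before either one has written its update back.
seen_by_worker_one = store["batch:42"].dup
seen_by_worker_two = store["batch:42"].dup

# Each worker removes only its own job from its private copy, then writes it back.
store["batch:42"] = seen_by_worker_one - ["job_a"] # list is now ["job_b"]
store["batch:42"] = seen_by_worker_two - ["job_b"] # overwrites it with ["job_a"]

# Both jobs are done, but the second write clobbered the first (a lost update).
remaining = store["batch:42"]
batch_complete = remaining.empty?
puts "remaining: #{remaining.inspect}, after_batch would fire: #{batch_complete}"
```

With this interleaving the list still contains `job_a` even though every job finished, so a completion check that waits for an empty list would never pass.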

xentek commented 12 years ago

Cool! Anything we can do to help?

~ Eric Marden

On Dec 9, 2011, at 9:39 PM, Daniel Johnston reply@reply.github.com wrote:

There might be a situation where two worker instances are altering the batched job list at the same time. I'm currently investigating key locking when updating the batched job list.

drfeelngood commented 12 years ago

Definitely. I'm looking at SETNX and "optimistic locking" using WATCH. SETNX might be the better solution, and it would be cool to put together some tests. Scope out the "lists" branch I have going. That code will be in the next release and is probably the best place to hack on.
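For readers unfamiliar with the SETNX approach, here is a minimal sketch of the idea, not the code from the "lists" branch or any release. `FakeRedis`, `with_batch_lock`, and the key names are hypothetical; the in-memory class only mimics the "set if not exists" semantics of SETNX so the example runs without a Redis server.

```ruby
# Hypothetical in-memory stand-in for Redis. Each method is atomic,
# mirroring the way Redis executes single commands one at a time.
class FakeRedis
  def initialize
    @data = {}
    @guard = Mutex.new
  end

  # Set the key only if it is absent; returns true when the key was set.
  def setnx(key, value)
    @guard.synchronize do
      next false if @data.key?(key)
      @data[key] = value
      true
    end
  end

  def get(key)
    @guard.synchronize { @data[key] }
  end

  def set(key, value)
    @guard.synchronize { @data[key] = value }
  end

  def del(key)
    @guard.synchronize { @data.delete(key) }
  end
end

# Retry until the lock key is acquired, run the block, then always release.
def with_batch_lock(redis, batch_id)
  lock_key = "batch:#{batch_id}:lock"
  sleep 0.001 until redis.setnx(lock_key, "1")
  begin
    yield
  ensure
    redis.del(lock_key)
  end
end

redis = FakeRedis.new
jobs = (1..20).map { |i| "job_#{i}" }
redis.set("batch:42", jobs)
callbacks_fired = 0

# Each thread plays a worker finishing one job of the batch.
workers = jobs.map do |job|
  Thread.new do
    with_batch_lock(redis, 42) do
      remaining = redis.get("batch:42") - [job] # read-modify-write, now serialized
      redis.set("batch:42", remaining)
      callbacks_fired += 1 if remaining.empty? # stands in for after_batch
    end
  end
end
workers.each(&:join)

puts "remaining: #{redis.get('batch:42').size}, callbacks fired: #{callbacks_fired}"
```

Because only one worker at a time can hold the lock key, no update is lost and the completion check passes exactly once. Against a real Redis server the lock would also need an expiry, so that a worker that dies while holding it does not block the batch forever.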

drfeelngood commented 12 years ago

Eric,

I just rolled v1.0.0, which implements a locking scheme. This should solve the race conditions that I believe you're experiencing. Let me know how it works out.

xentek commented 12 years ago

Wow, that was fast. We have a deploy scheduled tomorrow. I'll be sure to upgrade before we send it out.

~ Eric Marden

On Dec 11, 2011, at 11:14 AM, Daniel Johnston reply@reply.github.com wrote:

Eric,

I just rolled v1.0.0, which implements a locking scheme. This should solve the race conditions that I believe you're experiencing. Let me know how it works out.

drfeelngood commented 12 years ago

Are you still experiencing race conditions after the upgrade?

xentek commented 12 years ago

We had some issues when we first tried the new version, but we just finished upgrading our entire stack (Rails and all gems), and so far so good. Thanks!