breamware / sidekiq-batch

Sidekiq Batch Jobs Implementation
MIT License
357 stars 95 forks source link

Success Callback calling before all jobs are complete #20

Closed davidrichey closed 4 years ago

davidrichey commented 5 years ago

With more logging, I've been able to spot jobs executing after the on_success callback execution. I'm not sure what detail I can provider, but here we go. I am happy to provide more I'm just not sure what else would be helpful.

Redacted logs:

Nov 08 05:09:41pm info Starting batch b-5j1infmvnMQw
Nov 08 05:09:43pm info Running in batch b-5j1infmvnMQw
Nov 08 05:09:44pm info Running in batch b-5j1infmvnMQw
Nov 08 05:09:46pm info Running in batch b-5j1infmvnMQw
Nov 08 05:09:47pm info Running in batch b-5j1infmvnMQw
Nov 08 05:09:48pm info Completed batch b-5j1infmvnMQw
Nov 08 05:09:48pm info Successfully ran batch b-5j1infmvnMQw {"total":0,"failures":0,"pending":0,"created_at":null,"complete":false,"failure_info":[],"parent_bid":null}
Nov 08 05:10:00pm info Running  in batch b-5j1infmvnMQw
Nov 08 05:10:01pm info Running  in batch b-5j1infmvnMQw
Nov 08 05:10:04pm info Running  in batch b-5j1infmvnMQw

_Thought it was strange the total field in the data was 0, testing this locally as well. The total is correct up until the onsuccess is called, it then goes to 0.

Code that creates the batch:

id = '1234' # fake example of parameters
batch = Sidekiq::Batch.new
batch.on(
  :success,
  Integration::MethodWorker,
  id: id
)
batch.on(:complete, MyWorker)
batch.callback_queue = 'queue'
batch

Workers enqueuing

batch.jobs
  MyWorker.perform_in(2.seconds)
  MyWorker.perform_in(3.seconds)
  MyWorker.perform_in(4.seconds)
  MyWorker.perform_in(5.seconds)
  MyWorker.perform_in(6.seconds)
  MyWorker.perform_in(7.seconds)
  MyWorker.perform_in(8.seconds) # there is an if statement on this (wouldn't think that matters)
en

Environment:

Ruby: 2.5.0
Rails: 5.1.4
ndbroadbent commented 5 years ago

@davidrichey Did you find a solution for this? It's my first time trying the sidekiq-batch gem, so I had a quick look at the open issues, and this sounds like a serious bug. I guess this only applies to scheduled jobs that run in the future, so maybe I'll be ok for now, because I don't need batches for scheduled jobs.

davidrichey commented 5 years ago

I have not unfortunately, we had to roll our own solution since this was not resolved.

ndbroadbent commented 5 years ago

Oh no, I just ran into this issue as well!

Fortunately I've added a lot of profiling data to my jobs and included timestamps, so I'm 100% sure that the on_success callback was fired too early, and before the job had finished.

I'm actually using database records to track all of my jobs as well. I was doing a sanity check in the on_success callback to ensure that the state had transitioned to processed, but it crashed because it was still pending.

For the job that was still pending and was supposed to be processed, I saved these timestamps:

"job_start_time": "14:10:47.002"
"job_end_time": "14:11:48.628"

I saw that my on_success callback updated the record with an error at 14:11:46, which is before the batch job had even started.

@managr - I was wondering if you are calling the on_success callback as soon as there are no jobs left in the queue? Or are you waiting for the jobs to finish before calling it?

EDIT: In the meantime I might just go back to my old way of managing batches manually with a database lock. I still use Sidekiq, but I track the total and pending counts in the database, and use a database lock to decrement the pending count. Then when it reaches zero I trigger the on_success callback. But I was trying to get away from that, so hopefully this can be fixed.

phildionne commented 5 years ago

I'm running in a similar problem, where a relatively complex batch with children batches is calling on_success as soon as its first child completes.

This is a serious bug that prevents this gem to be reliable for any serious project.

nglx commented 5 years ago

@phildionne any chance that you could test if https://github.com/breamware/sidekiq-batch/pull/26 is fixing your issue or not?

phildionne commented 5 years ago

@managr just tried and it has indeed worked successfully! how far are you from merging #26 ? let me know if you'd like me to do other tests.

nglx commented 5 years ago

@phildionne let me ping @jbrady42, he mentioned that he'll be fixing the PR, I don't really want to do another PR to his one (or grabbing his code).

OrrYakobi commented 5 years ago

Any idea when the PR will be released?

rdsoze commented 4 years ago

Is this released?

nglx commented 4 years ago

Yes, this was released in 0.1.6 https://rubygems.org/gems/sidekiq-batch/versions/0.1.6

github-actions[bot] commented 4 years ago

Stale issue message