chadrem / workers

Workers is a Ruby gem for performing work in background threads. Multi-threading made simple.
http://remesch.com/2013/01/22/workers-and-tribe-ruby-threading-simplified/
MIT License
247 stars 17 forks source link

Race Condition with Nested Timers Causing Deadlock #9

Open movitto opened 6 years ago

movitto commented 6 years ago

Thanks for the great library, I've been using it in a local project.

I've encountered an issue where a infrequent race condition is causing Threads to deadlock unexpectedly. Specifically, if a context switch occurs during the course of Scheduler#process_overdue in lib/workers/scheduler.rb, betweens the call to @pool.perform and timer.reset, and we switch to a thread that attempts to launch a timer:

If there are other threads running in the pool, no Deadlock error will be raised (I believe ruby will not be able to detect this)

The critical section in question is here: https://github.com/chadrem/workers/blob/master/lib/workers/scheduler.rb#L82

We can demonstrate it with the following code:

require 'workers'

timer = Workers::Timer.new(0.01) do
  puts "T1"

  Workers::Timer.new(0.01)  do
    puts "T2"
  end
end

sleep(5)

Note that unless a context switch occurs exactly between the call to @pool.perform and timer.reset, this issue will not be apparent and the example above will execute successfully. To force this context switch simply add a 'sleep' to the #process_overdue function (not meant for production, just for testing / demonstrating this issue):

      overdue.each do |timer|
        @pool.perform(timer.callback) do
          timer.fire
        end
sleep 0.01   # <===========   force a "context switch"

        timer.reset
        @schedule << timer if timer.repeat
      end

During normal operation, when the context switch does not occur, both Timers run succesfully and all output is seen. During the buggy edge case, we only see the first timer execute, before the mutexs block until the process exits.

chadrem commented 6 years ago

@movitto great analysis of the problem and also for coming up with a simple demo. Nesting timers wasn't something I needed or designed for. Time for this project is very limited these days so I can't look into this. If you come up with a good solution then please send a PR. My recommendation though is look into if you can avoid nesting the timers.

movitto commented 6 years ago

@chadrem thx for the quick response. I tried changing the order of operation, eg resetting the timer before firing the callback and the issue seemed to be resolved, even with the 'artificial' forced context switch.

PR sent. Let me know if this won't work for whatever reason and we can try to figure something else out.

chadrem commented 6 years ago

This issue is most likely solved in PR #10 (leaving a note here in case anyone else runs into it).

chadrem commented 6 years ago

@movitto I started review your code and I am thinking about a potential different solution. I need to dig in deeper, but upon first looking at it, it may be possible to get rid of @mutex in Timer.rb. If so, this would simplify the code since there would only be one mutex in Scheduler.rb and thus a lot less chance of deadlocks. I'm not going to make any changes until I inspect all the code and verify where locks are/aren't needed.