basho / riak_core

Distributed systems infrastructure used by Riak.
Apache License 2.0

Deadlock reappears in riak_core_worker_pool. #958

Closed rjmh closed 4 years ago

rjmh commented 4 years ago

An old deadlock has reappeared in riak_core_worker_pool. The original problem and its fix are described here (from 2013):

https://github.com/basho/riak_core/issues/298

It appears that the fix didn't entirely solve the problem.

The problem is demonstrated by worker_pool_pulse, in the case when the worker pool is of size 1, and the work to be done is [0,die,0] (three tasks, of which the second one dies). For this test case, about one in 16,000 seeds leads to the bug manifesting; it can be provoked every time by supplying the right seed, as follows.

```erlang
test_deadlock() ->
    eqc:check(prop_any_pool(), [{{77264,21542,12323}, true, [0,die,0]}]).
```

Briefly, riak_core_worker_pool (RCWP) uses poolboy to create its workers, but recycles workers for new tasks without passing them back through poolboy. It has two main states, 'ready' and 'queueing'. In the ready state, it asks poolboy for a new worker when new work arrives, until poolboy replies 'full', at which point it queues the work and enters the queueing state. In the queueing state, new work is always queued, and jobs are handed out to workers from the queue as they check back in (when a job is finished). The RCWP returns to the ready state only when a worker checks in while there are no jobs in the queue, in which case the worker is returned to poolboy. Otherwise RCWP remains in the queueing state, even if the queue is empty.

When a worker dies, it will never check back in to RCWP, but poolboy will notice and replace it. The new worker sends worker_start to RCWP, which then checks the worker out from poolboy and gives it a job--if there is work waiting to be done. But if the queue is empty, then RCWP just stays in the same state, and will never ask poolboy for the replacement worker unless another worker checks in while the queue is empty (prompting a return to the ready state). If the pool size is one, then RCWP deadlocks the first time this happens. I believe it could also happen repeatedly, if the work load is high enough that RCWP never returns to the ready state--every time a worker crashes while the queue is empty, that worker will be lost forever, leading to deadlock eventually.
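The two paragraphs above can be replayed as a toy model. This is a hypothetical Python sketch of the buggy protocol, not the Erlang code: `ToyPool`, `ToyRCWP`, and the event methods are invented names that mirror the description (checkout vs. 'full', check-in, and the worker_start message a replacement worker sends).

```python
from collections import deque

class ToyPool:
    """Poolboy stand-in: hands out up to `size` workers."""
    def __init__(self, size):
        self.idle = size
    def try_checkout(self):
        if self.idle > 0:
            self.idle -= 1
            return "worker"          # an opaque worker token
        return None                  # poolboy replied 'full'
    def checkin(self):
        self.idle += 1

class ToyRCWP:
    """riak_core_worker_pool stand-in, following the description above."""
    def __init__(self, size):
        self.pool = ToyPool(size)
        self.state = "ready"
        self.queue = deque()
        self.running = []            # jobs handed to workers

    def work(self, job):
        if self.state == "ready":
            w = self.pool.try_checkout()
            if w:
                self.running.append(job)
            else:                    # poolboy said 'full'
                self.queue.append(job)
                self.state = "queueing"
        else:                        # queueing: always queue new work
            self.queue.append(job)

    def checkin(self):               # a worker finished and checks in
        if self.queue:
            self.running.append(self.queue.popleft())
        else:                        # queue empty: return worker, go ready
            self.pool.checkin()
            self.state = "ready"

    def worker_start(self):          # replacement for a crashed worker
        if self.queue:
            self.pool.try_checkout()
            self.running.append(self.queue.popleft())
        # BUG: if the queue is empty, the replacement is never checked
        # out, and RCWP stays in 'queueing' with nothing to wake it up

# Replay the failing schedule for pool size 1 and work [0, die, 0]:
rcwp = ToyRCWP(size=1)
rcwp.work(0)            # the only worker is checked out, runs job 0
rcwp.work("die")        # pool 'full' -> queued, state -> queueing
rcwp.checkin()          # job 0 done; 'die' handed straight to the worker
# The worker crashes on 'die'; poolboy notices and spawns a replacement.
rcwp.pool.idle += 1
# In the failing interleaving, worker_start arrives before the next work:
rcwp.worker_start()     # queue empty -> replacement silently lost
rcwp.work(0)            # queued, but no worker will ever check in
print("deadlocked:", rcwp.state, list(rcwp.queue))
```

Under this interleaving the replacement worker sits idle inside poolboy while RCWP stays in 'queueing' with work on the queue, which is exactly the deadlock PULSE finds.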

Here is the PULSE visualization of the deadlock. The deadlock is in the RCWP process on the left; in the last two steps, the worker_start message arrives before the next work.

[figure: graph1, PULSE visualization of the deadlock]

martinsumner commented 4 years ago

So, does the problem just come down to this false assumption here:

https://github.com/basho/riak_core/blob/develop-3.0/src/riak_core_worker_pool.erl#L176

The actual state could still be queueing; since we now know the queue is empty and a worker is available, in this case the StateName can safely be reverted to ready.
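As a hedged sketch of that suggestion (a standalone Python toy function with invented names, not the Erlang code): when worker_start arrives in the queueing state and the queue is empty, a worker is known to be available, so the FSM can safely fall back to ready instead of staying stuck.

```python
def on_worker_start(state, queue):
    """Toy model of the proposed fix for the worker_start handler.

    Returns (new_state, remaining_queue, job_handed_out).
    """
    if queue:
        # work is waiting: check the replacement out and hand it a job
        return state, queue[1:], queue[0]
    # fix: the queue is empty and a worker is available, so reverting
    # the StateName to 'ready' is safe and avoids losing the worker
    return "ready", queue, None

print(on_worker_start("queueing", []))       # reverts to ready
print(on_worker_start("queueing", ["job"]))  # hands out the queued job
```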

Have I understood this correctly?

rjmh commented 4 years ago

I think so. I'm testing that change now... the same scenario has passed about 5 million tests so far, and I'll run random scenarios next.

rjmh commented 4 years ago

That fix seems to work. I've submitted a pull request.

martinsumner commented 4 years ago

Thank you! I'll try to take a look soon. Impressive to see PULSE back in action.

martinsumner commented 4 years ago

https://github.com/basho/riak_core/pull/959

albsch commented 4 years ago

I know this is a bit late, but a cleanup and rewrite to a gen_statem revealed that the assumption in Line 176 was false (as can be seen in Lines 164, 187, and 227). I wondered back then why it said these were the only two possible StateNames, but didn't think much about what could happen when the state was queueing.