bshoshany / thread-pool

BS::thread_pool: a fast, lightweight, and easy-to-use C++17 thread pool library
MIT License
2.21k stars 253 forks source link

[BUG] Deadlock if wait_for_tasks() is called concurrently #110

Closed alugowski closed 1 year ago

alugowski commented 1 year ago

Describe the bug

Deadlock if wait_for_tasks() is called from multiple threads at the same time.

Minimal working example

#include <chrono>
#include <iostream>
#include <thread>
#include "BS_thread_pool.hpp"

using namespace std::chrono_literals;

void func() {
    BS::thread_pool pool{1};
    std::atomic<bool> go{false};
    std::atomic<bool> thread_started{false};

    auto t = std::thread([&]{
        thread_started = true;
        while (!go) {}
        pool.wait_for_tasks();
    });
    pool.push_task([]{std::this_thread::sleep_for(2ms);});
    while (!thread_started) {}
    go = true;
    pool.wait_for_tasks();

    t.join();
}

int main() {
    for (int i = 0; i < 1e6; ++i) {
        func();
        std::cout << i << " ";
        std::cout.flush();
    }
    std::cout << "done" << std::endl;
    return 0;
}

Behavior

Expectation is that both wait_for_tasks() will block until the tasks are completed then both will unblock. In some situations this does not happen, deadlocking one of the threads.

Example execution of above:

$ ./wait_for_tasks_deadlock
0 1 2 3 4 5 6

Edit: Added thread_started to example code to make the problem appear even faster.

bshoshany commented 1 year ago

Nice catch, thanks for posting this issue! I believe this happens if the second wait_for_tasks() is launched exactly when the first one is between the lines if (!waiting) and waiting = true;, so it hasn't yet had a chance to set waiting to true.

The ideal solution is to add an additional mutex to ensure that waiting stays locked until its value is set to true. I will add that in the next release. Meanwhile, a simple solution is to replace in wait_for_tasks()

task_done_cv.wait(tasks_lock, [this] { return (tasks_total == (paused ? tasks.size() : 0)); });

with

task_done_cv.wait(tasks_lock, [this] { return (tasks_total == (paused ? tasks.size() : 0)) || !waiting; });

and similarly in the other two wait functions if desired.

alugowski commented 1 year ago

Note: I added an extra variable to the example code to make the problem appear even faster.

Meanwhile, a simple solution

That does not seem to have an effect.

alugowski commented 1 year ago

if the second wait_for_tasks() is launched exactly when the first one is between the lines if (!waiting) and waiting = true;, so it hasn't yet had a chance to set waiting to true.

Wait, what? Both threads set waiting to true, presumably as it should be.

This actually brings up another possibility: If thread 2 entered wait_for_tasks after thread 1 already set waiting to true, then thread 2 wouldn't wait at all! Sure that's not a deadlock, but it's incorrect behavior.

alugowski commented 1 year ago

FYI, atomics have been specifically built to make test-and-set type operations, well, atomic. So something like if (!waiting) { waiting = true} defeats the purpose of making waiting atomic.

One of these would be more appropriate. I'd recommend the strong variants for this project. https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange

Though in this case if (!waiting) introduces a behavior bug, even if done atomically.

bshoshany commented 1 year ago

You're 100% right, I appear to have made a mistake and wait_for_tasks() is not behaving as expected. However, both the incorrect behavior and the deadlock can be resolved as follows:

  1. Delete the statement if (!waiting) along with its 2 braces, since as you pointed out, that doesn't make sense - if the second thread calls wait_for_tasks(), it will not block as expected. (This essentially returns the function to the form it had in v3.3.0.)
  2. In worker(), simply change task_done_cv.notify_one(); to task_done_cv.notify_all(); so that it notifies all the waiting threads, not just one of them; this was something I actually overlooked in the original design, and it is the cause of the deadlock, as far as I can tell.

I tested this with your minimal working example and did not encounter any deadlocks (I let it count to 30,000 and then stopped it manually). I also have a test in BS_thread_pool_test.cpp to test for deadlocks in wait_for_tasks(), and it too passes successfully with this change.

This fix will be posted within a few days with release v3.5.0 (along with a few other fixes I've been working on).

Thanks again for pointing this out, it was very helpful! :)

alugowski commented 1 year ago

Great!

Delete the statement if (!waiting) along with its 2 braces,

I think that's correct. Notice that now the reproducing code always deadlocks on the first iteration. That's what I expected it to do when I wrote the code, I think the fact it took multiple iterations with stock 3.4.0 is that the second thread would simply not wait.

task_done_cv.notify_all();

Yes, this fixes it!

bshoshany commented 1 year ago

Update: This issue has been fixed in v3.5.0, which was just released. Thanks again!