ReactiveX / RxCpp

Reactive Extensions for C++

Deadlock in multicast_observer #555

Open mxgrey opened 3 years ago

mxgrey commented 3 years ago

I've run into a deadlock that I haven't been able to reproduce in a minimal example. It appears to be a very rare race condition, and the only way I've found to trigger it reliably is by repeatedly running a large set of convoluted unit tests (written for an application I'm working on) until one of the runs happens to hit it. I often have to leave the tests running on repeat for 1-2 hours (potentially hundreds of reruns) before the deadlock occurs. I still don't know exactly what conditions need to align to cause it, but luckily I do know what the stack trace looks like when it happens (ordered from the bottom of the stack to the top):

  1. multicast_observer::add
  2. subscriber::add
  3. composite_subscription::add
  4. composite_subscription_inner::add
  5. composite_subscription_state::add
  6. subscription::unsubscribe
  7. subscription_state::unsubscribe
  8. static_subscription::unsubscribe
  9. multicast_observer::add::<lambda>

The deadlock happens because the same mutex gets locked twice on one thread (as shown in the stack trace above): [i] and [ii].
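To show the hazard in isolation, here is a distilled sketch (not RxCpp's actual code; the names are made up) of what the two lock acquisitions amount to. Re-locking a non-recursive `std::mutex` from the thread that already holds it is undefined behavior and in practice hangs:

```cpp
#include <mutex>

std::mutex m;  // stands in for the mutex in multicast_observer's state

void cleanup_lambda() {
    // Frame [9]: the lambda registered by multicast_observer::add tries to
    // take the same lock again. Re-locking a std::mutex on the thread that
    // owns it is undefined behavior and deadlocks on typical implementations.
    std::lock_guard<std::mutex> inner(m);
}

void add() {
    std::lock_guard<std::mutex> outer(m);  // frame [1]: first lock
    // ... the new subscription turns out to be unsubscribed, so its
    // teardown runs synchronously on this same thread (frames [5]-[8]) ...
    cleanup_lambda();                      // frame [9]: second lock -> hang
}

int main() { add(); }  // never returns
```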

In most cases this won't happen, because this whole branch is guarded by the condition that the observer is subscribed, so we can usually rely on that condition to prevent frame [5] in the stack trace from being run.

The race condition appears to be that somehow, between frame [1] and frame [5], another thread changes the observer's state from subscribed to unsubscribed. As I mentioned at the start, I haven't found a way to reproduce this minimally, but assuming it's possible for another thread to flip the observer to unsubscribed in that window, the stack trace should make it clear that what I've described is a deadlock hazard.
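For what it's worth, here is my best guess at the interleaving, written as a hypothetical reduction (the `subscribed` flag and the function names are illustrative, not RxCpp's actual members):

```cpp
#include <atomic>
#include <mutex>

std::atomic<bool> subscribed{true};
std::mutex m;

// Thread A, inside multicast_observer::add (frame [1]):
void add() {
    if (subscribed.load()) {        // the guard from the previous paragraph passes
        // <-- window: thread B runs unsubscribe() right here
        std::lock_guard<std::mutex> guard(m);  // first lock
        // By frame [5] the state is already unsubscribed, so the new
        // subscription is immediately unsubscribed (frames [6]-[8]) and the
        // cleanup lambda (frame [9]) tries to lock `m` again on this thread.
    }
}

// Thread B, concurrently:
void unsubscribe() {
    subscribed.store(false);        // flips the observer to unsubscribed
}
```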

This race condition was happening for me on release v4.1.0, which I understand is a few years behind master, but the problematic code path still seems to exist: the lines I linked above are from the latest master.

A very easy way to fix this problem is to change this std::mutex to a std::recursive_mutex (and, of course, to change the template parameter on the locking utilities that use it). I'm happy to provide a PR with the fix, but I don't know how to write a regression test that proves it.
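For clarity, a minimal sketch of the swap (again with illustrative names; in RxCpp it would be the mutex in multicast_observer's state plus the `std::unique_lock` instantiations that use it):

```cpp
#include <mutex>

// A recursive mutex tracks its owning thread and a lock count, so the same
// thread may re-lock it; other threads still block exactly as before.
std::recursive_mutex m;

void cleanup_lambda() {
    std::lock_guard<std::recursive_mutex> inner(m);  // frame [9]: re-lock succeeds
}

void add() {
    std::lock_guard<std::recursive_mutex> outer(m);  // frame [1]: first lock
    cleanup_lambda();  // safe: recursive re-entry on the owning thread
}

int main() { add(); }  // returns normally
```

The cost is a slightly heavier lock, plus the usual caveat that recursive mutexes can hide unintended re-entrancy elsewhere.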