Cysharp / R3

The new future of dotnet/reactive and UniRx.
MIT License
1.71k stars 70 forks source link

`ReplaySubject` race condition #228

Open nepolak opened 1 week ago

nepolak commented 1 week ago

I've spotted a bug with ReplaySubject, which may leave newly-subscribed observer without some new notifications in rare cases in a current version (v. 1.1.12)

When an observer subscribes to ReplaySubject, it receives stored notifications under a lock, but then it is registered to ReplaySubject without the lock. If in between these lines, OnNext is called with a new notification from another thread, the observer will miss the new notification because its emission is not synchronized with new subscriptions.

dotnet/reactive implements ReactiveSubject with synchronization of both notification emission and subscription.

Easy way to fix that is to put subscription (SubscribeCore) and notification emission (OnNext) under the lock too.

neuecc commented 1 week ago

Thanks for the details, this issue is important and I will check it out right away.

ninjaoxygen commented 1 week ago

Just to confirm we are also seeing what sounds like the same bug. It looks like a really short window of race only on the first OnNext/Value call. We have a test case with just ReactiveProperty -> Select -> Switch. In production we only see the bug around 1 in 200.

In our case, the Select constructs a new ReactiveProperty, subscribes over the network and calls OnNext within ~2ms. If we add a 10ms delay before the OnNext, it solves the problem, but then we need to add more locks to avoid out-of-order delivery.

If we swap the Select for SelectAwait before the Switch the bug goes away.

nepolak commented 1 week ago

Just to confirm we are also seeing what sounds like the same bug. It looks like a really short window of race only on the first OnNext/Value call. We have a test case with just ReactiveProperty -> Select -> Switch. In production we only see the bug around 1 in 200.

In our case, the Select constructs a new ReactiveProperty, subscribes over the network and calls OnNext within ~2ms. If we add a 10ms delay before the OnNext, it solves the problem, but then we need to add more locks to avoid out-of-order delivery.

If we swap the Select for SelectAwait before the Switch the bug goes away.

Seems like race condition also, because OnNext is not synchronized with subscriptions. Current value of ReactiveProperty is emitted to a new observer without a lock, and only subscription is under the lock. It is possible to call OnNext, which doesn't seem to have any synchronization at all, in between these lines and have a notification missing.

nepolak commented 1 week ago

If we swap the Select for SelectAwait before the Switch the bug goes away.

I'd rather reimplement ReactiveProperty with locking in OnNext and SubscribeCore and see if it goes away. I think the problem is in ReactiveProperty and not in Select/SelectAwait or Switch.

nepolak commented 1 week ago

If we swap the Select for SelectAwait before the Switch the bug goes away.

I'd rather reimplement ReactiveProperty with locking in OnNext and SubscribeCore and see if it goes away. I think the problem is in ReactiveProperty and not in Select/SelectAwait or Switch.

Just investigated Switch, it seems fine, I don't see any possibility for race condition there.

neuecc commented 1 week ago

Thank you, once we release the current fix code that includes the ReplaySubject issue, we will look into this issue as soon as possible.

nepolak commented 1 week ago

Thank you, once we release the current fix code that includes the ReplaySubject issue, we will look into this issue as soon as possible.

I've created a pull request for the ReactiveProperty issue