DataWriter appears to receive notification about new loopback DataReader before it is ready to receive data

matthew-ivester commented 2 weeks ago

While trying to clean up and improve some unit tests in code using Cyclone, I came across what looks like buggy (or at least undesirable) behavior around matching of local (ie, loopback) DataReaders and DataWriters.

The sequence of events is:

A DataWriter exists for a topic, using a reliable QoS setting (but not TransientLocal)
- we create a thread that blocks on a condition variable waiting for publication_matched_status().current_count() to reach 2 or more
- In the on_publication_matched() callback on the DataWriterListener we notify the CV to wake up any waiting threads
Two DataReaders are created for the topic
When the first DataReader is created, on_publication_matched() is triggered on the DataWriterListener
- The waiting thread wakes up, calls publication_matched_status.current_count(), gets 1, goes back to sleep
When the second DataReader is created, on_publication_matched() is triggered on the DataWriterListener
- The waiting thread wakes up, calls publication_matched_status.current_count(), gets 2
The waiting thread continues and calls write() on the DataWriter
Only the first DataReader actually receives the sample

This seems like a race condition, as usually it works fine and both readers get the sample. Occasionally a unit test will fail because the last DataReader that was created doesn't receive anything. I would expect that on_publication_matched() would only be triggered once a write call will successfully deliver to the new DataReader. I've only seen this happen when we're creating two or more readers, so there may be some difference in behavior with how the first reader+writer are matched compared to subsequent ones.

For the moment we've worked around this by putting the waiting thread back to sleep for a short time (~5ms) after it sees the matched subscriber count tick up to the threshold value. That seems to fix it. But it would be better if this worked reliably.

I can reproduce this with our product's stack, but I don't currently have a straightforward test that can do so on top of cyclonedds alone. I can try to create one if needed, or see if we can get it to occur with more detailed logging enabled in Cyclone.

eboasson commented 1 week ago

That's intriguing (a.k.a. it is not immediately obvious to me why that race condition would exist), and I certainly consider it wrong not to deliver data to readers for which you have already received a "publication matched" notification.

The scenario is very simple, so perhaps I should simply try it myself first. I have plenty of test code I can copy-and-enhance 🙂

eboasson commented 3 days ago

No luck yet in reproducing this ...

You can find my attempt at https://github.com/eboasson/cyclonedds/commit/dee288c428ca0791d1420f2926f93db9a60c4f8a, perhaps you can see if it seems like it should exhibit the same behaviour?

eclipse-cyclonedds / cyclonedds

DataWriter appears to receive notification about new loopback DataReader before it is ready to receive data #2130