Open matthew-ivester opened 2 weeks ago
That's intriguing (a.k.a. it is not immediately obvious to me why that race condition would exist), and I certainly consider it wrong not to deliver data to readers for which you have already received a "publication matched" notification.
The scenario is very simple, so perhaps I should simply try it myself first. I have plenty of test code I can copy-and-enhance 🙂
No luck yet in reproducing this ...
You can find my attempt at https://github.com/eboasson/cyclonedds/commit/dee288c428ca0791d1420f2926f93db9a60c4f8a. Perhaps you can check whether it looks like it should exhibit the same behaviour?
While trying to clean up and improve some unit tests in code using Cyclone, I came across what looks like buggy (or at least undesirable) behavior around matching of local (i.e., loopback) DataReaders and DataWriters.
The sequence of events is:
1. A `DataWriter` exists for a topic, using a reliable QoS setting (but not `TransientLocal`)
2. A thread waits on a condition variable (CV) for `publication_matched_status().current_count()` to reach 2 or more
3. In the `on_publication_matched()` callback on the `DataWriterListener` we notify the CV to wake up any waiting threads
4. `DataReader`s are created for the topic
5. The first `DataReader` is created, `on_publication_matched()` is triggered on the `DataWriterListener`
6. The waiting thread checks `publication_matched_status.current_count()`, gets `1`, goes back to sleep
7. The second `DataReader` is created, `on_publication_matched()` is triggered on the `DataWriterListener`
8. The waiting thread checks `publication_matched_status.current_count()`, gets `2`
9. The waiting thread calls `write()` on the `DataWriter`
10. Only the first `DataReader` actually receives the sample

This seems like a race condition, as usually it works fine and both readers get the sample. Occasionally a unit test will fail because the last `DataReader` that was created doesn't receive anything. I would expect that `on_publication_matched()` would only be triggered once a write call will successfully deliver to the new `DataReader`. I've only seen this happen when we're creating two or more readers, so there may be some difference in behavior with how the first reader+writer are matched compared to subsequent ones.

For the moment we've worked around this by putting the waiting thread back to sleep for a short time (~5ms) after it sees the matched subscriber count tick up to the threshold value. That seems to fix it. But it would be better if this worked reliably.
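
To make the waiting pattern and the workaround concrete, here is a minimal sketch using only the C++ standard library. It assumes the `on_publication_matched()` callback forwards `current_count()` to a helper. `MatchWaiter`, `on_matched()` and `wait_for_readers()` are hypothetical names for illustration, not part of the Cyclone DDS API, and the settle delay only mirrors the ~5ms mitigation described above rather than fixing the underlying race.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper, not part of Cyclone DDS. The listener's
// on_publication_matched() is assumed to pass current_count() to on_matched().
class MatchWaiter {
public:
    // Record the latest matched-reader count and wake any waiting threads.
    void on_matched(int current_count) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            matched_ = current_count;
        }
        cv_.notify_all();
    }

    // Block until at least `threshold` readers are matched or `timeout` expires.
    // On success, sleep for `settle` before returning (the ~5ms workaround).
    // Returns true if the threshold was reached, false on timeout.
    bool wait_for_readers(int threshold,
                          std::chrono::milliseconds timeout,
                          std::chrono::milliseconds settle = std::chrono::milliseconds(5)) {
        std::unique_lock<std::mutex> lock(mutex_);
        const bool reached =
            cv_.wait_for(lock, timeout, [&] { return matched_ >= threshold; });
        lock.unlock();  // do not hold the mutex while sleeping
        if (reached) {
            std::this_thread::sleep_for(settle);
        }
        return reached;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int matched_ = 0;  // last count reported by the listener
};
```

The writing thread would call `wait_for_readers(2, ...)` and only call `write()` once it returns `true`. The predicate form of `wait_for` handles spurious wakeups and the "gets `1`, goes back to sleep" step.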
I can reproduce this with our product's stack, but I don't currently have a straightforward test that can do so on top of `cyclonedds` alone. I can try to create one if needed, or see if we can get it to occur with more detailed logging enabled in Cyclone.