eProsima / Fast-DDS

The most complete DDS - Proven: Plenty of success cases. Looking for commercial support? Contact info@eprosima.com
https://eprosima.com
Apache License 2.0
2.12k stars 757 forks source link

Probabilistic exception (nullptr access to RTPSReader listener) when deleting a DataReader #3303

Open i-and opened 1 year ago

i-and commented 1 year ago

Is there an already existing issue for this?

Expected behavior

The destruction of DDS entities is performed correctly

Current behavior

Access by nullptr pointer

Steps to reproduce

Developed a separate test consisting of one DataReader and one DataWriter (each runs in a separate application). On the part of the DataWriter, the partition name is changed every 2-5 ms to ensure that the discovery mechanism is triggered with a high frequency. The running DataReader is created and deleted every 10/100 ms.

Fast DDS version/commit

2.9.1

Platform/Architecture

Ubuntu Focal 20.04 amd64

Transport layer

UDPv4

Additional context

Nullptr access occurs when one of the methods onReaderMatched(...) is called in the file EDP.cpp

//MATCHED AND ADDED CORRECTLY:
if (r.getListener() != nullptr)
{
  MatchingInfo info;
  info.status = MATCHED_MATCHING;
  info.remoteEndpointGuid = writer_guid;
  r.getListener()->onReaderMatched(&r, info);

  const SubscriptionMatchedStatus& sub_info =
       update_subscription_matched_status(readerGUID, writer_guid, 1);
  r.getListener()->onReaderMatched(&r, sub_info);
}

Code analysis showed that the reason for the error is race condition when accessing the pointer to the RTPS listener in EDP against the background of the ~DataReaderImpl() destructor, which performs an unsecured reset of the RTPS listener in the disable() method:

void DataReaderImpl::disable()
{
    set_listener(nullptr);
    if (reader_ != nullptr)
    {
        reader_->setListener(nullptr);
    }
}

The proposed fix is to exclude the disable() call in the destructor so that it looks like this:

DataReaderImpl::~DataReaderImpl()
{
    // assert there are no pending conditions
    assert(read_conditions_.empty());

    // Disable the datareader to prevent receiving data in the middle of deleting it
    // disable();

    stop();

    delete user_datareader_;
}

In this case, all the necessary locks are provided during the execution of the RTPSParticipantImpl::deleteUserEndpoint() method. At the same time, the test also stopped giving an execution error.

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

jsan-rt commented 1 year ago

Hi @i-and,

Thanks for the report.

Improvements to listener safety were done on PR #2898, merged into master after the 2.7.1 version release. Would you mind running the test again with a newer version? Version 2.7.1 (and the 2.7.x branch as a whole) is no longer a supported version (you can check the current supported releases here).

i-and commented 1 year ago

Hi @jsantiago-eProsima,

I ran a test today on version 2.9.1 - the specified error also occurs there as well as on version 2.7.1. Removing the disable() call in DataReaderImpl::~DataReaderImpl() also fixes the error.

JesusPoderoso commented 1 year ago

Hi @i-and

The error has not been able to being replicated. The disable() method should be called to stop listening, so removing it should not be a possible fix.

Would you mind sharing a reproducible example, so we can analyze deeply?

i-and commented 1 year ago

It turned out that the probability of this error in EDP strongly depends on the operating system used (Windows/Linux) and the number of cores in the CPU. In order to achieve a guaranteed fault, I made an additional synchronization in the code. As a result, using the examples/cpp/Filtering/ example, it was possible to stably reproduce the error. To do this, the application ./DDSFiltering publisher is launched first, then another application ./DDSFiltering subscriber fast is launched in another terminal or under the debugger. The figure shows the result of an error on the part of the subscriber. As can be seen from the debugger panel "VARIABLES" r.mp_listener is 0, which leads to Segmentation fault. fault_edp I also attach the modified 3 files. EDP.cpp.txt FilteringExampleSubscriber.cxx.txt FilteringExamplePublisher.cxx.txt

The disable() method should be called to stop listening, so removing it should not be a possible fix.

After recreating this error, I suggest considering fixing it in the way indicated above (exclude the disable() from ~DataReaderImpl()). It seems to me that this method is most preferable for the following reasons:

Mario-DL commented 1 year ago

Hi @i-and, Thanks for the attached files that helped me reproduce the issue. As @JesusPoderoso said, the solution should not be avoiding the disable() call.

Please, check if the issue persists after applying this patch.

TechVortexZ commented 2 months ago

Hi @i-and, Thanks for the attached files that helped me reproduce the issue. As @JesusPoderoso said, the solution should not be avoiding the disable() call.

Please, check if the issue persists after applying this patch.

Hi @Mario-DL We also meet this bug on the v2.12 branch,this bug can be fixed? this is coredump:

image

use gdb show the listener is nullptr: image

Mario-DL commented 2 months ago

Hi @TechVortexZ We have not included a fix for this yet, although we are aware of it. Could you double check if introducing a std::lock_guard<RecursiveTimedMutex> guard(r.getMutex()); here makes any difference ? BTW, 2.12.x branch was EOL some months ago, we encourage you to move on to 2.14.x. It will be also interesting if you could check whether the issue is reproducible there.

i-and commented 1 month ago

Hi @Mario-DL

  1. This patch is not enough, as it does not protect the following code locations in the same method EDP::pairing_writer_proxy_with_any_local_reader(): https://github.com/eProsima/Fast-DDS/blob/092848725b8425e4f05a8ccf7b3b8d513fabf733/src/cpp/rtps/builtin/discovery/endpoint/EDP.cpp#L1517-L1520 https://github.com/eProsima/Fast-DDS/blob/092848725b8425e4f05a8ccf7b3b8d513fabf733/src/cpp/rtps/builtin/discovery/endpoint/EDP.cpp#L1530-L1540
  2. It seems that it would be logical to protect similar three places in the method EDP::pairing_reader_proxy_with_any_local_writer() since it is structurally similar to the method EDP::pairing_writer_proxy_with_any_local_reader() under consideration.
  3. I propose to implement the on_reader_discovery() method call in the methods StatefulWriter::matched_reader_add(), StatefulWriter::matched_reader_remove(), StatelessWriter::matched_reader_add() and StatelessWriter::matched_reader_remove() by analogy, as it is done in the corresponding four methods of the classes StatefulReader and StatelessReader: namely, use the intermediate local variable ReaderListener* listener to call, which is initialized with the mp_listener value on the mutex std::unique_lock<RecursiveTimedMutex> guard(mp_mutex).