eProsima / Fast-DDS

The most complete DDS - Proven: Plenty of success cases. Looking for commercial support? Contact info@eprosima.com
https://eprosima.com
Apache License 2.0

Fast DDS 2.14.0: shared memory transport cannot communicate after the process restarts #5053

Closed zhangzhen5729 closed 1 month ago

zhangzhen5729 commented 1 month ago

Is there an already existing issue for this?

Expected behavior

After the process crashes and restarts, the subscriber should correctly receive data on the subscribed topic.

Current behavior

After the process crashes and restarts, the topic subscription is reported as successful, but no data is received.

Steps to reproduce

With the shared memory transport, run one publisher and one subscriber. The subscriber console becomes stuck after a mouse click; the console is then closed and the subscriber is restarted. The topic subscription is reported as successful, but no data is received. The problem does not occur when using UDP.

Fast DDS version/commit

2.14.0, Windows binary installation package downloaded from the official website

Platform/Architecture

Windows 10, Visual Studio 2019

Transport layer

Shared Memory Transport (SHM)

Additional context

FASTDDS 2.14.0

XML configuration file

<?xml version="1.0" encoding="UTF-8" ?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <participant profile_name="HydroTechSurvey">
            <!-- <domainId>4</domainId> -->
            <rtps>
                <name>HydroTechSurvey</name>
                <propertiesPolicy>
                    <properties>
                        <!-- Activate Fast DDS Statistics Module -->
                        <property>
                            <name>fastdds.statistics</name>
                            <value>HISTORY_LATENCY_TOPIC;NETWORK_LATENCY_TOPIC;PUBLICATION_THROUGHPUT_TOPIC;SUBSCRIPTION_THROUGHPUT_TOPIC;RTPS_SENT_TOPIC;RTPS_LOST_TOPIC;HEARTBEAT_COUNT_TOPIC;ACKNACK_COUNT_TOPIC;NACKFRAG_COUNT_TOPIC;GAP_COUNT_TOPIC;DATA_COUNT_TOPIC;RESENT_DATAS_TOPIC;SAMPLE_DATAS_TOPIC;PDP_PACKETS_TOPIC;EDP_PACKETS_TOPIC;DISCOVERY_TOPIC;PHYSICAL_DATA_TOPIC</value>
                        </property>
                    </properties>
                </propertiesPolicy>
            </rtps>
        </participant>
        <data_writer profile_name="datawriter">
            <topic>
                <historyQos>
                    <kind>KEEP_LAST</kind>
                    <depth>1</depth>
                </historyQos>

                <resourceLimitsQos>
                    <max_samples>1</max_samples>
                    <max_instances>1</max_instances>
                    <max_samples_per_instance>1</max_samples_per_instance>
                    <allocated_samples>0</allocated_samples>
                    <extra_samples>10</extra_samples>
                </resourceLimitsQos>
            </topic>

            <qos>
                <reliability>
                    <kind>RELIABLE</kind>
                    <max_blocking_time>
                        <sec>3</sec>
                    </max_blocking_time>
                </reliability>
            </qos>

            <times> <!-- writerTimesType -->
                <initialHeartbeatDelay>
                    <nanosec>12</nanosec>
                </initialHeartbeatDelay>

                <heartbeatPeriod>
                    <sec>3</sec>
                </heartbeatPeriod>

                <nackResponseDelay>
                    <nanosec>5</nanosec>
                </nackResponseDelay>

                <nackSupressionDuration>
                    <sec>0</sec>
                </nackSupressionDuration>
            </times>
            <historyMemoryPolicy>DYNAMIC_REUSABLE</historyMemoryPolicy>

            <matchedSubscribersAllocation>
                <initial>10</initial>
                <maximum>20</maximum>
                <increment>1</increment>
            </matchedSubscribersAllocation>
        </data_writer>
        <data_reader profile_name="datareader">
            <topic>
                <historyQos>
                    <kind>KEEP_LAST</kind>
                    <depth>1</depth>
                </historyQos>

                <resourceLimitsQos>
                    <max_samples>1</max_samples>
                    <max_instances>1</max_instances>
                    <max_samples_per_instance>1</max_samples_per_instance>
                    <allocated_samples>0</allocated_samples>
                    <extra_samples>10</extra_samples>
                </resourceLimitsQos>
            </topic>

            <qos>
                <reliability>
                    <kind>RELIABLE</kind>
                    <max_blocking_time>
                        <sec>15</sec>
                    </max_blocking_time>
                </reliability>
            </qos>

            <times> <!-- readerTimesType -->
                <initialAcknackDelay>
                    <nanosec>70</nanosec>
                </initialAcknackDelay>

                <heartbeatResponseDelay>
                    <nanosec>5</nanosec>
                </heartbeatResponseDelay>
            </times>

            <expectsInlineQos>true</expectsInlineQos>

            <historyMemoryPolicy>DYNAMIC_REUSABLE</historyMemoryPolicy>

            <matchedPublishersAllocation>
                <initial>10</initial>
                <maximum>20</maximum>
                <increment>1</increment>
            </matchedPublishersAllocation>
        </data_reader>
        <topic profile_name="topic">
            <historyQos>
                <kind>KEEP_LAST</kind>
                <depth>1</depth>
            </historyQos>

            <resourceLimitsQos>
                <max_samples>1</max_samples>
                <max_instances>1</max_instances>
                <max_samples_per_instance>1</max_samples_per_instance>
                <allocated_samples>0</allocated_samples>
                <extra_samples>10</extra_samples>
            </resourceLimitsQos>
        </topic>
    </profiles>
</dds>

<?xml version="1.0" encoding="UTF-8" ?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <participant profile_name="hydro_mbgeo_process">
            <!-- <domainId>4</domainId> -->
            <rtps>
                <name>hydro_mbgeo_process</name>
                <propertiesPolicy>
                    <properties>
                        <!-- Activate Fast DDS Statistics Module -->
                        <property>
                            <name>fastdds.statistics</name>
                            <value>HISTORY_LATENCY_TOPIC;NETWORK_LATENCY_TOPIC;PUBLICATION_THROUGHPUT_TOPIC;SUBSCRIPTION_THROUGHPUT_TOPIC;RTPS_SENT_TOPIC;RTPS_LOST_TOPIC;HEARTBEAT_COUNT_TOPIC;ACKNACK_COUNT_TOPIC;NACKFRAG_COUNT_TOPIC;GAP_COUNT_TOPIC;DATA_COUNT_TOPIC;RESENT_DATAS_TOPIC;SAMPLE_DATAS_TOPIC;PDP_PACKETS_TOPIC;EDP_PACKETS_TOPIC;DISCOVERY_TOPIC;PHYSICAL_DATA_TOPIC</value>
                        </property>
                    </properties>
                </propertiesPolicy>
            </rtps>
        </participant>
        <data_writer profile_name="datawriter">
            <topic>
                <historyQos>
                    <kind>KEEP_LAST</kind>
                    <depth>1</depth>
                </historyQos>

                <resourceLimitsQos>
                    <max_samples>1</max_samples>
                    <max_instances>1</max_instances>
                    <max_samples_per_instance>1</max_samples_per_instance>
                    <allocated_samples>0</allocated_samples>
                    <extra_samples>10</extra_samples>
                </resourceLimitsQos>
            </topic>

            <qos>
                <reliability>
                    <kind>RELIABLE</kind>
                    <max_blocking_time>
                        <sec>5</sec>
                    </max_blocking_time>
                </reliability>
            </qos>

            <times> <!-- writerTimesType -->
                <initialHeartbeatDelay>
                    <nanosec>12</nanosec>
                </initialHeartbeatDelay>

                <heartbeatPeriod>
                    <sec>3</sec>
                </heartbeatPeriod>

                <nackResponseDelay>
                    <nanosec>5</nanosec>
                </nackResponseDelay>

                <nackSupressionDuration>
                    <sec>0</sec>
                </nackSupressionDuration>
            </times>

            <historyMemoryPolicy>DYNAMIC_REUSABLE</historyMemoryPolicy>

            <matchedSubscribersAllocation>
                <initial>10</initial>
                <maximum>20</maximum>
                <increment>1</increment>
            </matchedSubscribersAllocation>
        </data_writer>
        <data_reader profile_name="datareader">
            <topic>
                <historyQos>
                    <kind>KEEP_LAST</kind>
                    <depth>1</depth>
                </historyQos>

                <resourceLimitsQos>
                    <max_samples>1</max_samples>
                    <max_instances>1</max_instances>
                    <max_samples_per_instance>1</max_samples_per_instance>
                    <allocated_samples>0</allocated_samples>
                    <extra_samples>10</extra_samples>
                </resourceLimitsQos>
            </topic>

            <qos>
                <reliability>
                    <kind>RELIABLE</kind>
                    <max_blocking_time>
                        <sec>5</sec>
                    </max_blocking_time>
                </reliability>
            </qos>

            <times> <!-- readerTimesType -->
                <initialAcknackDelay>
                    <nanosec>70</nanosec>
                </initialAcknackDelay>

                <heartbeatResponseDelay>
                    <nanosec>5</nanosec>
                </heartbeatResponseDelay>
            </times>

            <expectsInlineQos>true</expectsInlineQos>

            <historyMemoryPolicy>DYNAMIC_REUSABLE</historyMemoryPolicy>

            <matchedPublishersAllocation>
                <initial>10</initial>
                <maximum>20</maximum>
                <increment>1</increment>
            </matchedPublishersAllocation>
        </data_reader>
        <topic profile_name="topic">
            <historyQos>
                <kind>KEEP_LAST</kind>
                <depth>1</depth>
            </historyQos>

            <resourceLimitsQos>
                <max_samples>1</max_samples>
                <max_instances>1</max_instances>
                <max_samples_per_instance>1</max_samples_per_instance>
                <allocated_samples>0</allocated_samples>
                <extra_samples>10</extra_samples>
            </resourceLimitsQos>
        </topic>
    </profiles>
</dds>

Relevant log output

Output without FASTDDS enabled

Network traffic capture

No response

zhangzhen5729 commented 1 month ago

Once the subscription fails, a long list of files appears at this location. Conversely, whenever this long list of files is present, the topic subscription succeeds but no data can be received. (Screenshot: 4b3d346335259ff20a1c78a23914bc3)

elianalf commented 1 month ago

Hi @zhangzhen5729, thanks for using Fast DDS. To keep the application from crashing, you can handle the signal raised when the terminal is closed, as follows:

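#include <csignal>
#include <functional>

// Assigned by the application to the routine that stops it cleanly
std::function<void(int)> stop_app_handler;
void signal_handler(
        int signum)
{
    stop_app_handler(signum);
}
// In the application
signal(SIGTERM, signal_handler);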

You will also need to clean the folder that contains the shared memory files, because it is probably full. Please let us know whether this solves the problem.
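On Windows specifically, the C runtime does not generate SIGTERM when a console window is closed; the system instead delivers CTRL_CLOSE_EVENT to a routine registered with SetConsoleCtrlHandler. The following is a minimal sketch of that approach, not code from Fast DDS. The names g_stop_requested, g_cleanup_done, console_ctrl_handler, posix_signal_handler, and install_stop_handlers are hypothetical, and the 4-second wait is an assumption chosen to stay within the limited time the system grants the handler before it terminates the process.

```cpp
#include <atomic>
#include <chrono>
#include <csignal>
#include <thread>

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#endif

// Hypothetical flags: the handler sets the first one, the application loop
// polls it, and the application sets the second one after deleting its DDS entities.
std::atomic<bool> g_stop_requested{false};
std::atomic<bool> g_cleanup_done{false};

#ifdef _WIN32
// Called by the system on a separate thread. For CTRL_CLOSE_EVENT the process
// is terminated once this function returns, so wait (bounded) for the cleanup.
BOOL WINAPI console_ctrl_handler(DWORD ctrl_type)
{
    switch (ctrl_type)
    {
    case CTRL_C_EVENT:
    case CTRL_BREAK_EVENT:
    case CTRL_CLOSE_EVENT:
        g_stop_requested = true;
        // Assumed budget of about 4 seconds (40 x 100 ms)
        for (int i = 0; i < 40 && !g_cleanup_done; ++i)
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        return TRUE;
    default:
        return FALSE;
    }
}
#else
// POSIX fallback: only store to a lock-free atomic inside the handler.
extern "C" void posix_signal_handler(int)
{
    g_stop_requested = true;
}
#endif

// Registers the platform-specific handler; returns false if registration fails.
bool install_stop_handlers()
{
#ifdef _WIN32
    return SetConsoleCtrlHandler(console_ctrl_handler, TRUE) != 0;
#else
    return std::signal(SIGTERM, posix_signal_handler) != SIG_ERR
           && std::signal(SIGHUP, posix_signal_handler) != SIG_ERR;
#endif
}
```

The application would call install_stop_handlers() during start-up, leave its main loop when g_stop_requested becomes true, delete its DDS entities, and then set g_cleanup_done. A process that is killed outright (for example from Task Manager) never reaches the handler, so this covers only the console-close case.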

zhangzhen5729 commented 1 month ago

@elianalf If the program has no console and runs in the background, how can we solve the problem of releasing resources after an unexpected crash? This situation can also lead to successful topic subscription, but no data can be received.

baynaaMN commented 1 month ago

Hello, I am also experiencing this problem. @elianalf, why do we need to call the fastdds shm clean command separately or manually? Can't this functionality run when the DDS object is released?

elianalf commented 1 month ago

If the program has no console and runs in the background, how can we solve the problem of releasing resources after an unexpected crash?

There are signals for handling all kinds of situations, even if the program has no console.

why do we need to call the fastdds shm clean command separately or manually? Can't this functionality run when the DDS object is released?

Fast DDS already does the cleanup and releases the resources when the application is correctly closed. When the application does not correctly close, and the error signal is not handled, the internal cleanup is not called and a manual cleanup is necessary.
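The distinction between a correct close and an abrupt one can be illustrated with a generic C++ analogy. This is a sketch only and does not reproduce Fast DDS internals; ScopedFile is a hypothetical class that owns a file on disk and removes it in its destructor. Destructors run when the scope is left normally, but not when the process is ended by std::abort or killed by the operating system, which is why the file would be left behind in those cases.

```cpp
#include <filesystem>
#include <fstream>
#include <system_error>
#include <utility>

// Hypothetical stand-in for a resource that is backed by a file on disk.
class ScopedFile
{
public:
    // Creates the file when the resource is acquired.
    explicit ScopedFile(std::filesystem::path path)
        : path_(std::move(path))
    {
        std::ofstream out(path_);
        out << "in use";
    }

    // Removes the file; this only runs if the object is destroyed normally.
    ~ScopedFile()
    {
        std::error_code ec; // ignore errors, destructors should not throw
        std::filesystem::remove(path_, ec);
    }

    ScopedFile(const ScopedFile&) = delete;
    ScopedFile& operator=(const ScopedFile&) = delete;

    const std::filesystem::path& path() const
    {
        return path_;
    }

private:
    std::filesystem::path path_;
};
```

If a ScopedFile is alive when the process is terminated abruptly, its file stays on disk until something else removes it, which mirrors the leftover shared memory files described in this thread.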

zhangzhen5729 commented 1 month ago

If the program crashes unexpectedly and no signal is captured, and the FASTDDS resources are not released, resulting in the inability to receive data after restart, what should be done? Should FASTDDS continue to enhance the fault tolerance of unexpected crashes of SHM communication participants?

zhangzhen5729 commented 1 month ago

@elianalf Hello, under Windows, if the program crashes unexpectedly, how can you capture the process crash or exit signal?

OgreTransporter commented 1 month ago

It is good to know that FastDDS stores files in C:\ProgramData\eprosima\fastrtps_interprocess. It took me a long time to find out why FastDDS suddenly stopped working 👎. The reason was files in this directory that had not been deleted because a programme had crashed. I have developed a very simple hotfix for Windows, which hopefully solves the problem permanently. It does not work for static libraries unless you call the corresponding function yourself. I have added a file dllmain.cpp to the library:

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <filesystem>
#include <string>
#include <system_error>

// Returns true if the file cannot be opened exclusively, which is taken to
// mean that another process still holds it open.
static bool win32_test_file_open(const std::filesystem::path& file)
{
    // Share mode 0 requests exclusive access, so the call fails while any other handle exists
    HANDLE hFile = CreateFileA(file.string().c_str(), GENERIC_READ | GENERIC_WRITE, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return true; // File is open
    else
    {
        // File not open
        CloseHandle(hFile);
        return false;
    }
}

// Deletes every shared memory file that no process holds open, together with
// its "_el" and "_mutex" companion files. ends_with requires C++20.
static void cleanup_eprosima()
{
    std::error_code ec; // Use the non-throwing overloads, since this runs inside DllMain
    std::filesystem::path interprocessdata("C:\\ProgramData\\eprosima\\fastrtps_interprocess");
    if (!std::filesystem::exists(interprocessdata, ec)) return;
    for (const std::filesystem::directory_entry& entry : std::filesystem::directory_iterator(interprocessdata, ec))
    {
        if (!entry.exists(ec)) continue;
        std::filesystem::path file = entry.path();
        if (file.filename().string().ends_with("_el") || file.filename().string().ends_with("_mutex")) continue;
        if (win32_test_file_open(file)) continue;
        // remove() returns false without an error when the file does not exist
        std::filesystem::remove(file.string() + "_el", ec);
        std::filesystem::remove(file.string() + "_mutex", ec);
        std::filesystem::remove(file, ec);
    }
}

BOOL APIENTRY DllMain(HMODULE hModule, DWORD  ul_reason_for_call, LPVOID lpReserved)
{
    switch (ul_reason_for_call)
    {
    case DLL_THREAD_ATTACH:
    case DLL_THREAD_DETACH:
        break;
    case DLL_PROCESS_ATTACH:
    case DLL_PROCESS_DETACH:
        cleanup_eprosima();
        break;
    }
    return TRUE;
}
#endif // _WIN32

The function checks the directory for files. If the files have already been opened by a programme with FastDDS, they are ignored, otherwise they are deleted. As this is executed before the FastDDS code, problematic files are deleted completely.

JesusPoderoso commented 1 month ago

Hi @zhangzhen5729, @baynaaMN, @OgreTransporter. Handling application signals is the responsibility of the application, not the middleware.

@zhangzhen5729

Hello, under Windows, if the program crashes unexpectedly, how can you capture the process crash or exit signal?

The following code is a slightly modified snippet of the signal handling in the Fast DDS (master) hello world example (main.cpp). It applies to Linux, macOS, and Windows:

#include <csignal>
#include <functional>
#include <iostream>
#include <string>

std::function<void(int)> stop_app_handler;
void signal_handler(
        int signum)
{
    stop_app_handler(signum);
}

int main(
        int argc,
        char** argv)
{
    // App initialization
    // ...

    // Implementation of your signal handler
    stop_app_handler = [&](int signum)
        {
            std::cout << "\nSignal #" << std::to_string(signum) << " received, stopping application." << std::endl;
            // Call application destruction methods here
            // ...
        };

    // Examples of handled signals, some of them are not supported in windows
    signal(SIGINT, signal_handler);
    signal(SIGTERM, signal_handler);
#ifndef _WIN32
    signal(SIGQUIT, signal_handler);
    signal(SIGHUP, signal_handler);
#endif // _WIN32 

    // Application loop
    // ...

    return 0;
}

@baynaaMN

why do we need to call the fastdds shm clean command separately or manually?

There is no need if the application is correctly closed. The created files are associated with the identifiers (GUIDs) of the different DDS entities, and their corresponding ports. The newly created entities will not be allowed to overwrite the previous files, even though the identifiers and ports are the same. For that reason, those files should be removed once the entity is removed (task performed if the application is correctly closed).

@zhangzhen5729

If the program crashes unexpectedly and no signal is captured, and the FASTDDS resources are not released, resulting in the inability to receive data after restart, what should be done?

The application is responsible for recovering from unexpected crashes. As part of this recovery process, you should clean up the SHM files left behind by the unexpected termination (with the fastdds shm clean command, which is available on the previously mentioned operating systems: Linux, macOS, and Windows).
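One way to fold this recovery step into the application is to run the cleanup command at start-up, before any DDS entity is created. The following is a minimal sketch under stated assumptions: clean_stale_shm is a hypothetical helper, the fastdds tool is assumed to be on the PATH, and whether it is safe to run the command while other Fast DDS applications are active should be checked against the Fast DDS documentation.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: runs a cleanup command through the system shell and
// reports whether it exited successfully. The default is the
// `fastdds shm clean` command mentioned in this thread (assumed to be on PATH).
bool clean_stale_shm(const std::string& command = "fastdds shm clean")
{
    // std::system(nullptr) returns zero when no command processor is available
    if (std::system(nullptr) == 0)
    {
        return false;
    }
    // A zero status means the command ran and reported success
    return std::system(command.c_str()) == 0;
}
```

A restarted process could call clean_stale_shm() first thing in main and log a warning if it returns false, then continue with participant creation as usual. The command is a parameter so that a supervisor script or a different cleanup tool can be substituted.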

Therefore, I am moving this issue to the Support section according to the Fast DDS CONTRIBUTING guidelines.