eclipse-cyclonedds / cyclonedds


Cyclone DDS with iox-roudi missing messages in ROS2 Galactic #1458

Open gecastro opened 1 year ago

gecastro commented 1 year ago

Hi there,

I have a Cyclone DDS configuration that uses shared memory (iox-roudi); the configuration file is:

    <Domain id="any">
        <General>
            <AllowMulticast>spdp</AllowMulticast>
            <EnableMulticastLoopback>true</EnableMulticastLoopback>
        </General>
        <SharedMemory>
            <Enable>true</Enable>
            <LogLevel>info</LogLevel>
        </SharedMemory>
    </Domain>
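
A minimal sketch of how a file like this is typically activated, assuming it is saved as /path/to/cyclonedds.xml (a standalone file would normally wrap the Domain element in the usual CycloneDDS root element):

    # point Cyclone DDS at the configuration before starting RouDi and the ROS 2 nodes
    export CYCLONEDDS_URI=file:///path/to/cyclonedds.xml
    iox-roudi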

Then I run the joystick controller and monitor the topic rate:

    ros2 launch teleop_twist_joy teleop-launch.py
    ros2 topic hz /cmd_vel

The joystick controller publishes a relatively large number of small messages. The observed behavior is that after a few seconds I stop seeing any output from the /cmd_vel topic.

When I use iox-introspection-client, I can see the chunks in use increase from 16 (before running ros2 topic hz) to 273 (after running it); that is the moment I stop seeing messages.

I assume that the queue in RouDi is full, but I'm not sure why, since there is only one subscriber. I observe the same behavior with another script where the subscription's queue size has been explicitly set to 1.

I don't observe this issue if I remove the shared memory setting, so I assume the problem is related to the Cyclone DDS/RouDi shared-memory path (I'm using ROS 2 Galactic).

So my questions are:

  • Is there any suggestion to debug this issue?
  • Does Cyclone provide any method to exclude certain topics from using shared memory?

wjbbupt commented 1 year ago

Hello, I see you are using this with ROS. I have a question: are you running on ARM?

eboasson commented 1 year ago

This is quite intriguing, because ros2 topic hz uses the "sensor data" profile (https://github.com/ros2/ros2cli/blob/02e9bb2c5c567aaee9136de97ebca2ca6439c920/ros2topic/ros2topic/verb/hz.py#L270 — I assume this hasn't changed recently), which is a "keep last 5" reader (https://github.com/ros2/rmw/blob/8b5cefaa054d48f5034e2a3495b7bf1115c13088/rmw/include/rmw/qos_profiles.h#L27-L28) so one would not expect so many chunks in use.
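
For readers less familiar with what that profile means at the DDS level, here is a rough sketch of an equivalent reader QoS using the Cyclone DDS C API (best effort, volatile, keep-last 5); the function name is illustrative and participant/topic creation is omitted:

    /* Rough DDS-level equivalent of the ROS 2 "sensor data" profile, sketched
     * with the Cyclone DDS C API. */
    #include "dds/dds.h"

    static dds_entity_t create_sensor_data_reader (dds_entity_t participant, dds_entity_t topic)
    {
      dds_qos_t *qos = dds_create_qos ();
      dds_qset_reliability (qos, DDS_RELIABILITY_BEST_EFFORT, 0);
      dds_qset_durability (qos, DDS_DURABILITY_VOLATILE);
      dds_qset_history (qos, DDS_HISTORY_KEEP_LAST, 5);   /* the "keep last 5" reader */
      dds_entity_t reader = dds_create_reader (participant, topic, qos, NULL);
      dds_delete_qos (qos);
      return reader;
    }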

Perhaps a failure to allocate a chunk (which should never happen, because the history depths are checked) could result in a stall, though then one really needs to know whether it is just the subscriber that has a problem, or also the publisher.

  • Is there any suggestion to debug this issue?

It should be possible to determine which processes are stalled, and that might give a clue as to where the problem is. I could also try reproducing it; that's probably easier than trying to write down what I would do once I have it reproduced 😂 The tricky bit is that attending ROSCon and doing a few touristy things in Kyoto wasn't good for keeping up with issues ... so hopefully it is not urgent.
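
One generic way to see where the processes are stuck (nothing Cyclone DDS specific, and the process-matching patterns below are only examples) is to dump all thread stacks with gdb:

    # attach to the publisher and the subscriber in turn and print all thread backtraces
    gdb -p "$(pgrep -f teleop_twist_joy | head -n1)" -batch -ex "thread apply all bt"
    gdb -p "$(pgrep -f 'ros2 topic hz' | head -n1)" -batch -ex "thread apply all bt"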

  • Does Cyclone provide any method to exclude certain topics from using shared memory?

Not officially; it is something no one has gotten around to yet, but it is obviously useful and not so hard to do. If you need a quick hack to work around the problem, then checking the topic name in https://github.com/eclipse-cyclonedds/cyclonedds/blob/94e520c8cb3ad0e8ec2e27545f8b75ff3e0287b8/src/core/ddsc/src/dds_reader.c#L473 would suffice.
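
To illustrate the kind of check meant here (this is not the actual code at that line; the helper, the hard-coded exclusion list, and the "rt/" ROS topic prefix are only an example): a small predicate consulted where the reader's iceoryx subscriber is set up, falling back to the network path when it returns false.

    /* Hypothetical helper, not part of the Cyclone DDS API: decide per topic
     * name whether iceoryx shared memory should be used. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    static bool use_shm_for_topic (const char *topic_name)
    {
      static const char *excluded[] = { "rt/cmd_vel", NULL };
      for (size_t i = 0; excluded[i] != NULL; i++)
        if (strcmp (topic_name, excluded[i]) == 0)
          return false;
      return true;
    }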

The proper implementation would check whether the topic matches a "network partition" and, if it does, whether the iceoryx "interface" is included in it. It is pretty straightforward for anyone familiar with that part of the code, but not so straightforward if it is all new ...