[Bug] Topic subscription failure with multiple zenoh-bridge-ros2dds peers

siteks commented 8 months ago

Describe the bug

This may be related to this and this.

We have a setup that involves a central server connected to multiple robots, all running ROS2 iron in docker containers. The central server provides some topics to all robots. We also need some robot-to-robot ROS2 communication.

We initially used zenoh-bridge-ros2dds in peer mode at the server and all robots, but experienced non-obvious failures of data transmission on topics between server and robots.

A simplified setup that exhibits the problem is:

server: docker(talker)
server: docker(listener)
server: docker(zenoh-bridge-ros2dds -m peer -l tcp/0.0.0.0:7447)

robot1: docker(zenoh-bridge-ros2dds -m peer -l tcp/0.0.0.0:7447)
robot1: docker(listener)

robot2: docker(zenoh-bridge-ros2dds -m peer -l tcp/0.0.0.0:7447)
robot2: docker(listener)

The server containers are started, then the zenoh containers on the robots. Robot1 listener is started, correctly shows received data, then stopped. Robot2 listener is started, may show data, then is stopped. Robot2 listener is started again, does not show any data.

If the robot zenoh containers are changed to clients, connecting to the server ip address, the failure does not occur.

If the listener on the server is not started, the failure seems to occur very rarely.

To reproduce

We have not managed to reproduce this with composed containers on a single host. Server is running Ubuntu 20.04, robot1 and robot2 are running Ubuntu 22.04. The robots are connected over WiFi. The container simonj23/dots_core:iron is a ROS2 iron distribution with CycloneDDS installed.

Run in all cases with config files in the current directory.

cyclonedds.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd">
    <Domain id="any">
        <General>
            <Interfaces>
                <NetworkInterface name='lo' multicast='true' />
            </Interfaces>
            <DontRoute>true</DontRoute>
        </General>
    </Domain>
</CycloneDDS>

minimal.json5:

{
  plugins: {
    ros2dds: {
      allow: {
          publishers: [ "/chatter", ],
          subscribers: [ "/chatter", ],
      }
    },
  },
}

compose.zenoh_peer.yaml:

services:
  zenoh-peer-ros2dds:
    image: eclipse/zenoh-bridge-ros2dds:0.10.1-rc.2
    environment:
      - ROS_DISTRO=iron
      - CYCLONEDDS_URI=file:///config/cyclonedds.xml
      - RUST_LOG=debug,zenoh::net=trace,zenoh_plugin_ros2dds=trace
    network_mode: "host"
    init: true
    volumes:
      - .:/config
    command:
      - -m peer
      - -l tcp/0.0.0.0:7447
      - -c /config/minimal.json5

On server: start talker

docker run -it --rm --network=host \
-e "RMW_IMPLEMENTATION=rmw_cyclonedds_cpp" \
-e "CYCLONEDDS_URI=file:///config/cyclonedds.xml" \
-v .:/config simonj23/dots_core:iron \
bash -c 'source /opt/ros/iron/setup.bash && ros2 run demo_nodes_cpp talker'

start listener

docker run -it --rm --network=host \
-e "RMW_IMPLEMENTATION=rmw_cyclonedds_cpp" \
-e "CYCLONEDDS_URI=file:///config/cyclonedds.xml" \
-v .:/config simonj23/dots_core:iron \
bash -c 'source /opt/ros/iron/setup.bash && ros2 run demo_nodes_cpp listener'

start zenoh

docker compose -f compose.zenoh_peer.yaml up

On robot1 start zenoh

docker compose -f compose.zenoh_peer.yaml up

On robot2 start zenoh

docker compose -f compose.zenoh_peer.yaml up

On robot1 start then stop listener:

docker run -it --rm --network=host -e "RMW_IMPLEMENTATION=rmw_cyclonedds_cpp" -e "CYCLONEDDS_URI=file:///config/cyclonedds.xml" -v .:/config simonj23/dots_core:iron bash -c 'source /opt/ros/iron/setup.bash && ros2 run demo_nodes_cpp listener'
[INFO] [1709553347.923254612] [listener]: I heard: [Hello World: 236]
[INFO] [1709553348.928124335] [listener]: I heard: [Hello World: 237]
[INFO] [1709553349.925678933] [listener]: I heard: [Hello World: 238]
^C[INFO] [1709553350.409051386] [rclcpp]: signal_handler(signum=2)

On robot2 start then stop listener:

docker run -it --rm --network=host -e "RMW_IMPLEMENTATION=rmw_cyclonedds_cpp" -e "CYCLONEDDS_URI=file:///config/cyclonedds.xml" -v .:/config simonj23/dots_core:iron bash -c 'source /opt/ros/iron/setup.bash && ros2 run demo_nodes_cpp listener'
[INFO] [1709553358.925248190] [listener]: I heard: [Hello World: 247]
[INFO] [1709553359.927279191] [listener]: I heard: [Hello World: 248]
^C[INFO] [1709553360.630004049] [rclcpp]: signal_handler(signum=2)

On robot2 start listener:

docker run -it --rm --network=host -e "RMW_IMPLEMENTATION=rmw_cyclonedds_cpp" -e "CYCLONEDDS_URI=file:///config/cyclonedds.xml" -v .:/config simonj23/dots_core:iron bash -c 'source /opt/ros/iron/setup.bash && ros2 run demo_nodes_cpp listener'

At this point, robot2 no longer gets any data on the chatter topic. The situation can be recovered by restarting the zenoh container on the server.

Log files attached. Server IP address is 192.168.0.70, robot1: 192.168.0.101, robot2: 192.168.0.105.

It appears from the server log file that something may be going wrong with topic unsubscription. When the robot1 listener is stopped (2024-03-04T11:26:40Z), there are two UndeclareSubscriber messages, but when the robot2 listener is stopped (2024-03-04T11:26:58Z), there is only one, and the next subscribe does not succeed.
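To make the comparison between the two stop events easier to repeat, here is a minimal sketch of a shell helper that counts the UndeclareSubscriber lines carrying a given timestamp. The function name `count_undeclare_at` is hypothetical, and the sketch assumes only that each log line contains its timestamp as plain text in the form quoted above; the exact layout of the bridge's log lines is not assumed.

```shell
# Count log lines that mention UndeclareSubscriber and carry a given timestamp.
# Assumption: each log line contains its timestamp as plain text in the form
# quoted in the report (e.g. 2024-03-04T11:26:40Z).
#   $1: path to a log file
#   $2: timestamp string to look for
count_undeclare_at() {
  log_file="$1"
  timestamp="$2"
  # -F treats both patterns as fixed strings, not regular expressions.
  grep -F 'UndeclareSubscriber' "$log_file" | grep -cF "$timestamp"
}
```

Running `count_undeclare_at server_log.txt 2024-03-04T11:26:40Z` and then the same with `2024-03-04T11:26:58Z` should give different counts if the pattern described above is present. Messages logged a second before or after the stop would need their own timestamps checked.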

server_log.txt robot1_log.txt robot2_log.txt

System info

Server: Ubuntu 20.04 arm64
Robots: Ubuntu 22.04 arm64
zenoh-bridge-ros2dds: 0.10.1-rc.2

aosmw commented 8 months ago

I think I may have something similar, a 3 way system. I will check our logs for the UndeclareSubscriber pattern you mention above.

I will include my scenario, as a potential data point, although I have not yet been able to make a cut down reproducible example as you have above. I was attempting to reproduce it with ros2 run demo_nodes_cpp add_two_ints. Maybe the underlying critical factor is the 3 zenoh bridges. I will also give that a go on my setup.

I have 3 bare-metal Ubuntu 22.04 x86_64 Humble systems: currently two development "base stations" connected via a LAN switch, and one "bot" reached over a WiFi link.

We similarly use the allow feature of the config, tailored to each system's function (base station or bot), albeit with more options.

We similarly use the cyclone dds configuration xml file.

At the moment we are developing a lifecycle node on the "bot" that has publishers (vehicle state) and a service (to change vehicle state mode, lights, park brake, etc.).

We have a simple Python GUI running on one of the base stations that subscribes to the vehicle state and can issue service client requests.

The other base station is typically running rviz2 and/or plotjuggler.

When we (run, change, compile, restart) the particular lifecycle node on the "bot" while the GUI on one of the base stations is running, we usually get a reconnection of the subscription to the vehicle state, but very often do not get a re-established connection of the service client from the base station GUI to the bot service server. There are no errors when we make service client requests from the base station; they are just not received by the service server on the bot. We work around this at the moment by restarting the zenoh-bridge-ros2dds service on the "bot". This is not a nice crutch to use.

Some potential ideas:

It feels like the service client on the base station does not know that the service server on the bot has changed.
I also thought that maybe I don't clean up the service nicely/correctly enough on the "bot" for zenoh to be happy.
I have also wondered if there were any magic service re-discovery topics I should also whitelist.

siteks commented 8 months ago

I think I may have something similar, a 3 way system. I will check our logs for the UndeclareSubscriber pattern you mention above.

When we (run, change, compile, restart) the particular lifecycle node on the "bot" while the GUI on one of the base stations is running, we usually get a reconnection of the subscription to the vehicle state, but very often do not get a re-established connection of the service client from the base station GUI to the bot service server. There are no errors when we make service client requests from the base station; they are just not received by the service server on the bot. We work around this at the moment by restarting the zenoh-bridge-ros2dds service on the "bot". This is not a nice crutch to use.

This does sound very similar: topic subscription is apparently successful with no errors, but no data arrives. I now have a reasonable workaround for my purposes: running only one bridge in peer mode, with the others operating in client mode pointing to the IP of the peer, e.g.:

server: docker(talker)
server: docker(listener)
server: docker(zenoh-bridge-ros2dds -m peer -l tcp/0.0.0.0:7447)

robot1: docker(zenoh-bridge-ros2dds -m client -e tcp/192.168.0.70:7447)
robot1: docker(listener)

robot2: docker(zenoh-bridge-ros2dds -m client -e tcp/192.168.0.70:7447)
robot2: docker(listener)

Routing of topics works correctly between both robots and the server. This does require that the server is always running, but this is not a limitation right now for us and I have been running fairly intensive traffic today without obvious failures. Perhaps worth a try.
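For anyone who wants to try this, a minimal sketch of how the per-host bridge arguments could be chosen is below. `bridge_args` is a hypothetical helper, not part of the bridge; it only assembles the `-m`, `-l`, `-e` and `-c` options already shown in this issue, and assumes port 7447 and the `/config/minimal.json5` mount from the compose file above.

```shell
# Build the zenoh-bridge-ros2dds arguments for one host.
# bridge_args is a hypothetical helper name, not part of the bridge.
#   $1: mode, "peer" (the server) or "client" (a robot)
#   $2: server IP address (client mode only)
bridge_args() {
  case "${1:-}" in
    peer)
      # Single peer listening for incoming connections.
      echo "-m peer -l tcp/0.0.0.0:7447 -c /config/minimal.json5"
      ;;
    client)
      if [ -z "${2:-}" ]; then
        echo "bridge_args: client mode needs the server IP" >&2
        return 1
      fi
      # Client connecting to the peer on the server.
      echo "-m client -e tcp/$2:7447 -c /config/minimal.json5"
      ;;
    *)
      echo "bridge_args: unknown mode '${1:-}'" >&2
      return 1
      ;;
  esac
}

# Example on a robot (assumes the image, environment and mount used in
# compose.zenoh_peer.yaml; the substitution is left unquoted on purpose so
# that the options are split into separate arguments):
#   docker run --rm --network=host -v "$PWD":/config \
#     -e ROS_DISTRO=iron -e CYCLONEDDS_URI=file:///config/cyclonedds.xml \
#     eclipse/zenoh-bridge-ros2dds:0.10.1-rc.2 $(bridge_args client 192.168.0.70)
```

Keeping the mode choice in one place makes it harder to accidentally start a second bridge in peer mode, which is the configuration that triggered the failure here.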

agoeckner commented 7 months ago

We are also experiencing this issue. We worked around it by setting our command and control computer as a client which had all robots listed as endpoints.

TomGrimwood commented 4 months ago

I ran into this issue using zenoh-bridge-ros2dds:0.11.0.

However, using zenoh-bridge-ros2dds:nightly (0.11.0-dev-124-ga742b36) (2 July) I do not.

Going back through the Docker images, the 0.11.0-dev-123-ga36b951 image (21 June) also has the issue.
