eclipse-zenoh / zenoh-plugin-ros2dds

A Zenoh plug-in for ROS2 with a DDS RMW. See https://discourse.ros.org/t/ros-2-alternative-middleware-report/ for the advantages of using this plugin over other DDS RMW implementations.
https://zenoh.io

[Bug] Transient local messages not cached for multiple publishers #333

Open ottojo opened 2 weeks ago

ottojo commented 2 weeks ago

Describe the bug

On a local ROS 2 setup, when subscribing to a topic with transient local durability, a node receives the cached message from every transient local publisher on the topic. When subscribing to the same topic over the zenoh bridge (version 1.0.2), only the latest message on the topic is published.

This causes problems, for example, when a node publishes a static transform on /tf_static: the static transforms published earlier by the robot state publisher then become unavailable.

To reproduce

  1. Start zenoh bridge on two hosts
  2. On the first host, start two nodes publishing once with transient local durability, such as ros2 run tf2_ros static_transform_publisher --frame-id map --child-frame-id a and ros2 run tf2_ros static_transform_publisher --frame-id map --child-frame-id b
  3. On the first host, observe two messages being received by ros2 topic echo /tf_static
  4. On the second host, observe only the message from the node which started last being received by ros2 topic echo /tf_static

System info

Ubuntu 22.04 in docker on both hosts, using zenoh-bridge-ros2dds standalone executable version 1.0.2 installed in binary form from the private deb repository

JEnoch commented 6 days ago

I tried to reproduce your issue, but I do get 2 messages when running ros2 topic echo /tf_static on the second host. Here are my exact commands (adding some debug logging for the bridges):

For bridge on Host 1, I see such logs:

2024-11-26T15:19:37.781750Z DEBUG tokio-runtime-worker ThreadId(05) zenoh_plugin_ros2dds::route_publisher: Route Publisher (/tf_static -> tf_static): creation with type tf2_msgs/msg/TFMessage
2024-11-26T15:19:37.781912Z DEBUG tokio-runtime-worker ThreadId(05) zenoh_plugin_ros2dds::route_publisher: Route Publisher (/tf_static -> tf_static): caching TRANSIENT_LOCAL publications via a PublicationCache with history=10 (computed from Reader's QoS: history=(KEEP_LAST,1), durability_service.max_instances=-1)
2024-11-26T15:19:37.782256Z DEBUG tokio-runtime-worker ThreadId(05) zenoh_plugin_ros2dds::route_publisher: Route Publisher (/tf_static -> tf_static): congestion_ctrl Block, priority Data, express:false
2024-11-26T15:19:37.782734Z DEBUG tokio-runtime-worker ThreadId(05) zenoh_plugin_ros2dds::route_publisher: Route Publisher (ROS:/tf_static -> Zenoh:tf_static) now serving local nodes {"/static_transform_publisher_pJEDPKxne1WDnuER"}
...
2024-11-26T15:19:43.998629Z DEBUG tokio-runtime-worker ThreadId(09) zenoh_plugin_ros2dds::route_publisher: Route Publisher (ROS:/tf_static -> Zenoh:tf_static) now serving local nodes {"/static_transform_publisher_YEvYvsQEIl2rqKzM", "/static_transform_publisher_pJEDPKxne1WDnuER"}

This means the bridge discovered the 1st Publisher on /tf_static with QoS TRANSIENT_LOCAL and KEEP_LAST(1).

By design the bridge creates only 1 route per topic, with an associated PublicationCache for TRANSIENT_LOCAL support. When a remote bridge discovers a Subscriber, it queries historical publications from this cache. By default the bridge sizes the cache to hold history_length * transient_local_cache_multiplier messages, where transient_local_cache_multiplier is configurable and set to 10 by default.
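To make the sizing rule concrete, here is a minimal Python sketch of the arithmetic described above. The function name cache_size is hypothetical and is not part of the plugin; it only reproduces the product history_length * transient_local_cache_multiplier, which for KEEP_LAST(1) and the default multiplier gives the history=10 reported in the Host 1 log.

```python
def cache_size(history_length: int, multiplier: int = 10) -> int:
    """Samples the per-topic cache is sized for, per the rule quoted above.

    Hypothetical helper for illustration only; not the plugin's code.
    """
    if history_length < 1 or multiplier < 1:
        raise ValueError("history_length and multiplier must be >= 1")
    return history_length * multiplier


if __name__ == "__main__":
    # KEEP_LAST(1) with the default multiplier of 10
    print(cache_size(1))
```

Since there is only 1 route (and so 1 cache) per topic, this budget is shared by all Publishers on that topic.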

Note: writing this I realized that the transient_local_cache_multiplier config was not documented... #342 fixes this.

The last line is the discovery of the 2nd Publisher for which the same route and PublicationCache is used.

For bridge on Host 2, I see such logs:

2024-11-26T15:13:00.376775Z DEBUG tokio-runtime-worker ThreadId(04) zenoh_plugin_ros2dds::route_subscriber: Route Subscriber (Zenoh:tf_static -> ROS:/tf_static) now serving local nodes {"/_ros2cli_37149"}
2024-11-26T15:13:00.376818Z DEBUG tokio-runtime-worker ThreadId(04) zenoh_plugin_ros2dds::route_subscriber: Route Subscriber (Zenoh:tf_static -> ROS:/tf_static) activate
2024-11-26T15:13:00.376845Z DEBUG tokio-runtime-worker ThreadId(04) zenoh_plugin_ros2dds::route_subscriber: Route Subscriber (Zenoh:tf_static -> ROS:/tf_static): query historical messages from everybody for TRANSIENT_LOCAL Reader on @/*/@ros2_pub_cache/tf_static
2024-11-26T15:13:00.379572Z TRACE                 rx-0 ThreadId(13) zenoh_plugin_ros2dds::route_subscriber: Route Subscriber (Zenoh:tf_static -> ROS:/tf_static): routing message - 92 bytes
2024-11-26T15:13:00.379628Z TRACE                 rx-0 ThreadId(13) zenoh_plugin_ros2dds::route_subscriber: Route Subscriber (Zenoh:tf_static -> ROS:/tf_static): routing message - 92 bytes

This means the Host 2 bridge does get 2 messages (92 bytes each) on topic /tf_static from the Host 1 bridge's cache and does route those 2 messages to the Subscriber (the ros2 topic echo command).

Could you please check if you get the same logs and behaviour?

Note that if your system on Host 1 has more than 10 Publishers on the /tf_static topic, you need to increase this transient_local_cache_multiplier config value so that all the publications fit in the PublicationCache.
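As a rough illustration of why more than 10 Publishers overflow the default cache, the following Python sketch models the shared cache as a bounded FIFO. This is a simplified model under stated assumptions (each Publisher publishes exactly once, and the oldest sample is evicted first); the helpers surviving_publications and min_multiplier are hypothetical and do not reflect the plugin's actual implementation.

```python
import math
from collections import deque


def surviving_publications(num_publishers: int,
                           history_length: int = 1,
                           multiplier: int = 10) -> list[str]:
    """Simplified model: one bounded cache shared by all Publishers of a topic."""
    cache = deque(maxlen=history_length * multiplier)
    for i in range(num_publishers):
        cache.append(f"pub_{i}")  # each Publisher publishes once (assumption)
    return list(cache)


def min_multiplier(num_publishers: int,
                   msgs_per_publisher: int = 1,
                   history_length: int = 1) -> int:
    """Smallest multiplier for which every publication fits in the cache."""
    return math.ceil(num_publishers * msgs_per_publisher / history_length)


if __name__ == "__main__":
    kept = surviving_publications(12)
    print(len(kept), kept[0])   # with the default multiplier, the oldest two are dropped
    print(min_multiplier(12))   # multiplier needed to keep all 12
```

In this model, 12 Publishers with KEEP_LAST(1) need a multiplier of at least 12 for a late-joining Subscriber to receive every static transform.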

muellerbernd commented 5 days ago

With two hosts everything works fine. But with 3 hosts it's not working on my side, as mentioned in #219. My use case:

All the hosts are connected via wifi. Same output with this here as host 1: