Closed · franz6ko closed this issue 3 months ago
I've been running some tests to give more details about this. These are the plots of the ros2 topic hz command on the different scenarios:
ROS2DDS bridge v0.11.0-rc.2 (latest version 1.0.0 has a bug reported here)
DDS bridge v0.10.1-rc (latest version 1.0.0 has a performance decrease reported here)
ROS2DDS bridge compiled with branch fix_190 (#191)
Conclusions:
Some doubts:
Thanks for your explanation. I managed to reproduce the issue and I'm analysing.
First, I actually get the same issue with both `zenoh-bridge-ros2dds` and `zenoh-bridge-dds` in all 0.11.x and 1.0.0-alpha.x versions. Thus it's a regression introduced in 0.11.0 for the 2 repos.
Digging further, when connecting a "pure" Zenoh subscriber directly to the bridge on the DDS publication side, it correctly receives the publications at 1000 Hz. Then I added an ad-hoc frequency measurement in the bridge on the DDS subscription side, just before the call to DDS write. It also confirms the bridge receives the publications at 1000 Hz and is calling the DDS write function at the same frequency.
This means the issue is somewhere in the DDS stacks between the bridge and the DDS Reader. I'm now digging further with the help of CycloneDDS team.
If I put a listen section with a specific network interface in the config file of my router instance, does it ensure all the traffic will occur only over that interface?
If the `connect` section is also empty, yes.
Otherwise, the communication could also go via the outgoing connection, which will use an interface chosen by the OS (depending on the routing tables).
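To make this concrete, here's a minimal router config sketch for that setup (the IP address and port are placeholders, not values from this thread): listen only on the address bound to the chosen interface, and keep `connect` empty so no outgoing connection is ever opened:

```json5
{
  mode: "router",
  // Accept incoming connections only on the IP bound to the chosen interface.
  listen: { endpoints: ["tcp/192.168.1.10:7447"] },
  // Leave connect empty so the bridge never opens an outgoing connection
  // over an interface chosen by the OS routing tables.
  connect: { endpoints: [] },
}
```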
Is there a way to monitor how loaded the Zenoh interfaces are? Some way of monitoring the bandwidth, or whether it's losing packets because it can't keep up, etc.?
We're in the process of building such tools, but they're not yet available.
The culprit is commit 7275cc4 which bumps the Zenoh version to a commit just after the change of async library from `async-std` to `tokio`.
Not sure how `tokio` could have an effect on CycloneDDS behaviour, but here we are...
Looking at Wireshark capture, I clearly see a weird behaviour of CycloneDDS when publishing the 1000Hz traffic received via Zenoh:
With the previous commit, with Zenoh using `async-std`, the DDS publication rate was smooth.
It looks like some scheduling issue preventing CycloneDDS from running smoothly.
Hi @JEnoch! Glad to know you've been able to reproduce the issue and detect its origin!! Thanks for addressing it so quickly.
It's not clear to me then whether this is a problem that could be addressed in this repository or something to fix in CycloneDDS. Meanwhile, I will test ros2dds with a version prior to 0.11.0, which I haven't done yet.
> If the `connect` section is also empty, yes. Otherwise, the communication could also go via the outgoing connection, which will use an interface chosen by the OS (depending on the routing tables).
Is the connect section empty by default? Let me improve my question: by using the config files above and only adding this section to the router bridge (and leaving the client as it is), do I ensure it?
Imagine that IP corresponds to a specific network interface, but the PCs can connect to each other through more than one, possibly including Wi-Fi. I want to be sure traffic is limited to a specific one.
And by the way, is there a way of specifying the NIC instead of the IP ?
> We're in the process of building such tools, but they're not yet available.
That would be awesome! It would help to get an idea of the performance and limitations, and eventually detect whether there are problems with the network or whether we're reaching the limits of the capabilities of the tool, the network, etc.
> It's not clear to me then whether this is a problem that could be addressed in this repository or something to fix in CycloneDDS
Likely in this repo. Possibly in Zenoh repo.
That's not a bug in CycloneDDS, since 0.10.x was working with the exact same CycloneDDS version as 0.11.0. It's up to this bridge to adapt to the CycloneDDS requirements wrt. scheduling.
BTW, I'm taking some vacations now 😎 Some colleagues might have a look while I'm away. If unfortunately they don't have time for it, don't expect a fix before end of August.
> Is the connect section empty by default? Let me improve my question: by using the config files above and only adding this section to the router bridge (and leaving the client as it is), do I ensure it?
Yes, if `connect` is empty the bridge will not try to establish an outgoing connection to anything. Still, it accepts incoming connections via the `listen` endpoints.
By default `connect` is empty.
> Imagine that IP corresponds to a specific network interface, but the PCs can connect to each other through more than one, possibly including Wi-Fi. I want to be sure traffic is limited to a specific one.
If you leave `connect` empty and configure `listen` with 1 IP that is bound to only 1 interface, you're sure the traffic can only go via this interface, over an incoming connection (i.e. the remote host has to `connect` to this IP).
> And by the way, is there a way of specifying the NIC instead of the IP?
Yes... but only since 0.11.0-rc.1 😕
See this PR: https://github.com/eclipse-zenoh/zenoh/pull/755
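As a rough sketch of what that PR enables (the endpoint `#iface=` parameter syntax and the interface name `eth0` are assumptions for illustration, not taken from this thread; check the PR for the exact form):

```json5
{
  // Bind the listening endpoint to a NIC by name instead of by IP,
  // using the interface parameter added by the PR above.
  listen: { endpoints: ["tcp/0.0.0.0:7447#iface=eth0"] },
  connect: { endpoints: [] },
}
```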
Here are some experiments I did. I agree that the issue should not come from CycloneDDS. Also, I tried to disable batching with `dds_write_set_batch(false)`, and unsurprisingly it didn't help.
Here are the FlameGraphs I recorded (but on zenoh-bridge-dds 0.10.1 vs. 0.11.0).
From the graphs, it seems 0.11.0 spends most of its time on both runtime threads rx0 and rx1, while 0.10.1 shares a single `async-std` runtime. Not sure whether that's the root cause or not. BTW, this is the zenoh-bridge-dds on the receiver side.
Also providing the FlameGraphs on the publisher side for more information: 0.10.1, 0.11.0.
After some vacations 😎 and then further analysis, I found the root cause of the problem and a solution:
Actually the switch from `async-std` to `tokio` didn't impact CycloneDDS, but rather the behaviour of Zenoh itself with regard to batching of messages.
Reminder: when a set of small messages is sent within a small time window, Zenoh automatically tries to batch them into a single message. The benefit is less overhead and thus better throughput. The drawback is a small increase in latency for the batched messages.
With `async-std`, the 1 kHz publications were not (or rarely) batched.
With `tokio`, the 1 kHz publications are often sent by the 1st bridge (on the ROS 2 Publisher side) in 1 or 2 batches per second of ~1000 or ~500 messages. This leads the 2nd bridge to re-publish via CycloneDDS at the same pace, in bursts of ~1000 messages each second.
Now, as the `ros2 topic hz` command uses a DDS Reader with History QoS set to KEEP_LAST(5) (via `qos_profile_sensor_data`), most of the messages are likely overwriting each other within the Reader's history cache. Thus, the command reports a lower frequency than what's really received from the network.
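The interaction between bursty delivery and a KEEP_LAST(5) history can be illustrated with a small simulation (plain Python standing in for the DDS reader cache; the burst sizes are illustrative, not measured values from this issue):

```python
from collections import deque

DEPTH = 5  # KEEP_LAST(5), as used by qos_profile_sensor_data

def observed_count(arrival_bursts):
    """Count how many samples a polling reader sees when messages
    arrive in bursts and the history cache keeps only DEPTH samples."""
    cache = deque(maxlen=DEPTH)  # older samples are silently overwritten
    seen = 0
    for burst_size in arrival_bursts:
        cache.extend(range(burst_size))  # whole burst arrives between two reads
        seen += len(cache)               # the reader takes everything available
        cache.clear()
    return seen

# Smooth delivery: 1000 msgs/s arriving one at a time -> all observed.
print(observed_count([1] * 1000))  # 1000
# Bursty delivery: the same 1000 msgs/s arriving as 2 bursts of 500.
print(observed_count([500, 500]))  # 10 -- most samples were overwritten
```

The network delivers the same 1000 messages per second in both cases; only the arrival pattern changes, which is why `ros2 topic hz` under-reports.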
The solution:
A solution could be to totally deactivate the batching via this config. But that would have too much impact on the global throughput between the bridges.
A better solution is to allow the user to configure the usage of the Zenoh `express` policy, which makes Zenoh send the message immediately, without waiting for possible further messages. That's now possible with #217 in the `dev/1.0.0` branch.
@franz6ko Your test should work with the latest commit from the `dev/1.0.0` branch if you add this configuration to the bridge on the Publisher side:

```json5
pub_priorities: ["/position=4:express"]
```
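For context, in a full bridge config file this setting sits under the ros2dds plugin section. A minimal sketch (the topic name comes from this test; the domain value and surrounding structure are assumptions to show placement, not a verified full config):

```json5
{
  plugins: {
    ros2dds: {
      // ROS domain of the bridged DDS network (assumed value for this test)
      domain: 0,
      // Route /position with Zenoh priority 4 and the express policy,
      // so each sample is sent immediately instead of waiting to be batched.
      pub_priorities: ["/position=4:express"],
    },
  },
}
```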
Zenoh has now been fixed (in branch `dev/1.0.0`) to adapt its batching mechanism to the `tokio` behaviour:
https://github.com/eclipse-zenoh/zenoh/pull/1335
I can confirm that with the default configuration (i.e. without configuring the `express` flag for publishers), the routing of 1 kHz publications no longer suffers a performance drop.
This repo will be synced to use the fixed Zenoh version from branch `dev/1.0.0` in the next few days.
Great to hear!! Thanks for your support. I'm happy to have contributed to this issue.
Describe the bug
I've been doing some simple tests, bridging only 1 simple topic at a rate of 1000 Hz using zenoh-bridge-dds and zenoh-bridge-ros2dds between ROS domains 0 and 1 locally.
With the DDS bridge I get the 1000 Hz, but with the ROS2DDS bridge I only get around 50 Hz.
To reproduce
Start 2 zenoh-bridge-dds instances with the following configurations:
Publish a topic on ROS domain 0 over /position at a rate of 1000 Hz, then run `ros2 topic hz /position` (in ROS domain 1).
Repeat the same but with 2 zenoh-bridge-ros2dds instances with the following configurations:
System info
Ubuntu 20.04, ROS 2 Foxy