eclipse-zenoh / zenoh-plugin-ros2dds

A Zenoh plug-in for ROS2 with a DDS RMW. See https://discourse.ros.org/t/ros-2-alternative-middleware-report/ for the advantages of using this plugin over other DDS RMW implementations.
https://zenoh.io

[Bug] Performance drop routing 1kHz publications since 0.11.0 #192

Closed: franz6ko closed this issue 3 months ago

franz6ko commented 4 months ago

Describe the bug

I've been running a simple test, bridging a single topic at a rate of 1000 Hz using zenoh-bridge-dds and zenoh-bridge-ros2dds between ROS_DOMAIN 0 and 1 locally.

With the DDS bridge I get the full 1000 Hz, but with the ROS2DDS bridge I only get around 50 Hz.

To reproduce

Start 2 zenoh-bridge-dds instances with the following configurations:

[screenshots of the two bridge configurations]

Publish a topic on ROS domain 0 over /position at a rate of 1000 Hz, then run "ros2 topic hz /position".

Repeat the same but with 2 zenoh-bridge-ros2dds instances with the following configurations:

[screenshots of the two bridge configurations; an illustrative sketch follows below]
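
Since the configuration screenshots aren't reproduced here, below is a minimal sketch of what such configs typically look like; endpoints, ports, and modes are illustrative assumptions, not the exact values from the report. One bridge runs in router mode on ROS domain 0 and listens on a local endpoint, the other runs in client mode on ROS domain 1 and connects to it.

```json5
// Bridge A: router mode, bridging ROS_DOMAIN_ID 0 (illustrative values)
{
  mode: "router",
  listen: {
    endpoints: ["tcp/127.0.0.1:7447"],   // local-only endpoint for this test
  },
  plugins: {
    ros2dds: {
      domain: 0,                         // DDS domain of the ROS 2 publisher
    },
  },
}

// Bridge B: client mode, bridging ROS_DOMAIN_ID 1 (illustrative values)
{
  mode: "client",
  connect: {
    endpoints: ["tcp/127.0.0.1:7447"],   // connect to bridge A
  },
  plugins: {
    ros2dds: {
      domain: 1,                         // DDS domain where "ros2 topic hz" runs
    },
  },
}
```

For the zenoh-bridge-dds runs the structure is the same, with the plugin key dds instead of ros2dds.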

System info

Ubuntu 20.04, ROS 2 Foxy

franz6ko commented 4 months ago

I've been running some tests to give more details about this. These are the plots of the "ros2 topic hz" command output in the different scenarios:

ROS2DDS bridge v0.11.0-rc.2 (latest version 1.0.0 has a bug reported here): [plot Test1]

DDS bridge v0.10.1-rc (latest version 1.0.0 has a performance decrease reported here): [plot Test3]

ROS2DDS bridge compiled with branch fix_190 (#191): [plot Test4]

Conclusions:

Some doubts:

JEnoch commented 4 months ago

Thanks for your explanation. I managed to reproduce the issue and I'm analysing.

First, I actually get the same issue with both zenoh-bridge-ros2dds and zenoh-bridge-dds in all 0.11.x and 1.0.0-alpha.x versions. Thus it's a regression introduced in 0.11.0 for both repos.

Digging further, when connecting a "pure" Zenoh subscriber directly to the bridge on the DDS publication side, it receives the publications at 1000 Hz as expected. Then I added an ad-hoc frequency measurement in the bridge on the DDS subscription side, just before the call to DDS write. It confirms that the bridge receives the publications at 1000 Hz and calls the DDS write function at the same frequency.

This means the issue is somewhere in the DDS stacks between the bridge and the DDS Reader. I'm now digging further with the help of the CycloneDDS team.

JEnoch commented 4 months ago

If I put a listen section with a specific network interface in the config file of my router instance, does it ensure all the traffic will occur only over that interface?

If the connect section is also empty, yes. Otherwise, the communication could also go via the outgoing connection, which will use an interface chosen by the OS (depending on the routing tables).

Is there a way to monitor how loaded the Zenoh interfaces are? Some way of monitoring the bandwidth, or whether it's losing packets because it can't keep up, etc.?

We're in the process of building such tools, but they're not yet available.

JEnoch commented 4 months ago

The culprit is commit 7275cc4, which bumps the Zenoh version to a commit just after the change of async library from async-std to tokio. Not sure how tokio could have an effect on CycloneDDS behaviour, but here we are...

Looking at a Wireshark capture, I clearly see a weird behaviour of CycloneDDS when publishing the 1000 Hz traffic received via Zenoh:

With the previous commit, where Zenoh was still using async-std, the DDS publication rate was smooth.

It looks like some scheduling issue preventing CycloneDDS from running smoothly.

franz6ko commented 4 months ago

Hi @JEnoch! Glad to know you've been able to reproduce the issue and find its origin! Thanks for addressing it so quickly.

It's not clear to me then whether this is a problem that could be addressed in this repository or something to fix in CycloneDDS. In the meantime I will test ros2dds with a version prior to 0.11.0, which I haven't done yet.

If the connect section is also empty, yes. Otherwise, the communication could also go via the outgoing connection, and will use an interface chosen by the OS (depending the routing tables)

Is the connect section empty by default? Let me refine my question: using the config files above and only adding this section to the router bridge (and leaving the client as it is), do I ensure that?

[screenshot of the proposed listen section]

Imagine that the IP corresponds to a specific network interface, but the PCs can reach each other through more than one interface, possibly including Wi-Fi. I want to be sure the traffic is limited to a specific one.

And by the way, is there a way of specifying the NIC instead of the IP ?

We're in the process of building such tools, but they're not yet available.

That would be awesome! It would help to get an idea of the performance and limitations, and possibly detect whether there are problems with the network or whether we're reaching the limits of the capabilities of the tool, the network, etc.

JEnoch commented 4 months ago

It's not clear to me then if this is a problem that could be addressed in this repository or if it's something to fix on CycloneDDS

Likely in this repo, possibly in the Zenoh repo. It's not a bug in CycloneDDS, since 0.10.x was working with the exact same CycloneDDS version as 0.11.0. It's up to this bridge to adapt to the CycloneDDS requirements with regard to scheduling.

BTW, I'm taking some vacations now 😎 Some colleagues might have a look while I'm away. If unfortunately they don't have time for it, don't expect a fix before end of August.

Is the connect section empty by default? Let me refine my question: using the config files above and only adding this section to the router bridge (and leaving the client as it is), do I ensure that?

Yes, if connect is empty the bridge will not try to establish an outgoing connection to anything. It still accepts incoming connections via the listen endpoints. By default, connect is empty.

Imagine that the IP corresponds to a specific network interface, but the PCs can reach each other through more than one interface, possibly including Wi-Fi. I want to be sure the traffic is limited to a specific one.

If you leave connect empty and configure listen with one IP that is bound to only one interface, you can be sure that the traffic can only go via this interface, via an incoming connection (i.e. the remote host has to connect to this IP).
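
For example, a router-side config along these lines (IP and port are placeholders) keeps all Zenoh traffic on the interface that the given address is bound to:

```json5
{
  mode: "router",
  // Leave connect empty (the default): no outgoing connections are attempted.
  connect: {
    endpoints: [],
  },
  // Accept incoming connections only on the IP bound to the desired interface.
  listen: {
    endpoints: ["tcp/192.168.1.10:7447"],
  },
}
```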

And by the way, is there a way of specifying the NIC instead of the IP ?

Yes... but only since 0.11.0-rc.1 😕 See this PR: https://github.com/eclipse-zenoh/zenoh/pull/755
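
For completeness, a sketch of what that could look like, assuming the iface endpoint parameter added by that PR (the interface name is a placeholder; check the PR for the exact syntax):

```json5
{
  listen: {
    // Bind the listening endpoint to a NIC by name rather than by IP
    // (endpoint "#iface=" metadata, assuming the syntax from the linked PR).
    endpoints: ["tcp/0.0.0.0:7447#iface=eth0"],
  },
}
```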

evshary commented 3 months ago

Let me provide some experiments I did. I agree that the issue should not come from CycloneDDS. Also, I tried to disable batching with dds_write_set_batch(false), and unsurprisingly it didn't help.

Here are the FlameGraphs I recorded (but on zenoh-bridge-dds 0.10.1 vs. 0.11.0): [perf_0.10.1] [perf_0.11.0]

From the graphs, it seems that 0.11.0 spends most of its time on the two runtimes rx0 and rx1, while 0.10.1 only shares one async-std runtime. Not sure whether that's the root cause or not. BTW, this is the zenoh-bridge-dds on the receiver side.

evshary commented 3 months ago

Also providing the FlameGraphs on the publisher side for more information: 0.10.1: [zenoh-bridge-dds_perf_0.10.1_pub], 0.11.0: [zenoh-bridge-dds_perf_0.11.0_pub]

JEnoch commented 3 months ago

After some vacations 😎 and then further analysis, I found the root cause of the problem and a solution:

Actually the switch from async-std to tokio didn't impact CycloneDDS, but rather the behaviour of Zenoh itself with regard to the batching of messages. Reminder: when a set of small messages is sent within a small time window, Zenoh automatically tries to batch them into a single message. The benefit is less overhead and thus better throughput; the drawback is a small increase in latency for the batched messages.

With async-std the 1 kHz publications were not (or rarely) batched. With tokio the 1 kHz publications are often sent by the 1st bridge (on the ROS 2 Publisher side) in 1 or 2 batches per second of ~1000 or ~500 messages each. This leads the 2nd bridge to re-publish via CycloneDDS at the same pace, in bursts of ~1000 messages each second. Now, as the ros2 topic hz command uses a DDS Reader with its History QoS set to KEEP_LAST(5) (via qos_profile_sensor_data), most of the messages likely overwrite each other within the Reader history cache. Thus, the command reports a lower frequency than what's really received from the network.

The solution: one option could be to totally deactivate the batching via this config, but that would have too much impact on the global throughput between the bridges. A better solution is to allow the user to configure the usage of the Zenoh express policy, which makes Zenoh send the message immediately, without waiting for possible further messages. That's now possible with #217 in the dev/1.0.0 branch.

@franz6ko Your test should work with the latest commit from the dev/1.0.0 branch if you add this configuration to the bridge on the Publisher side: pub_priorities: ["/position=4:express"]
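
For clarity, a sketch of where that option would sit in the Publisher-side bridge config, assuming it goes under the ros2dds plugin section like the other pub_* options:

```json5
{
  plugins: {
    ros2dds: {
      // Route /position with Zenoh priority 4 and the "express" policy,
      // so each sample is sent immediately instead of waiting to be batched.
      pub_priorities: ["/position=4:express"],
    },
  },
}
```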

JEnoch commented 3 months ago

Zenoh has now been fixed (in branch dev/1.0.0) to adapt its batching mechanism to the tokio behaviour: https://github.com/eclipse-zenoh/zenoh/pull/1335
I can confirm that with the default configuration (i.e. without configuring the express flag for publishers), the routing of 1 kHz publications no longer suffers a performance drop.

This repo will be synced to use the fixed Zenoh version in branch dev/1.0.0 in the next few days.

franz6ko commented 3 months ago

Great to hear!! Thanks for your support. I'm happy to have contributed to this issue.