ciandonovan opened this issue 9 months ago (status: Open)
I reproduce a similar issue with ROS 2 video streaming over WiFi via UDP:
ros2 run v4l2_camera v4l2_camera_node
RUST_LOG=zenoh_transport=debug zenoh-bridge-ros2dds -l udp/0.0.0.0:7447
RUST_LOG=zenoh_transport=debug zenoh-bridge-ros2dds -l udp/0.0.0.0:7447
ros2 run rqt_image_view rqt_image_view
As soon as rqt_image_view subscribes to /image_raw, the following logs appear for the bridge on laptop 1:
[2024-01-04T16:51:09Z DEBUG zenoh_transport::unicast::universal::link] Expected SN 147887862, received 147887863 at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/common/defragmentation.rs:68.
[2024-01-04T16:51:09Z DEBUG zenoh_transport::unicast::universal::transport] [d7eaa1bb7fd74a8ec8680d91acc8865c] Closing transport with peer: 68a0035cb1f9064dc7e7fac3d134876
[2024-01-04T16:51:09Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:51329, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.
[2024-01-04T16:51:10Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:52615, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.
Sometimes:
[2024-01-04T16:28:43Z DEBUG zenoh_transport::unicast::universal::link] Transport: 9f4fd272bb565b5d9834e8b4342cdc3e. Defragmentation error. at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/universal/rx.rs:153.
[2024-01-04T16:28:43Z DEBUG zenoh_transport::unicast::universal::transport] [e903da926a8f398346b8d7e56dd2ef83] Closing transport with peer: 9f4fd272bb565b5d9834e8b4342cdc3e
[2024-01-04T16:28:43Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:56373, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.
[2024-01-04T16:28:44Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:55246, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.
It seems that some UDP frames are lost or received malformed, which is usual over WiFi (there are collisions, and UDP is not reliable). Still, this makes Zenoh close the connection. Moreover, the remote bridge seems not to be aware of this closure (close message lost?), and the reconnection is refused.
@Mallets : I'm not sure this is the behaviour we want for Zenoh over UDP for non-reliable publications. I think the loss of fragments should not lead to a disconnection, but just to dropping the message, right?
@ciandonovan : with significant traffic over WiFi there is always some UDP frame loss. Zenoh doesn't yet implement a reliability protocol over the UDP transport, meaning even DDS RELIABLE topics won't actually be reliable when routed by the zenoh-bridge-ros2dds over UDP.
If you need reliability, you should use TCP or QUIC instead of UDP for the time being.
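Concretely, switching transports is just a matter of changing the scheme in the bridge's locators; a sketch, assuming the same default port 7447 used in the commands above:

```shell
# Listening side: replace udp with tcp (or quic) in the locator.
zenoh-bridge-ros2dds -l tcp/0.0.0.0:7447

# Connecting side, pointing at the listener's address.
zenoh-bridge-ros2dds -e tcp/{BRIDGE_IP}:7447

# QUIC variant (requires TLS certificates to be configured):
zenoh-bridge-ros2dds -l quic/0.0.0.0:7447
```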
@JEnoch: thanks for that insight, I will experiment with QUIC. TLS/mTLS is a requirement for that though, right? Currently not using it with TCP as it's already wrapped in a WireGuard VPN.
> @Mallets : I'm not sure this is the behaviour we want for Zenoh over UDP for non-reliable publications. I think the loss of fragments should not lead to a disconnection, but just to dropping the message, right?
This sounds ideal. I don't personally need reliability over Zenoh, even for DDS RELIABLE topics, as that reliability is set for intra-robot communication; the Zenoh bridge is used for real-time remote monitoring, where latency is more important.
The reason I was experimenting with UDP is that I discovered Zenoh through this blog https://zenoh.io/blog/2021-09-28-iac-experiences-from-the-trenches/, where UDP was used. Maybe the Cisco Ultra-Reliable Wireless Backhaul (CURWB) is good enough compared to WiFi that this issue doesn't arise?
Does QUIC solve the head-of-line blocking issue that TCP has for Zenoh here too? As in, a larger, slower topic being retransmitted won't hold up other high-frequency, low-bandwidth topics, since they'd be on separate streams?
I've found anecdotally that the robot is much less responsive to /joy commands (a couple of kilobytes) when run alongside a couple of megabytes of /image topics, despite there being significant bandwidth remaining. Naturally there will always be some decrease, but I'm wondering if it's exacerbated by TCP compared to QUIC?
> will experiment with QUIC - TLS/mTLS is a requirement for that though right?
Unfortunately, yes: TLS is required by QUIC. But you could just use the same self-signed certificate for all.
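As a sketch of that workaround: a single self-signed certificate can be generated once with openssl and reused on every bridge. The file names and the CN value here are arbitrary choices, not anything the bridge mandates; how the bridge is pointed at them (via its configuration file) is left out.

```shell
# Generate a self-signed certificate and private key, valid for one year.
# The CN is arbitrary, since every bridge will share this same certificate.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout key.pem -out cert.pem -subj "/CN=zenoh-bridge"
```

The resulting cert.pem and key.pem can then be distributed to all hosts running a bridge.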
> Maybe the Cisco Ultra-Reliable Wireless Backhaul is good enough compared to WiFi that this issue doesn't arise?
Possibly. But I also think that in the case of the Indy Autonomous Challenge, they don't have data big enough to need fragmentation when routed over Zenoh. The problem I see (the connection closing and being unable to reconnect) is tied to fragmentation, in the case of missing fragments.
> Does QUIC solve the head-of-line issue with TCP for Zenoh here too?
Probably not yet. As I understand it, QUIC improves HOL issues when several streams are used within the same QUIC connection: HOL blocking can still occur within a stream, but it won't affect the other streams. Zenoh uses only 1 bi-directional stream so far. We would need to use several, and to bind them to priority levels (binding per key-expression is not an option, since it would likely hit some maximum number of streams). Then the bridge would need to map the DDS Priority QoS to a Zenoh priority. Finally, you would need to make sure your ROS nodes use different Priority QoS for the relevant topics.
That would indeed be a nice evolution to implement. I suggest you first run some tests with the current QUIC implementation, and let us know if you still see HOL blocking. We'll then consider adding this to the short-term roadmap.
Describe the bug
No issues with TCP, but with the exact same configuration over UDP I get about a second or two of streaming, followed by a barrage of messages saying "Remote bridge {GUID} retires {Publisher/Service/Action/etc.}" and then "Route Publisher (ROS:/{TOPIC} -> Zenoh:{TOPIC}) removed".
Connectivity isn't an issue, since just replacing udp with tcp in the command argument makes everything work fine. CycloneDDS is configured on the localhost only, with loopback multicast force-enabled.
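A localhost-only CycloneDDS setup of that kind can be expressed inline via the CYCLONEDDS_URI environment variable; this is an illustrative sketch (the loopback interface name "lo" is an assumption based on the Linux default), not the exact configuration used here:

```shell
# Restrict CycloneDDS to the loopback interface and force multicast on it.
# Assumes the loopback interface is named "lo" (Linux default).
export CYCLONEDDS_URI='<CycloneDDS><Domain><General>
  <Interfaces>
    <NetworkInterface name="lo" multicast="true"/>
  </Interfaces>
</General></Domain></CycloneDDS>'
```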
For extra context, running around 60 nodes with 130 topics on a single PC, a lot from the Nav2 stack. WiFi bandwidth at least 150 Mbit/s. When streaming over TCP, around 80 Mbit/s down. Running in Podman OCI containers for convenience, but previously reproduced outside of containers too. Devices both on the same LAN.
To reproduce
-l udp/0.0.0.0:7447
-e udp/{BRIDGE_IP}:7447
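Spelled out as full commands (a sketch built from the flags above; {BRIDGE_IP} stands for the listening bridge's address):

```shell
# Bridge on the listening side:
zenoh-bridge-ros2dds -l udp/0.0.0.0:7447

# Bridge on the connecting side, where {BRIDGE_IP} is the listener's address:
zenoh-bridge-ros2dds -e udp/{BRIDGE_IP}:7447
```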
System info