Open mhl787156 opened 2 years ago
Details from Simon Jones with the DOTS
Multiple issues in trying to get Zenoh to work, primarily focused on trying to get ROS 2 DDS to work only over localhost. See https://github.com/StarlingUAS/ProjectStarling/pull/170 for the current attempt.
Multiple issues again with the following test environment:
(vehicle) Ping monitor <-> Zenoh bridge <-> Zenoh bridge <-> Ping monitor (server)
vehicle:

```shell
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp CYCLONEDDS_URI=fastrtps_profiles.xml ./zenoh-bridge-dds -d 0 -l tcp/0.0.0.0:7447 -f -e tcp/192.168.10.80:7447 --no-multicast-scouting
```

server:

```shell
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp CYCLONEDDS_URI=fastrtps_profiles.xml ./zenoh-bridge-dds -d 0 -l tcp/0.0.0.0:7447 -f --no-multicast-scouting
```
CycloneDDS config file (named fastrtps_profiles.xml for convenience):
```xml
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd">
    <Domain id="any">
        <General>
            <NetworkInterfaceAddress>lo</NetworkInterfaceAddress>
            <AllowMulticast>false</AllowMulticast>
        </General>
        <Discovery>
            <ParticipantIndex>auto</ParticipantIndex>
            <Peers>
                <Peer Address="localhost"/>
            </Peers>
            <MaxAutoParticipantIndex>120</MaxAutoParticipantIndex>
        </Discovery>
    </Domain>
</CycloneDDS>
```
Observations:
`-f` is required for it to work; without `-f` it can't find the topics. Also tried `--group-lease <duration>` with the duration set to 10 (instead of the default of 3), but that drastically reduces the frequency of RTT packets to a fifth of what it should be (from 5 Hz to 1 Hz) for some unknown reason. After several days of testing with no good results, I am binning the use of Zenoh.
Will try using the fast-rtps discovery server as a last-ditch attempt.
Can confirm that the FastDDS discovery server does work, BUT there are two major drawbacks. Clients must set
`ROS_DISCOVERY_SERVER=<server_ip>:11811`
where `server_ip` must be set to the IP address of the server; a hostname (e.g. flyingserver.local) does not work. This means that users must reconfigure this value when switching between Kind and the flight arena. Both of these are highly non-ideal, as it is difficult to produce general configurations, and they require a user to take extra non-intuitive steps which were not previously required.
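For reference, the setup being described looks roughly like this. This is a sketch only: the server IP is reused from the earlier bridge commands as an example, and the `fastdds` CLI flags may differ between Fast DDS versions.

```shell
# On the server machine (example IP, taken from the earlier bridge commands):
fastdds discovery -i 0 -l 192.168.10.80 -p 11811

# On each vehicle, in every shell that launches ROS 2 nodes:
export RMW_IMPLEMENTATION=rmw_fastrtps_cpp
export ROS_DISCOVERY_SERVER=192.168.10.80:11811
```

Note the drawback described above: the `192.168.10.80` value has to change per network, so it cannot live in a single general-purpose configuration.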
@rob-clarke any thoughts on what we might be able to do?
Thoughts for the first point, we can add ROS_DISCOVERY_SERVER to vehicle config so that it can be pre-configured by us for KIND vs Flight Arena.
CLI is still a problem - maybe there needs to be a Debug container which contains the fastrtps config file.
Going back to without discovery server for a bit to work out what is going on.
First, have identified that PX4 broadcasting via MAVROS accounted for a large portion of the broadcast (dest 255.255.255.255) packets on the network (300+ packets/s per vehicle). Compared to the base level of 10-30 packets/s this feels significant, and at the very least it is contributing to quicker multicast degradation, if that is happening.
Have resolved this by disabling broadcast, changing `MAVROS_GCS_URL=udp-pb://@:14550` to `udp://@:14550`. QGC can connect directly to the vehicle using a UDP connection with a base URL of `<vehicle ip>:14555` (the default remote port according to MAVROS).
Second thing identified is that there appears to be double the Vicon traffic compared to what is expected. Currently we expect Vicon to be broadcast at 30 Hz (30 packets/s) from the Vicon machine to 255.255.255.255, and indeed that can be seen in the image below (the yellow line). In the image, at about halfway I switch Vicon on and you see a step in overall traffic (black line, approx 60 packets/s).
The question is where the extra 30 packets/s come from, as they are being sent from 192.168.10.10 to 192.168.10.255; see the logs on the left. On further inspection they look like identification packets from the cameras. Plotting bytes, we see that Vicon requires a roughly 32 kilobyte/s stream.
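As a quick sanity check on these numbers (the packet rate and byte rate are taken from the plots above; the rest is arithmetic):

```shell
# ~60 broadcast packets/s total (30 Hz Vicon data + ~30 Hz camera
# identification packets) carrying a ~32 kB/s stream.
packets_per_s=60
stream_bytes_per_s=32000
# Average packet size implied by the two measurements:
echo "$(( stream_bytes_per_s / packets_per_s )) bytes/packet"   # ~533 bytes
```

An average of roughly half a kilobyte per packet is plausible for pose data plus camera identification messages, so the two measurements are at least consistent with each other.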
Good news is that RTTs from multiple vehicles do stabilise between 1.5 ms and 5 ms, with occasional spikes.
Further to the previous comment, I have been testing system bandwidth and have identified that we are not close to hitting any limits, testing with `iperf3` over the network between devices.
Raspberry Pi network bandwidth looks to be capped between 45 and 55 Mbit/s.
The wired connection is confirmed to be 1 Gbit/s.
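The bandwidth testing above was done along these lines (a sketch: the IP address is an example from this network, and the durations/rates are illustrative `iperf3` options):

```shell
# On the server / ground station:
iperf3 -s

# On a vehicle (Raspberry Pi), measure TCP throughput to the server:
iperf3 -c 192.168.10.80 -t 30

# UDP at a fixed offered rate, closer in character to the broadcast traffic:
iperf3 -c 192.168.10.80 -u -b 50M
```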
Also have verified the router settings, and there appear to be no multicast-related settings to change.
After some reading I have identified that low data rates appear to be detrimental to multicast, so I have enabled only the higher data rates; multicast will, however, be sent at the lowest data rate still enabled. For the future, lots of people recommend multicast-to-unicast support on Wi-Fi routers.
A major problem for multi-robot systems is the lag of the DDS discovery mechanism. Through talks with the Robotics Lab's Dots team, they recommend switching to Zenoh (https://github.com/eclipse-zenoh/zenoh-plugin-dds) to manage multi-robot connections, rather than using ROS 2 DDS directly.
Also: https://zenoh.io/blog/2021-04-28-ros2-integration/
This is a good idea, as it also paves the way for future cloud-based development: Zenoh allows internet-scale approaches.