PX4 / PX4-SITL_gazebo-classic

Set of plugins, models and worlds to use with OSRF Gazebo Simulator in SITL and HITL.
http://dev.px4.io/simulation-gazebo.html

ROS2 micrortps_bridge is not scaling to additional vehicles in Gazebo #902

Open jaredsjohansen opened 2 years ago

jaredsjohansen commented 2 years ago

I am trying to control multiple vehicles in Gazebo with a set of ROS2 nodes. Each vehicle is controlled by a set of six ROS2 nodes. I have followed the Multi-Vehicle Simulation with Gazebo tutorial and can launch multiple iris drones. I have also followed the RTPS/DDS Interface: PX4-Fast RTPS(DDS) Bridge tutorial and can control a few (i.e., 1-3) iris drones through the micrortps_bridge with my ROS2 nodes.

However, when I try to scale to additional drones, I begin to run into errors. I will describe (a) my setup, (b) the errors I am experiencing, (c) what I have tried in order to fix those errors, and (d) the results from some interesting experiments that lead me to believe there is an issue with the micrortps_bridge.

Setup

In Terminal 1, I start the PX4-Gazebo simulator:

./gazebo_sitl_multiple_run.sh -n 3 -t px4_sitl_rtps

In Terminal 2, I start the micrortps_agents:

micrortps_agent -t UDP -r 2020 -s 2019 -n vehicle1 -v & 
micrortps_agent -t UDP -r 2022 -s 2021 -n vehicle2 -v & 
micrortps_agent -t UDP -r 2024 -s 2023 -n vehicle3 -v &
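
Since the receive/send ports in these commands simply increase by 2 per vehicle, the agents can also be spawned from a small script; a minimal sketch, assuming micrortps_agent is on the PATH:

# Minimal sketch: spawn one micrortps_agent per vehicle, following the same
# port pattern as the commands above (vehicle i+1 -> receive 2020+2i, send 2019+2i).
import subprocess
NUM_VEHICLES = 3
agents = [
    subprocess.Popen([
        "micrortps_agent", "-t", "UDP",
        "-r", str(2020 + 2 * i),
        "-s", str(2019 + 2 * i),
        "-n", f"vehicle{i + 1}", "-v",
    ])
    for i in range(NUM_VEHICLES)
]
for p in agents:
    p.wait()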

In Terminals 3-5, I run a launch file. This launch file starts six ROS2 nodes that collectively control an individual drone:

AGENT_NAMESPACE=vehicle1 ros2 launch start_system.py
AGENT_NAMESPACE=vehicle2 ros2 launch start_system.py
AGENT_NAMESPACE=vehicle3 ros2 launch start_system.py

The ROS2 nodes communicate with the vehicles through the micrortps_bridge, and everything works fine.

Errors

When I repeat the above steps, modified to handle 4 drones, Terminal 2 begins to print messages like this:

[   micrortps_agent   ] VehicleOdometry publisher matched
[ micrortps__timesync ] RTTI too high for timesync: 103ms
[ micrortps__timesync ] Offset not updated
[ micrortps__timesync ] Timesync offset outlier, discarding
[ micrortps__timesync ] Offset not updated
[ micrortps__timesync ] Timesync offset outlier, discarding
[ micrortps__timesync ] Offset not updated

Terminals 3-5 still run correctly, but Terminal 6 (for drone 4) prints a warning from one of my ROS2 nodes (i.e., swarm_comms) that one of its services (/swarm_comms/swarm_status) is "not available".

Attempted fixes

In urtps_bridge_topics.yaml, there are on the order of 27 messages passed over the micrortps_bridge. If I reduce that to only 10 messages (i.e., the messages my ROS2 nodes use + the timesync and timesync_status messages), and recompile the PX4-Autopilot repo as well as my repo with the px4_msgs and px4_ros_com submodules, I am able to partially remedy the problem described above. Instead of only getting three drones to work, I am now able to get four drones to work.
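
For reference, the trimming just means deleting entries from urtps_bridge_topics.yaml so that only the topics my nodes actually use (plus timesync and timesync_status) remain; an abridged sketch of what the file ends up looking like (structure from memory, entry names illustrative):

rtps:
  - msg: timesync
    send: true
    receive: true
  - msg: timesync_status
    send: true
  - msg: vehicle_odometry
    send: true
  # ... the remaining topics the ROS2 nodes actually use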

If I repeat the steps above, but modified to support five drones, I get the same errors described above. This leads me to believe that there is a relationship between the micrortps_bridge throughput and the number of vehicles supported.

I have gone into the px4_ros_com code and modified the limits/logic associated with the "RTTI too high for timesync" and "timesync offset outlier" errors. Instead of triggering on rtti > 50 ms, I changed the threshold to rtti > 500 ms. I also increased WINDOW_SIZE from 500 to 5000. I no longer get the errors on Terminal 2, but I still get the "/swarm_comms/swarm_status service not available" error.

Experiments

To rule out my computer's resources being overwhelmed by the number of ROS2 nodes and ROS2 messages being passed around, I checked gnome-system-monitor. CPU utilization, memory, and network history all showed low values.

To rule out the swarm_comms ROS2 node being unprepared for /swarm_comms/swarm_status service calls, I added a delay in all other ROS2 nodes to give swarm_comms time to start up before service calls were made. No change in behavior.
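
For reference, a minimal sketch of the standard rclpy client pattern for waiting on this service (Status is the service type used below):

# Simplified sketch of the calling side: wait until the service is advertised,
# then call it asynchronously.
client = self.create_client(Status, 'swarm_comms/swarm_status')
while not client.wait_for_service(timeout_sec=1.0):
    self.get_logger().warning('/swarm_comms/swarm_status service not available, waiting...')
future = client.call_async(Status.Request())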

To rule out the /swarm_comms/swarm_status service being blocked by other callbacks in the same node, I created a new callback group specifically for that service, but this did not change the behavior:

from rclpy.callback_groups import MutuallyExclusiveCallbackGroup
# Give the service its own callback group so other callbacks cannot starve it
service_callback_group = MutuallyExclusiveCallbackGroup()
self.create_service(Status, 'swarm_comms/swarm_status', self.get_swarm_status, callback_group=service_callback_group)
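
For the callback group to actually allow the service to be handled alongside the node's other callbacks, the node also has to be spun by a multi-threaded executor; a minimal sketch of that (SwarmComms is a placeholder for the node class):

import rclpy
from rclpy.executors import MultiThreadedExecutor
rclpy.init()
node = SwarmComms()  # placeholder for the node class that creates the service above
executor = MultiThreadedExecutor()
executor.add_node(node)
executor.spin()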

To rule out ROS2 having some upper limit on the number of ROS2 nodes running simultaneously on the same machine (or something else networking-related), I ran the following experiment: I started the launch files for all of the vehicles without starting any micrortps_agents.

Everything worked fine. There were no errors. The /swarm_comms/swarm_status service call worked properly. ROS2 nodes were able to exchange information intra-vehicle, as well as inter-vehicle, as expected.

To drill into the root cause of the issue, I repeated the steps in the previous paragraph, but this time started the micrortps_agents in Terminal 2. As expected, Terminal 2 printed the same errors as above: "RTTI too high for timesync" and "Timesync offset outlier". When I started Terminal 3 (for drone 1), it ran correctly. When I started Terminal 4 (for drone 2), it printed the "/swarm_comms/swarm_status service not available" error.
Earlier, when I ran these steps (modified for 3 drones in sim + 3 micrortps_agents), everything worked properly. Now, however, with 7 drones in sim + 7 micrortps_agents, only Terminal 3 (for drone 1) works properly. Terminal 4 (for drone 2) produces the "/swarm_comms/swarm_status service not available" error. This leads me to believe that there is a relationship between the micrortps_bridge throughput and the number of vehicles supported.

To further drill into this issue, I modified my launch file (i.e., start_system.py) to start only two ROS2 nodes, swarm_comms + sle (which calls the swarm_comms/swarm_status service), instead of the normal six. In Terminal 1, I started 7 drones in sim. In Terminal 2, I started 7 micrortps_agents. Terminals 3-6 (for drones 1-4) ran successfully. Terminals 7-9 (for drones 5-7) produced the "swarm_comms/swarm_status service not available" error.
I presume that since only two of my ROS2 nodes are running (instead of all six), they are only requesting/sending a subset of the 10 uORB messages specified in urtps_bridge_topics.yaml, resulting in less data being sent over the micrortps_bridge. Since this experiment allowed more drones to work, it further leads me to believe that there is a relationship between the micrortps_bridge throughput and things working properly.

The results from my experiments point to there being a problem with the micrortps_bridge supporting multiple vehicles. I am not sure whether it is the micrortps_client, micrortps_agent or something else.

Anything that can be shared to help me understand how to control 7+ vehicles over the micrortps_bridge with ROS2 would be greatly appreciated!

francescofraternali commented 2 years ago

same issue

roseyanpeng commented 11 months ago

Hello, have you solved this problem? I also encountered a similar problem.