Title: Scaling micro-rtps agents to 4+ vehicles

francescofraternali commented 2 years ago

Description: We want to run a multi-vehicle PX4-Gazebo simulation where each drone is in offboard mode and controlled via ROS2 nodes. We have used the ROS2 offboard control example and the microRTPS bridge instructions to set up a microRTPS bridge to ROS2. We have a working demo in which two ROS2 nodes work together to control a single vehicle. The first node (brain) directs the second node (offboard_control) to perform the following sequence of actions: arm, take off, switch to offboard mode, fly to several waypoints, land. This sequence works with 1, 2, and 3 vehicles in the simulation (all flying simultaneously). However, as we add more vehicles, we begin to run into issues.

In all cases, we perform the following sequence of events. We start the PX4-Gazebo simulator (in this command, with three vehicles):

./gazebo_sitl_multiple_run.sh -n 3 -t px4_sitl_rtps

We start micrortps_agents (in this case, for three vehicles):

micrortps_agent -t UDP -r 2020 -s 2019 -n vehicle1 -v &
micrortps_agent -t UDP -r 2022 -s 2021 -n vehicle2 -v &
micrortps_agent -t UDP -r 2024 -s 2023 -n vehicle3 -v &

We open terminals, each of which starts the two ROS2 nodes mentioned earlier (i.e., brain, and offboard_control). In this case, there are three terminals for three vehicles:

AGENT_NAMESPACE=vehicle1 SIM=1 ros2 launch drone_launch.py
AGENT_NAMESPACE=vehicle2 SIM=1 ros2 launch drone_launch.py
AGENT_NAMESPACE=vehicle3 SIM=1 ros2 launch drone_launch.py
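
The per-vehicle commands above follow a regular pattern: the namespace is vehicleN, and both UDP ports go up by 2 for each additional vehicle. As a minimal sketch (the helper names `vehicle_commands` and `all_commands` are hypothetical, and continuing the port pattern beyond the four vehicles shown in this issue is an assumption), the command lines could be generated like this:

```python
from typing import Dict, List

BASE_RECV_PORT = 2020  # -r value used for vehicle1 in this issue
BASE_SEND_PORT = 2019  # -s value used for vehicle1 in this issue


def vehicle_commands(index: int) -> Dict[str, str]:
    """Return the agent and launch command lines for a 1-based vehicle index."""
    if index < 1:
        raise ValueError("vehicle index starts at 1")
    step = 2 * (index - 1)  # each extra vehicle shifts both ports by 2
    namespace = f"vehicle{index}"
    return {
        "agent": (
            f"micrortps_agent -t UDP -r {BASE_RECV_PORT + step} "
            f"-s {BASE_SEND_PORT + step} -n {namespace} -v"
        ),
        "launch": f"AGENT_NAMESPACE={namespace} SIM=1 ros2 launch drone_launch.py",
    }


def all_commands(num_vehicles: int) -> List[Dict[str, str]]:
    """Return the command lines for vehicles 1..num_vehicles."""
    return [vehicle_commands(i) for i in range(1, num_vehicles + 1)]


if __name__ == "__main__":
    # Print the commands only; starting the processes is left to the user.
    for cmds in all_commands(4):
        print(cmds["agent"])
        print(cmds["launch"])
```

This only prints the strings, so the ports and namespaces can be checked before anything is started.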

When we repeat these steps with a fourth vehicle (i.e., ./gazebo_sitl_multiple_run.sh -n 4 -t px4_sitl_rtps + micrortps_agent -t UDP -r 2026 -s 2025 -n vehicle4 -v & + AGENT_NAMESPACE=vehicle4 SIM=1 ros2 launch drone_launch.py), many warnings begin to appear on the micrortps terminals:

[   micrortps_agent   ] VehicleOdometry publisher matched
[ micrortps__timesync ] RTTI too high for timesync: 102ms
[ micrortps__timesync ] Offset not updated
[ micrortps__timesync ] Timesync offset outlier, discarding
[ micrortps__timesync ] Offset not updated
[ micrortps__timesync ] Timesync offset outlier, discarding
[ micrortps__timesync ] Offset not updated
[ micrortps__timesync ] Timesync offset outlier, discarding
[ micrortps__timesync ] Offset not updated
[ micrortps__timesync ] Timesync offset outlier, discarding
[ micrortps__timesync ] Offset not updated
[ micrortps__timesync ] Timesync offset outlier, discarding
[ micrortps__timesync ] Offset not updated
[ micrortps__timesync ] Timesync offset outlier, discarding
[ micrortps__timesync ] Offset not updated
[ micrortps__timesync ] Timesync clock changed, resetting
[   micrortps_agent   ] OffboardControlMode subscriber unmatched

When we start the ROS2 nodes that control the vehicles, they do not execute the sequence of actions described earlier. Instead, we get a message from the brain rosnode, indicating that a service from the offboard_control rosnode is unavailable. (With 1, 2, and 3 vehicles, we do not get this message. Instead, the rosservice is available and everything works.) We have attempted this on two different machines and get similar results:

The first is a laptop with an Intel Core i9, with 16 cores and 32 GB of RAM. When the above simulation is running, with all the corresponding terminals, these resources are minimally used.
The second is a Dell workstation with an Intel Core i9 with 32 cores and 64 GB of RAM. It also has a GeForce RTX 3080 GPU. When the above simulation is running, with all the corresponding terminals, these resources are minimally used.
For both machines, we are using Ubuntu 20.04, ROS2 Foxy, and Python 3.8.

We have not been able to determine the cause of this behavior. Is there a reason why the micrortps-agent is not able to manage the 4th drone? Is there a better way to do this? (We would like to keep scaling our code to more than 4 drones -- such as 10). Please let us know if additional information is needed. We appreciate any help or guidance!

Linked pages (docs.px4.io):
ROS 2 Offboard Control Example | PX4 User Guide
RTPS/DDS Interface: PX4-Fast RTPS(DDS) Bridge | PX4 User Guide

UPDATE: Looking at the figure below, we noticed that we start many micrortps_agents (one for each drone), but there is only one micrortps_client. As soon as we run multiple drones, a fair amount of data flows over the micrortps bridge (there are ~28 messages for each drone, and some of those messages have a lot of fields), and the amount of data sent to the single micrortps_client increases with every added vehicle and micrortps_agent. We believe this is why, as soon as we add a 4th vehicle, the terminals running the micrortps_agents show these warnings:

[ micrortps__timesync ] RTTI too high for timesync: 150ms
[ micrortps__timesync ] Offset not updated
[ micrortps__timesync ] RTTI too high for timesync: 182ms
[ micrortps__timesync ] Offset not updated

By reducing the number of px4_com_ros messages, we were able to run up to 5 vehicles correctly instead of 3, but as we keep adding more vehicles, the warnings reappear and we are not able to scale our system to more drones.

Therefore, is there a way to add more micrortps_clients to our system so that each drone will have its own micrortps_agent communicating with its specific micrortps_client and we can avoid this bottleneck?

Thank you very much for your time and availability

Figure: https://dev.px4.io/v1.9.0/assets/middleware/micrortps/architecture_ros.png

jaredsjohansen commented 2 years ago

same issue here

dirksavage88 commented 2 years ago

@jaredsjohansen Regarding the OP's issue, we can trace where the warning/error message about the RTTI being too high for timesync comes from: https://github.com/PX4/PX4-Autopilot/blob/30e2490d5b50c0365052e00d53777b1f4068deab/msg/templates/urtps/microRTPS_timesync.cpp.em#L199

An explanation for the function parameters of TimeSync::addMeasurement(): https://github.com/PX4/PX4-Autopilot/blob/30e2490d5b50c0365052e00d53777b1f4068deab/msg/templates/urtps/microRTPS_timesync.h.em#L177

It appears that the member variable '_rtti' stores an offset based on the agent current CLOCK_MONOTONIC time - agent CLOCK_MONOTONIC_RAW when message was sent, or "_rtti = local_t3_ns - local_t1_ns;"

However, the difference between these two timestamps is too large, over 100 ms in your case, so the timesync offset is not applied. The function then returns false (it has a boolean return value).
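
To illustrate the check described above, here is a simplified sketch, not the PX4 implementation. `local_t1_ns` and `local_t3_ns` come from the line quoted above; the name `remote_t2_ns`, the threshold value, the tuple return shape, and the even split of the round trip are assumptions made for illustration only.

```python
from typing import Optional, Tuple

# Placeholder threshold for illustration only; NOT taken from the PX4 source.
RTT_THRESHOLD_NS = 50_000_000


def add_measurement(
    local_t1_ns: int,
    remote_t2_ns: int,
    local_t3_ns: int,
    rtt_threshold_ns: int = RTT_THRESHOLD_NS,
) -> Tuple[bool, int, Optional[int]]:
    """Return (accepted, rtti_ns, offset_ns) for one timesync exchange.

    local_t1_ns: agent time when the timesync message was sent
    remote_t2_ns: time stamped by the other side (assumed name)
    local_t3_ns: agent time when the reply arrived
    """
    # Round-trip time as measured on the agent side.
    rtti_ns = local_t3_ns - local_t1_ns
    if rtti_ns > rtt_threshold_ns:
        # Round trip too slow: reject the sample and leave the offset alone.
        return False, rtti_ns, None
    # Assumption: the round trip is split evenly between both directions.
    remote_t3_ns = remote_t2_ns + rtti_ns // 2
    offset_ns = remote_t3_ns - local_t3_ns
    return True, rtti_ns, offset_ns


if __name__ == "__main__":
    # A 102 ms round trip, as in the log above, is rejected by this sketch.
    print(add_measurement(0, 1_000_000, 102_000_000))
```

With a rejected sample the offset stays unchanged, which would match the "Offset not updated" lines that follow each "RTTI too high" warning in the log.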

Is it correct to say that this is a many-to-many scenario with agents and clients? There is a P2P mode for eProsima's Micro XRCE-DDS, but I'm not sure whether it will let you timesync correctly with more than a few agents and clients simultaneously. @dagar we were just talking about timesync in microdds...maybe this could offer a decent test scenario.

I'm not sure how the many-to-many scenario plays out exactly. Also, what is the reason for having multiple agents? The multiple clients should have an option to define a namespace if there is not a reason for many agents?

francescofraternali commented 2 years ago

@dirksavage88 thank you very much for your reply, we really appreciate your help.

I am not 100% sure what you mean when you refer to a "many-to-many scenario". What we are looking for are ROS2 nodes that talk to a single micrortps_agent, which, in turn, communicates with a specific px4 vehicle. Since we want to develop code for a swarm of drones, we need multiple vehicles relying on this mechanism. The following figure demonstrates what we are looking for:

[Figure "temp": the desired setup, with ROS2 nodes talking to one micrortps_agent per PX4 vehicle]

We followed the instructions in link to create multiple vehicles in a single environment. Per that same set of instructions, the simulator specifies the ports associated with each vehicle_id. When we run each micrortps_agent, we specify the sim ports and the desired namespace. Since the micrortps_client starts automatically as part of the simulator, we do not know of a mechanism to specify a namespace for the (single) micrortps_client. We also do not know how to start multiple micrortps_clients, as this is not covered in the documentation. Therefore, what we believe happens is this: a single micrortps_client starts (automatically) when the simulator starts, and it has to manage multiple agents. As soon as we scale beyond 3 agents, the messages cannot cross the micrortps bridge in a timely manner, which we believe is due to the single micrortps_client servicing all micrortps_agents. We believe this creates a bottleneck, as shown in the following figure:

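[Figure "temp_bott": the suspected bottleneck, with a single micrortps_client servicing all micrortps_agents]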

Our bottleneck theory was confirmed by the following experiment. We reduced the number of px4_com_ros messages crossing the micrortps bridge. When we did that, we were able to command up to 5 vehicles instead of 3. Beyond five vehicles, however, the warnings reappear and we are not able to scale our system to additional drones.

ROS2 node communication relies on DDS, so I don't think we can apply the P2P mode for eProsima's Micro XRCE-DDS that you mentioned. In the link, it states: "The peer-to-peer (P2P) mode allows direct communication between applications without DDS". Is this correct? Finally, you said, "The multiple clients should have an option to define a namespace if there is not a reason for many agents?" I am not aware of a way to specify a namespace for the micrortps_clients (nor of a way to start any beyond the one that is started by default). I think this is the key to solving our problem. Do you know how/if we can do that?

Thanks a lot for your time and availability!

beniaminopozzan commented 2 years ago

Hi @francescofraternali , are you sure that there is only one micrortps_client with multiple vehicles? Because looking at Tools/gazebo_sitl_multiple_run.sh it seems to me that an instance of px4 is started for each vehicle https://github.com/PX4/PX4-Autopilot/blob/dbf7d32e07887a402bdd72e092422a4cf3b2ae9d/Tools/gazebo_sitl_multiple_run.sh#L37 and I would expect that each of them starts its own micrortps_client.

francescofraternali commented 2 years ago

Thank you very much @beniaminopozzan for your reply.

I agree that the Tools/gazebo_sitl_multiple_run.sh script is starting multiple instances of px4. (The note on line 2 says as much.) However, I do not see evidence that it is starting multiple instances of the micrortps_client. As you can see in the attached screenshot of top, there are multiple instances of the micrortps_agent that I have started, but no evidence of multiple instances of the micrortps_client. In fact, I can't find any instances of the micrortps_client in top, which leads me to believe it is buried inside another process running in PX4-Autopilot. Do you know how I can check whether multiple micrortps_clients are running? Thanks

[Screenshot "Screenshot_top": output of top showing the running micrortps_agent instances]

dirksavage88 commented 2 years ago

It looks as though micrortps_client is running inside the PX4 process, which makes sense. After all, SITL tries to mimic an FMU, where running processes are not visible outside of MAVLink shells or the system/debug console. The process can be seen within each PX4 SITL instance (pseudo-MAVLink shell) but will not be visible in the Linux host's ps output. If it were me, I would modify the script to allow access to each PX4 SITL shell, but maybe there is an easier way.
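
If the client does live inside each PX4 process as described above, then counting px4 processes on the host is only an indirect check: it shows how many SITL instances exist, not whether each one actually started its client. A minimal sketch of that count follows; the helper names are hypothetical, and the `ps -eo args=` call assumes a Linux host with procps.

```python
import os
import subprocess
from collections import Counter
from typing import Iterable, List


def count_processes(command_lines: Iterable[str]) -> Counter:
    """Count processes by the basename of their executable.

    Note: an executable path that contains spaces would be miscounted.
    """
    counts: Counter = Counter()
    for line in command_lines:
        line = line.strip()
        if not line:
            continue
        executable = line.split()[0]
        counts[os.path.basename(executable)] += 1
    return counts


def host_command_lines() -> List[str]:
    """Return the full command line of every process visible on the host."""
    result = subprocess.run(
        ["ps", "-eo", "args="], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    counts = count_processes(host_command_lines())
    # Expect one px4 entry per simulated vehicle and one agent per vehicle.
    print("px4 instances:", counts["px4"])
    print("micrortps_agent instances:", counts["micrortps_agent"])
```

A micrortps_client count of zero here would be expected under this explanation and would not by itself mean that no client is running.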

francescofraternali commented 2 years ago

Thank you very much @dirksavage88 for your reply. Can you please explain to me how to modify the Tools/gazebo_sitl_multiple_run.sh script to allow access to each PX4 SITL shell? That will be very helpful indeed as I can start/stop each micrortps_client process. Thanks

roseyanpeng commented 10 months ago

Hello, have you solved this problem? I also encountered a similar problem.

jaredsjohansen commented 10 months ago

Nope. Never solved the problem.

We resorted to using MAVLink messages (e.g., pymavlink) to communicate back and forth between our autonomy stack and the PX4 Autopilot.

roseyanpeng commented 10 months ago

Well, this problem has been bothering us for a long time. It seems that we can only try the new version. Anyway, thanks for your reply.

PX4 / px4_ros_com

Title: Scaling micro-rtps agents to 4+ vehicles #144