eclipse-zenoh / zenoh-plugin-ros2dds

A Zenoh plug-in for ROS2 with a DDS RMW.
https://zenoh.io

[Bug] zenoh router takes 100% CPU after closing transient local subscriber #209

Open JMvanBruggen opened 1 month ago

JMvanBruggen commented 1 month ago

Describe the bug

Zenoh router takes 100% CPU usage after closing a transient local publisher/subscriber.

To reproduce

  1. Start a router: zenoh-bridge-ros2dds -m router
  2. Start a transient local publisher or subscriber. I recreated the rclcpp_minimal_subscriber example and rclcpp_minimal_publisher example, only changing the QoS to rclcpp::QoS(1).transient_local()
  3. Stop the publisher/subscriber

The router keeps functioning, but CPU usage spikes to 100% and stays there. There are no warning or error messages. I have to restart the router to fix it, but this happens every time a node with a transient local publisher/subscriber fails or respawns. I first noticed this on the client side, by the way, but that has an extra reproduction step; I'm not sure if it is caused by the same issue or needs a new item.

  1. Start a router: zenoh-bridge-ros2dds -m router
  2. Start a client: zenoh-bridge-ros2dds -e tcp/ROBOT_IP:7447 -m client
  3. Start a transient local subscriber or publisher on the client side.
  4. Stop the publisher/subscriber

In this case there are warnings on the router side:

2024-08-06T10:29:17.076170Z  INFO async-std/runtime ThreadId(41) zenoh_plugin_ros2dds: Remote bridge 24c64d9fc79effddd3916e2ebad14184 retires Subscriber parameter_events
2024-08-06T10:29:17.076190Z  INFO async-std/runtime ThreadId(41) zenoh_plugin_ros2dds: Remote bridge 24c64d9fc79effddd3916e2ebad14184 retires Service Server minimal_subscriber/list_parameters
2024-08-06T10:29:17.076362Z  WARN async-std/runtime ThreadId(41) zenoh_plugin_ros2dds::route_service_cli: Route Service Client (ROS:/minimal_subscriber/list_parameters <-> Zenoh:minimal_subscriber/list_parameters): Error getting GUID of DDS entity - retcode=-3
2024-08-06T10:29:17.076506Z  INFO async-std/runtime ThreadId(41) zenoh_plugin_ros2dds::routes_mgr: Route Service Client (ROS:/minimal_subscriber/list_parameters <-> Zenoh:minimal_subscriber/list_parameters) removed
2024-08-06T10:29:17.078333Z  INFO async-std/runtime ThreadId(41) zenoh_plugin_ros2dds: Remote bridge 24c64d9fc79effddd3916e2ebad14184 retires Service Server minimal_subscriber/describe_parameters
2024-08-06T10:29:17.078537Z  WARN async-std/runtime ThreadId(41) zenoh_plugin_ros2dds::route_service_cli: Route Service Client (ROS:/minimal_subscriber/describe_parameters <-> Zenoh:minimal_subscriber/describe_parameters): Error getting GUID of DDS entity - retcode=-3
2024-08-06T10:29:17.078673Z  INFO async-std/runtime ThreadId(41) zenoh_plugin_ros2dds::routes_mgr: Route Service Client (ROS:/minimal_subscriber/describe_parameters <-> Zenoh:minimal_subscriber/describe_parameters) removed
2024-08-06T10:29:17.081728Z  INFO async-std/runtime ThreadId(40) zenoh_plugin_ros2dds: Remote bridge 24c64d9fc79effddd3916e2ebad14184 retires Service Server minimal_subscriber/set_parameters_atomically
2024-08-06T10:29:17.081874Z  WARN async-std/runtime ThreadId(40) zenoh_plugin_ros2dds::route_service_cli: Route Service Client (ROS:/minimal_subscriber/set_parameters_atomically <-> Zenoh:minimal_subscriber/set_parameters_atomically): Error getting GUID of DDS entity - retcode=-3
2024-08-06T10:29:17.082007Z  INFO async-std/runtime ThreadId(40) zenoh_plugin_ros2dds::routes_mgr: Route Service Client (ROS:/minimal_subscriber/set_parameters_atomically <-> Zenoh:minimal_subscriber/set_parameters_atomically) removed
2024-08-06T10:29:17.083851Z  INFO async-std/runtime ThreadId(40) zenoh_plugin_ros2dds: Remote bridge 24c64d9fc79effddd3916e2ebad14184 retires Service Server minimal_subscriber/set_parameters
2024-08-06T10:29:17.084034Z  WARN async-std/runtime ThreadId(40) zenoh_plugin_ros2dds::route_service_cli: Route Service Client (ROS:/minimal_subscriber/set_parameters <-> Zenoh:minimal_subscriber/set_parameters): Error getting GUID of DDS entity - retcode=-3
2024-08-06T10:29:17.084169Z  INFO async-std/runtime ThreadId(40) zenoh_plugin_ros2dds::routes_mgr: Route Service Client (ROS:/minimal_subscriber/set_parameters <-> Zenoh:minimal_subscriber/set_parameters) removed
2024-08-06T10:29:17.085890Z  INFO async-std/runtime ThreadId(40) zenoh_plugin_ros2dds: Remote bridge 24c64d9fc79effddd3916e2ebad14184 retires Service Server minimal_subscriber/get_parameter_types
2024-08-06T10:29:17.086024Z  WARN async-std/runtime ThreadId(40) zenoh_plugin_ros2dds::route_service_cli: Route Service Client (ROS:/minimal_subscriber/get_parameter_types <-> Zenoh:minimal_subscriber/get_parameter_types): Error getting GUID of DDS entity - retcode=-3
2024-08-06T10:29:17.086128Z  INFO async-std/runtime ThreadId(40) zenoh_plugin_ros2dds::routes_mgr: Route Service Client (ROS:/minimal_subscriber/get_parameter_types <-> Zenoh:minimal_subscriber/get_parameter_types) removed
2024-08-06T10:29:17.088391Z  INFO async-std/runtime ThreadId(40) zenoh_plugin_ros2dds: Remote bridge 24c64d9fc79effddd3916e2ebad14184 retires Service Server minimal_subscriber/get_parameters
2024-08-06T10:29:17.088465Z  WARN async-std/runtime ThreadId(40) zenoh_plugin_ros2dds::route_service_cli: Route Service Client (ROS:/minimal_subscriber/get_parameters <-> Zenoh:minimal_subscriber/get_parameters): Error getting GUID of DDS entity - retcode=-3
2024-08-06T10:29:17.088484Z  INFO async-std/runtime ThreadId(40) zenoh_plugin_ros2dds::routes_mgr: Route Service Client (ROS:/minimal_subscriber/get_parameters <-> Zenoh:minimal_subscriber/get_parameters) removed
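
To put a number on the CPU spike described above, a small helper can sample the bridge process before and after the publisher/subscriber is stopped. This is only a sketch, assuming a Linux host with /proc; `cpu_ticks` and `cpu_percent` are hypothetical helper names, not part of zenoh or ROS 2.

```shell
# Sketch: CPU usage of one process over an interval (Linux /proc only).

# Print the cumulative CPU ticks (utime + stime) consumed by a PID.
cpu_ticks() {
    local stat
    stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
    # Drop the leading "pid (comm) " so a process name containing
    # spaces cannot shift the remaining fields.
    stat=${stat##*") "}
    # The remainder starts at the state field, which makes
    # utime and stime the 12th and 13th words.
    set -- $stat
    echo $(( ${12} + ${13} ))
}

# Print the average CPU percentage of a PID over INTERVAL whole seconds.
# 100 means one fully busy core; multi-threaded processes can exceed 100.
cpu_percent() {
    local pid=$1 interval=${2:-5} hz before after
    hz=$(getconf CLK_TCK)
    before=$(cpu_ticks "$pid") || return 1
    sleep "$interval"
    after=$(cpu_ticks "$pid") || return 1
    echo $(( (after - before) * 100 / (hz * interval) ))
}
```

For example, `cpu_percent "$(pgrep -o zenoh-bridge-ro)" 10` would report the router's average usage over 10 seconds. A value that stays near 100 after the node has exited would match the behaviour reported here.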

System info

aosmw commented 1 month ago

I often see 100% CPU but don't know what triggers it.

I flamegraphed it when I caught it. Maybe we can compare flamegraphs and give the maintainers a hint.

# Install perf
sudo apt install linux-tools-generic
cargo install flamegraph

# Read the help
flamegraph -h

# Use sudo to run flamegraph
flamegraph --root -p $(pgrep zenoh-bridge-ro)

# Press Ctrl+C after 20-30 seconds
# A flamegraph.svg file is created

# Open flamegraph.svg in a browser and attach it here.

[Image: flamegraph]

NOTE: It's an interactive SVG, but it appears that the GitHub view is doing something to prevent the interactivity.

NOTE2: dpkg-query --show zenoh-bridge-ros2dds reports:
zenoh-bridge-ros2dds 0.11.0-stable