Lambkin got stuck during a benchmark run

glpuga commented 1 month ago

Bug description

During a large benchmark run testing beluga, lambkin got stuck on one case and never recovered.

How to reproduce

No idea.

Expected behavior

Continue to run until the final case.

Actual behavior

About two days into the run, it stopped moving forward. The ROS nodes were up, but nothing relevant was logged, and the output bagfile was empty.

Additional context

The computer was not obviously short of any resource: there was enough disk space, and the (beefy) machine was otherwise idle.

These are the logs of the final few cases/iterations leading up to the stop. I removed the bagfiles due to their size, but all except the last one were of the expected size. The bagfile of the iteration that got stuck was empty, as if nothing had been recorded since the iteration started.

tor_wic_slam_error.tar.gz

glpuga commented 2 weeks ago

Qualitatively, this became much more noticeable after instrumenting the measurements with timememory. A previous run without timememory got stuck once over its four (effective) days, whereas with it I've had to restart five times so far and I'm only halfway through the same set of bagfiles.

The limited set of logs I looked at seem to have these things in common:

- The logs never start, and the amcl nodes never start processing data.
- One node in the set (nav2_amcl in one case, the rosbag recorder in another) logs a message like "failed to send response to /rosbag2_recorder/list_parameters (timeout)".
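
To check whether the other stuck iterations show the same symptom, something like the following could scan a directory of logs for that message. This is only a rough sketch: the regular expression is modeled on the single message quoted above, and the `*.log` file extension and the function names are assumptions for illustration, not anything taken from lambkin.

```python
import re
from pathlib import Path

# Pattern modeled on the message quoted above; the service name varies per node.
TIMEOUT_RE = re.compile(r"failed to send response to (\S+) \(timeout\)")


def find_timeout_services(text):
    """Return the service names found in 'failed to send response' messages."""
    return TIMEOUT_RE.findall(text)


def scan_log_dir(log_dir, pattern="*.log"):
    """Map each matching log file under log_dir to the services that timed out.

    The "*.log" default is an assumption; adjust it to the actual log layout.
    """
    hits = {}
    for path in sorted(Path(log_dir).rglob(pattern)):
        services = find_timeout_services(path.read_text(errors="replace"))
        if services:
            hits[str(path)] = services
    return hits
```

Calling `scan_log_dir` on the output directory of a run would list only the log files that contain the timeout message, which makes it easier to see whether every stuck iteration has one.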

glpuga commented 2 weeks ago

I'll try this: https://github.com/ros2/rmw_fastrtps/pull/704

Ekumen-OS / lambkin