Open glpuga opened 1 month ago
Qualitatively, this became much more noticeable after instrumenting the measurements with timememory. A previous run without timememory got stuck once during its four (effective) days. With timememory, I've had to restart it five times so far, and I'm only halfway through the same set of bagfiles.
A limited set of logs I observed seem to have these in common:
I'll try this: https://github.com/ros2/rmw_fastrtps/pull/704
Bug description
During a large benchmark run testing beluga, lambkin got stuck on one case and never recovered.
How to reproduce
No idea.
Expected behavior
Continue to run until the final case.
Actual behavior
About two days into the run, it stopped moving forward. The ROS nodes were up, but nothing relevant was logged, and the output bagfile was empty.
Additional context
No resources were obviously exhausted on the computer: there was enough disk space, and the (beefy) machine was otherwise idle.
These are the logs of the final few cases/iterations leading up to the stop. I removed the bagfiles due to their size, but all except the last one were of the expected size. The bagfile from the iteration that got stuck was empty, as if nothing had been recorded since the iteration started.
tor_wic_slam_error.tar.gz