This problem, which very rarely happens, has haunted me for the last couple of years. The last time it happened was more than 2 years ago, and recently I got hit by it again.
The coredump of segmentation fault is as follows:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fae1b30b03e in octomap::OcTreeBaseImpl<octomap::OcTreeNode, octomap::AbstractOccupancyOcTree>::deleteNodeRecurs (this=0x7fad5c5ed420, node=0x7fad00000003) at /opt/ros/noetic/include/octomap/OcTreeBaseImpl.hxx:662
Thread 1 (Thread 0x7fad857fa700 (LWP 508197)):
#0 0x00007fae1b30b03e in octomap::OcTreeBaseImpl<octomap::OcTreeNode, octomap::AbstractOccupancyOcTree>::deleteNodeRecurs(octomap::OcTreeNode*) (this=0x7fad5c5ed420, node=0x7fad00000003) at /opt/ros/noetic/include/octomap/OcTreeBaseImpl.hxx:662
... bunch of recursive deleteNodeRecurs calls...
#15 0x00007fae1b30b061 in octomap::OcTreeBaseImpl<octomap::OcTreeNode, octomap::AbstractOccupancyOcTree>::deleteNodeRecurs(octomap::OcTreeNode*) (this=0x7fad5c5ed420, node=0x7fad5d0725d0) at /opt/ros/noetic/include/octomap/OcTreeBaseImpl.hxx:662
#16 0x00007fae1b30b107 in octomap::OcTreeBaseImpl<octomap::OcTreeNode, octomap::AbstractOccupancyOcTree>::clear() (this=0x7fad5c5ed420) at /opt/ros/noetic/include/octomap/OcTreeBaseImpl.hxx:512
#17 octomap::OcTreeBaseImpl<octomap::OcTreeNode, octomap::AbstractOccupancyOcTree>::~OcTreeBaseImpl() (this=0x7fad5c5ed420, __in_chrg=<optimized out>) at /opt/ros/noetic/include/octomap/OcTreeBaseImpl.hxx:69
#18 0x00007fae0929e5ac in fcl::OcTree<double>::~OcTree() () at /opt/ros/noetic/lib/x86_64-linux-gnu/libfcl.so.0.6
#19 0x00007fae12fb9040 in std::_Sp_counted_ptr<collision_detection::FCLGeometry*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () at /opt/ros/noetic/lib/libmoveit_collision_detection_fcl.so.1.1.13
#20 0x00007fae12fbba2c in std::_Rb_tree<std::weak_ptr<shapes::Shape const>, std::pair<std::weak_ptr<shapes::Shape const> const, std::shared_ptr<collision_detection::FCLGeometry const> >, std::_Select1st<std::pair<std::weak_ptr<shapes::Shape const> const, std::shared_ptr<collision_detection::FCLGeometry const> > >, std::owner_less<std::weak_ptr<shapes::Shape const> >, std::allocator<std::pair<std::weak_ptr<shapes::Shape const> const, std::shared_ptr<collision_detection::FCLGeometry const> > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<std::weak_ptr<shapes::Shape const> const, std::shared_ptr<collision_detection::FCLGeometry const> > >) () at /opt/ros/noetic/lib/libmoveit_collision_detection_fcl.so.1.1.13
#21 0x00007fae12fb1519 in collision_detection::cleanCollisionGeometryCache() () at /opt/ros/noetic/lib/libmoveit_collision_detection_fcl.so.1.1.13
#22 0x00007fae1ab69440 in collision_detection::World::notify(std::shared_ptr<collision_detection::World::Object const> const&, collision_detection::World::Action) () at /opt/ros/noetic/lib/libmoveit_collision_detection.so.1.1.13
#23 0x00007fae1ab696b7 in collision_detection::World::removeObject(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () at /opt/ros/noetic/lib/libmoveit_collision_detection.so.1.1.13
#24 0x00007fae1aae5b96 in planning_scene::PlanningScene::processCollisionObjectRemove(moveit_msgs::CollisionObject_<std::allocator<void> > const&) () at /opt/ros/noetic/lib/libmoveit_planning_scene.so.1.1.13
#25 0x00007fae1aaf8f0b in planning_scene::PlanningScene::processCollisionObjectMsg(moveit_msgs::CollisionObject_<std::allocator<void> > const&) () at /opt/ros/noetic/lib/libmoveit_planning_scene.so.1.1.13
#26 0x00007fae1aaf9392 in planning_scene::PlanningScene::setPlanningSceneDiffMsg(moveit_msgs::PlanningScene_<std::allocator<void> > const&) () at /opt/ros/noetic/lib/libmoveit_planning_scene.so.1.1.13
I have spent a significant time to find the root cause of the problem, but there is still one missing piece, which I would love to have a helping hand. Before that piece, let me ask two important introductory questions:
Q: How the octomap::OcTree inside the FCL Shape Cache go into destruction?
A: If we would have used octomap_monitor_->getOcTreePtr() everywhere to retrieve a shared_ptr to the monitored octomap::OcTree (collision_detection::OccMapTree to be pedantic in our case, which extends octomap::OcTree), then such a destruction would not happen. But that is not the case. There are two criticalplaces that construct a whole different collision_detection::OccMapTree object. If this object's ref count reaches 0, the destruction stage initiates. This explains how.
Q: Why the octomap::OcTree inside the FCL Shape Cache go into destruction?
A: It takes at least two planning_scene::PlanningScene objects with different lifetimes. The following pseudocode illustrates it:
auto scene_of_psm = psm->getPlanningScene();
moveit_msgs::PlanningScene msg; <A diff planning_scene msg with a non-empty octomap>
scene_of_psm->setPlanningSceneDiffMsg(msg);
ASSERT_TRUE(scene->getWorld()->hasObject(planning_scene::PlanningScene::OCTOMAP_NS));
At this time, FCLShapeCache contains a key-value pair as shape->octree of scene_of_psm, let's call them sOT1->OT1.
moveit_msgs::PlanningScene scene_msg;
{
planning_scene_monitor::LockedPlanningSceneRO locked_scene_ro(psm);
locked_scene_ro->getPlanningSceneMsg(scene_msg);
}
{
planning_scene::PlanningScenePtr other_scene =
std::make_shared<planning_scene::PlanningScene>(psm->getRobotModel());
// Creates sOT2 and OT2, also updates the FCL Shape cache with them.
ASSERT_TRUE(other_scene->usePlanningSceneMsg(scene_msg));
ASSERT_TRUE(other_scene->getWorld()->hasObject(planning_scene::PlanningScene::OCTOMAP_NS));
}
FCLShapeCache now contains a key-value pair as shape->octree of other_scene, which are sOT2->OT2. Now, there is an important bit. Beware that other_scene went out of scope. The shape corresponding to its unique octree (i.e. sOT2) reached the ref_cnt=0. The FCLShapeCache will notice it when cleanCollisionGeometryCache is called, but not now. The following statement triggers the cache invalidation chain reaction that can be seen in the coredump:
cleanCollisionGeometryCachecalls for a forced cleanup of expired key-value pairs. Expiration is checked on the key, a const weak shared_ptr to shapes::Shape polymorphic types (in our case sOT2). The destruction happens on the value, which is of type FCLGeometryConstPtr, which is a wrapper on top of OT2. This explains why. Following logs also show the flow of events:
[ INFO] [1724780485.232404095]: Created parent shape for octree at 0x562d03a5fb80
[ INFO] [1724780485.232447396]: Collision data structures for <octomap> retrieved from cache with root node: 0x562d03a5bb50
[ INFO] [1724780485.232471962]: Octree ptr addr from obj shape: 0x562d03a5fb80 - with total sp-count: 2 - with root node: 0x562d03a5bb50 - with parent shape addr: 0x562d03a5b230 - with parent shape sp-count: 2
[ INFO] [1724780485.232858935]: Created parent shape for octree at 0x562d03900340
[ INFO] [1724780485.232905622]: Collision data structures for <octomap> retrieved from cache with root node: 0x562d03a5f550
[ INFO] [1724780485.232939095]: Octree ptr addr from obj shape: 0x562d03900340 - with total sp-count: 2 - with root node: 0x562d03a5f550 - with parent shape addr: 0x562d038fd060 - with parent shape sp-count: 2
[ INFO] [1724780485.233034593]: Parent shape of <octomap> with root node: 0x562d03a5f550 seems to have expired. Removing from the cache!
Now comes the missing piece. For the abovementioned coredump to happen, there has to be a simultaneous write operation going on OT2via a raw pointer (if there is another shared pointer instance, then the destruction would not start since ref_count would not be 0) at the time of destruction, such that a prune or delete creates a data race. Nevertheless, I could not find such a usage in MoveIt even though I stringently inspected the codebase. All octomap::OcTreeusages are via shared pointers and they are even const. If anyone can think of a way that this could happen, I'd love to hear it.
The only remaining possibility I see is a broken collision_detection::OccMapTree at the time of construction in createOctomap method such that a NULL child node slept in the OcTree until destruction.
How to fix?
Whenever a collision_detection::OccMapTree is required, use the one from OccupancyMapMonitor with octomap_monitor_->getOcTreePtr(). This way the ref count never reaches 0, which means no destruction.
Besides, there is an additional problem in allowing other instances of collision_detection::OccMapTree. See the following snippet:
auto scene_of_psm = psm->getPlanningScene(); // Assume it already contains an octomap shape constructed with the octree pointer received from octomap_monitor_
moveit_msgs::PlanningScene msg; <A diff planning_scene msg with a non-empty octomap shape>
scene_of_psm->setPlanningSceneDiffMsg(msg);
The first line refers to here. However, at the end a different collision_detection::OccMapTree than the monitored one resides in the octomap shape. Typically, we trigger a scene update event after overwriting the scene of PlanningSceneMonitor:
psm->triggerSceneUpdateEvent(UPDATE_SCENE);
which causes the scenePublishingThread to run one cycle. In its loop body, it acquires a read lock on the monitored octree, however the scene_of_psm actually contains a different octree, which makes that lock useless. Furthermore, locking the monitored octree whenever a read operation takes place on PlanningSceneMonitors scene_ shows that this is an implicit assumption in the codebase: Any octomap shape in the planning_scene::PlanningScene of PlanningSceneMonitor contains a shared_ptr to the monitored octree, which currently does not hold. The proposed fix would make this assumption valid again.
This problem, which very rarely happens, has haunted me for the last couple of years. The last time it happened was more than 2 years ago, and recently I got hit by it again.
The coredump of segmentation fault is as follows:
I have spent a significant time to find the root cause of the problem, but there is still one missing piece, which I would love to have a helping hand. Before that piece, let me ask two important introductory questions:
Q: How the
octomap::OcTree
inside the FCL Shape Cache go into destruction? A: If we would have usedoctomap_monitor_->getOcTreePtr()
everywhere to retrieve a shared_ptr to the monitoredoctomap::OcTree
(collision_detection::OccMapTree
to be pedantic in our case, which extendsoctomap::OcTree
), then such a destruction would not happen. But that is not the case. There are two critical places that construct a whole differentcollision_detection::OccMapTree
object. If this object's ref count reaches 0, the destruction stage initiates. This explains how.Q: Why the
octomap::OcTree
inside the FCL Shape Cache go into destruction? A: It takes at least twoplanning_scene::PlanningScene
objects with different lifetimes. The following pseudocode illustrates it:At this time,
FCLShapeCache
contains a key-value pair as shape->octree ofscene_of_psm
, let's call themsOT1
->OT1
.FCLShapeCache
now contains a key-value pair as shape->octree ofother_scene
, which aresOT2
->OT2
. Now, there is an important bit. Beware thatother_scene
went out of scope. The shape corresponding to its unique octree (i.e.sOT2
) reached the ref_cnt=0. TheFCLShapeCache
will notice it whencleanCollisionGeometryCache
is called, but not now. The following statement triggers the cache invalidation chain reaction that can be seen in the coredump:cleanCollisionGeometryCache
calls for a forced cleanup of expired key-value pairs. Expiration is checked on the key, a const weak shared_ptr toshapes::Shape
polymorphic types (in our casesOT2
). The destruction happens on the value, which is of typeFCLGeometryConstPtr
, which is a wrapper on top ofOT2
. This explains why. Following logs also show the flow of events:Now comes the missing piece. For the abovementioned coredump to happen, there has to be a simultaneous write operation going on
OT2
via a raw pointer (if there is another shared pointer instance, then the destruction would not start since ref_count would not be 0) at the time of destruction, such that a prune or delete creates a data race. Nevertheless, I could not find such a usage in MoveIt even though I stringently inspected the codebase. Alloctomap::OcTree
usages are via shared pointers and they are evenconst
. If anyone can think of a way that this could happen, I'd love to hear it.The only remaining possibility I see is a broken
collision_detection::OccMapTree
at the time of construction in createOctomap method such that a NULL child node slept in the OcTree until destruction.How to fix?
Whenever a
collision_detection::OccMapTree
is required, use the one from OccupancyMapMonitor withoctomap_monitor_->getOcTreePtr()
. This way the ref count never reaches 0, which means no destruction.Besides, there is an additional problem in allowing other instances of
collision_detection::OccMapTree
. See the following snippet:The first line refers to here. However, at the end a different
collision_detection::OccMapTree
than the monitored one resides in the octomap shape. Typically, we trigger a scene update event after overwriting the scene ofPlanningSceneMonitor
:which causes the
scenePublishingThread
to run one cycle. In its loop body, it acquires a read lock on the monitored octree, however thescene_of_psm
actually contains a different octree, which makes that lock useless. Furthermore, locking the monitored octree whenever a read operation takes place onPlanningSceneMonitor
sscene_
shows that this is an implicit assumption in the codebase: Any octomap shape in theplanning_scene::PlanningScene
ofPlanningSceneMonitor
contains a shared_ptr to the monitored octree, which currently does not hold. The proposed fix would make this assumption valid again.