OctoMap / octomap

An Efficient Probabilistic 3D Mapping Framework Based on Octrees. Contains the main OctoMap library, the viewer octovis, and dynamicEDT3D.
http://octomap.github.io
1.93k stars 657 forks source link

OcTree segfaults during FCLShapeCache invalidation #423

Closed tahsinkose closed 1 month ago

tahsinkose commented 1 month ago

This problem, which very rarely happens, has haunted me for the last couple of years. The last time it happened was more than 2 years ago, and recently I got hit by it again.

The coredump of segmentation fault is as follows:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fae1b30b03e in octomap::OcTreeBaseImpl<octomap::OcTreeNode, octomap::AbstractOccupancyOcTree>::deleteNodeRecurs (this=0x7fad5c5ed420, node=0x7fad00000003) at /opt/ros/noetic/include/octomap/OcTreeBaseImpl.hxx:662
Thread 1 (Thread 0x7fad857fa700 (LWP 508197)):
#0  0x00007fae1b30b03e in octomap::OcTreeBaseImpl<octomap::OcTreeNode, octomap::AbstractOccupancyOcTree>::deleteNodeRecurs(octomap::OcTreeNode*) (this=0x7fad5c5ed420, node=0x7fad00000003) at /opt/ros/noetic/include/octomap/OcTreeBaseImpl.hxx:662
... bunch of recursive deleteNodeRecurs calls...
#15 0x00007fae1b30b061 in octomap::OcTreeBaseImpl<octomap::OcTreeNode, octomap::AbstractOccupancyOcTree>::deleteNodeRecurs(octomap::OcTreeNode*) (this=0x7fad5c5ed420, node=0x7fad5d0725d0) at /opt/ros/noetic/include/octomap/OcTreeBaseImpl.hxx:662
#16 0x00007fae1b30b107 in octomap::OcTreeBaseImpl<octomap::OcTreeNode, octomap::AbstractOccupancyOcTree>::clear() (this=0x7fad5c5ed420) at /opt/ros/noetic/include/octomap/OcTreeBaseImpl.hxx:512
#17 octomap::OcTreeBaseImpl<octomap::OcTreeNode, octomap::AbstractOccupancyOcTree>::~OcTreeBaseImpl() (this=0x7fad5c5ed420, __in_chrg=<optimized out>) at /opt/ros/noetic/include/octomap/OcTreeBaseImpl.hxx:69
#18 0x00007fae0929e5ac in fcl::OcTree<double>::~OcTree() () at /opt/ros/noetic/lib/x86_64-linux-gnu/libfcl.so.0.6
#19 0x00007fae12fb9040 in std::_Sp_counted_ptr<collision_detection::FCLGeometry*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () at /opt/ros/noetic/lib/libmoveit_collision_detection_fcl.so.1.1.13
#20 0x00007fae12fbba2c in std::_Rb_tree<std::weak_ptr<shapes::Shape const>, std::pair<std::weak_ptr<shapes::Shape const> const, std::shared_ptr<collision_detection::FCLGeometry const> >, std::_Select1st<std::pair<std::weak_ptr<shapes::Shape const> const, std::shared_ptr<collision_detection::FCLGeometry const> > >, std::owner_less<std::weak_ptr<shapes::Shape const> >, std::allocator<std::pair<std::weak_ptr<shapes::Shape const> const, std::shared_ptr<collision_detection::FCLGeometry const> > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<std::weak_ptr<shapes::Shape const> const, std::shared_ptr<collision_detection::FCLGeometry const> > >) () at /opt/ros/noetic/lib/libmoveit_collision_detection_fcl.so.1.1.13
#21 0x00007fae12fb1519 in collision_detection::cleanCollisionGeometryCache() () at /opt/ros/noetic/lib/libmoveit_collision_detection_fcl.so.1.1.13
#22 0x00007fae1ab69440 in collision_detection::World::notify(std::shared_ptr<collision_detection::World::Object const> const&, collision_detection::World::Action) () at /opt/ros/noetic/lib/libmoveit_collision_detection.so.1.1.13
#23 0x00007fae1ab696b7 in collision_detection::World::removeObject(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () at /opt/ros/noetic/lib/libmoveit_collision_detection.so.1.1.13
#24 0x00007fae1aae5b96 in planning_scene::PlanningScene::processCollisionObjectRemove(moveit_msgs::CollisionObject_<std::allocator<void> > const&) () at /opt/ros/noetic/lib/libmoveit_planning_scene.so.1.1.13
#25 0x00007fae1aaf8f0b in planning_scene::PlanningScene::processCollisionObjectMsg(moveit_msgs::CollisionObject_<std::allocator<void> > const&) () at /opt/ros/noetic/lib/libmoveit_planning_scene.so.1.1.13
#26 0x00007fae1aaf9392 in planning_scene::PlanningScene::setPlanningSceneDiffMsg(moveit_msgs::PlanningScene_<std::allocator<void> > const&) () at /opt/ros/noetic/lib/libmoveit_planning_scene.so.1.1.13

I have spent a significant time to find the root cause of the problem, but there is still one missing piece, which I would love to have a helping hand. Before that piece, let me ask two important introductory questions:

Q: How the octomap::OcTree inside the FCL Shape Cache go into destruction? A: If we would have used octomap_monitor_->getOcTreePtr() everywhere to retrieve a shared_ptr to the monitored octomap::OcTree (collision_detection::OccMapTree to be pedantic in our case, which extends octomap::OcTree), then such a destruction would not happen. But that is not the case. There are two critical places that construct a whole different collision_detection::OccMapTree object. If this object's ref count reaches 0, the destruction stage initiates. This explains how.

Q: Why the octomap::OcTree inside the FCL Shape Cache go into destruction? A: It takes at least two planning_scene::PlanningScene objects with different lifetimes. The following pseudocode illustrates it:

auto scene_of_psm = psm->getPlanningScene();
moveit_msgs::PlanningScene msg; <A diff planning_scene msg with a non-empty octomap>
scene_of_psm->setPlanningSceneDiffMsg(msg);
ASSERT_TRUE(scene->getWorld()->hasObject(planning_scene::PlanningScene::OCTOMAP_NS));

At this time, FCLShapeCache contains a key-value pair as shape->octree of scene_of_psm, let's call them sOT1->OT1.


moveit_msgs::PlanningScene scene_msg;
 {
    planning_scene_monitor::LockedPlanningSceneRO locked_scene_ro(psm);
    locked_scene_ro->getPlanningSceneMsg(scene_msg);
 }
{
    planning_scene::PlanningScenePtr other_scene =
        std::make_shared<planning_scene::PlanningScene>(psm->getRobotModel());
    // Creates sOT2 and OT2, also updates the FCL Shape cache with them.
    ASSERT_TRUE(other_scene->usePlanningSceneMsg(scene_msg));
    ASSERT_TRUE(other_scene->getWorld()->hasObject(planning_scene::PlanningScene::OCTOMAP_NS));
}

FCLShapeCache now contains a key-value pair as shape->octree of other_scene, which are sOT2->OT2. Now, there is an important bit. Beware that other_scene went out of scope. The shape corresponding to its unique octree (i.e. sOT2) reached the ref_cnt=0. The FCLShapeCache will notice it when cleanCollisionGeometryCache is called, but not now. The following statement triggers the cache invalidation chain reaction that can be seen in the coredump:

scene_of_psm->getWorldNonConst()->removeObject(planning_scene::PlanningScene::OCTOMAP_NS);

cleanCollisionGeometryCachecalls for a forced cleanup of expired key-value pairs. Expiration is checked on the key, a const weak shared_ptr to shapes::Shape polymorphic types (in our case sOT2). The destruction happens on the value, which is of type FCLGeometryConstPtr, which is a wrapper on top of OT2. This explains why. Following logs also show the flow of events:

[ INFO] [1724780485.232404095]: Created parent shape for octree at 0x562d03a5fb80
[ INFO] [1724780485.232447396]: Collision data structures for <octomap> retrieved from cache with root node: 0x562d03a5bb50
[ INFO] [1724780485.232471962]: Octree ptr addr from obj shape: 0x562d03a5fb80 - with total sp-count: 2 - with root node: 0x562d03a5bb50 - with parent shape addr: 0x562d03a5b230 - with parent shape sp-count: 2
[ INFO] [1724780485.232858935]: Created parent shape for octree at 0x562d03900340
[ INFO] [1724780485.232905622]: Collision data structures for <octomap> retrieved from cache with root node: 0x562d03a5f550
[ INFO] [1724780485.232939095]: Octree ptr addr from obj shape: 0x562d03900340 - with total sp-count: 2 - with root node: 0x562d03a5f550 - with parent shape addr: 0x562d038fd060 - with parent shape sp-count: 2
[ INFO] [1724780485.233034593]: Parent shape of <octomap> with root node: 0x562d03a5f550 seems to have expired. Removing from the cache!

Now comes the missing piece. For the abovementioned coredump to happen, there has to be a simultaneous write operation going on OT2 via a raw pointer (if there is another shared pointer instance, then the destruction would not start since ref_count would not be 0) at the time of destruction, such that a prune or delete creates a data race. Nevertheless, I could not find such a usage in MoveIt even though I stringently inspected the codebase. All octomap::OcTree usages are via shared pointers and they are even const. If anyone can think of a way that this could happen, I'd love to hear it.

The only remaining possibility I see is a broken collision_detection::OccMapTree at the time of construction in createOctomap method such that a NULL child node slept in the OcTree until destruction.


How to fix?

Whenever a collision_detection::OccMapTree is required, use the one from OccupancyMapMonitor with octomap_monitor_->getOcTreePtr(). This way the ref count never reaches 0, which means no destruction.

Besides, there is an additional problem in allowing other instances of collision_detection::OccMapTree. See the following snippet:

auto scene_of_psm = psm->getPlanningScene(); // Assume it already contains an octomap shape constructed with the octree pointer received from octomap_monitor_
moveit_msgs::PlanningScene msg; <A diff planning_scene msg with a non-empty octomap shape>
scene_of_psm->setPlanningSceneDiffMsg(msg);

The first line refers to here. However, at the end a different collision_detection::OccMapTree than the monitored one resides in the octomap shape. Typically, we trigger a scene update event after overwriting the scene of PlanningSceneMonitor:

psm->triggerSceneUpdateEvent(UPDATE_SCENE);

which causes the scenePublishingThread to run one cycle. In its loop body, it acquires a read lock on the monitored octree, however the scene_of_psm actually contains a different octree, which makes that lock useless. Furthermore, locking the monitored octree whenever a read operation takes place on PlanningSceneMonitors scene_ shows that this is an implicit assumption in the codebase: Any octomap shape in the planning_scene::PlanningScene of PlanningSceneMonitor contains a shared_ptr to the monitored octree, which currently does not hold. The proposed fix would make this assumption valid again.

tahsinkose commented 1 month ago

Sorry guys, wrong repository 🤣