gazebosim / gazebo-classic

Gazebo classic. For the latest version, see https://github.com/gazebosim/gz-sim
http://classic.gazebosim.org/
Other
1.19k stars 480 forks source link

Race condition when deleting a model #1739

Open osrf-migration opened 9 years ago

osrf-migration commented 9 years ago

Original report (archived issue) by Elte Hupkes (Bitbucket: ElteHupkes).


(From comment 21184816 of issue #1629, since this is a separate issue)

There is a race condition which might lead Gazebo to crash with a segmentation fault when deleting a model and this model has a sensor. At the time of the crash, a backtrace of thread 1 shows:

Thread 1 (Thread 0x7ffff7f7d880 (LWP 5756)):
#0  0x00007ffff4da2088 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#1  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#2  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#3  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#4  0x00007ffff4da1e34 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#5  0x00007ffff6303ef6 in gazebo::physics::Base::~Base() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#6  0x00007ffff6340ff9 in gazebo::physics::Link::~Link() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#7  0x00007ffff12eff09 in gazebo::physics::ODELink::~ODELink() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics_ode.so.6
#8  0x00007ffff5ff71ce in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#9  0x00007ffff601ae3a in gazebo::sensors::ImuSensor::~ImuSensor() () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#10 0x00007ffff601aed9 in gazebo::sensors::ImuSensor::~ImuSensor() () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#11 0x00007ffff5ff71ce in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#12 0x00007ffff603915e in gazebo::sensors::SensorManager::SensorContainer::RemoveSensor(std::string const&) () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#13 0x00007ffff6039688 in gazebo::sensors::SensorManager::Update(bool) () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#14 0x00007ffff7bb1037 in gazebo::Server::Run() () from /usr/lib/x86_64-linux-gnu/libgazebo.so.6
#15 0x0000000000402449 in ?? ()
#16 0x00007ffff68b7ec5 in __libc_start_main (main=0x402350, argc=2, argv=0x7fffffffd7f8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd7e8) at libc-start.c:287
#17 0x00000000004029b8 in _start ()

And thread 23:

Thread 23 (Thread 0x7fff9d1f7700 (LWP 5781)):
#0  0x00007ffff69150e7 in _int_free (av=0x7fff78000020, p=<optimized out>, have_lock=0) at malloc.c:3987
#1  0x00007ffff4da00c2 in sdf::Element::~Element() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#2  0x00007ffff4da0349 in sdf::Element::~Element() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#3  0x00007ffff4da1f01 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#4  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#5  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#6  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#7  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#8  0x00007ffff4da1e34 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#9  0x00007ffff4da1e34 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#10 0x00007ffff6303ef6 in gazebo::physics::Base::~Base() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#11 0x00007ffff6354e79 in gazebo::physics::Model::~Model() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#12 0x00007ffff62efe0e in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#13 0x00007ffff6388386 in gazebo::physics::World::RemoveModel(std::string const&) () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#14 0x00007ffff6388bbc in gazebo::physics::World::ProcessEntityMsgs() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#15 0x00007ffff6391190 in gazebo::physics::World::ProcessMessages() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#16 0x00007ffff6395198 in gazebo::physics::World::Step() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#17 0x00007ffff6395615 in gazebo::physics::World::RunLoop() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#18 0x00007ffff4b6ba4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#19 0x00007ffff5796182 in start_thread (arg=0x7fff9d1f7700) at pthread_create.c:312
#20 0x00007ffff699047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Both threads perform Element::Reset(), one on the parent model, the other on the child link containing the sensor (both of which called by their respective destructors). These Reset()'s zero the reference counters of the SDF elements, causing their destructors to be called. However, since both are operating partially on the same elements, it might happen that one partial Reset() call is scheduled after a destructor has already deleted the Element's dataPtr. This then results in a segfault.

Now it seems to me that SDFormat is not designed to be thread-safe, so it would be Gazebo's job to handle this. I'm not sure how though... Use a mutex somewhere that needs to be acquired before calling Element::Reset() on any Base?

osrf-migration commented 8 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


There are other race conditions as well. I just found one in the INTEGRATION_world test (backtrace here).

The WorldTest.RemoveModelPaused test loads a world in a paused state, takes one physics step, and then calls World::RemoveModel to delete two models and verify that they are deleted. This function locks mutexes, including the physicsUpdateMutex when the model is being deleted. There is a race condition, however, as the receiveMutex is not locked, and the following sequence can occur which caused the seg-fault recorded here:

The last call reads from data structures that are being modified, which leads to the race condition.

osrf-migration commented 8 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


I just observed a similar backtrace (details here), so this hasn't yet been fixed.

osrf-migration commented 8 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


related to #1223