Open osrf-migration opened 9 years ago
Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).
Can you get a backtrace to see which thread is causing the failure?
$ gdb gzserver
...
(gdb) r your_world_name.world
...
failure
...
(gdb) thread apply all bt
If you make a test repository with the plugin and world file (similar to the one in #1624), I'll be more able to find time to test it.
Original comment by Elte Hupkes (Bitbucket: ElteHupkes).
Ok, the repository can be found here. The easiest way to run, from the repo root:
mkdir build
cd build
cmake ../src
make
cd ../tools
./run
This starts a gdb session with the correct model / plugin paths loaded into environment variables. To run, do r testworld.sdf
.
The repo has two branches:
master
, which is pretty much the example above, only it loads the model from file and not from SDF directly (doesn't make a difference for the error)clear
, which calls World::Clear()
instead of World::RemoveModel()
. Like I mentioned before, this works fine with the initial model, so I'm using a different model here which has an IMU-sensor, I suspect the sensor's what's causing the problem. This actually segfaults, so it may take a few inserts / removes before invalid memory is accessed. For me, the gdb output of this second example is:Plugin loaded.
Added model #0
Models in world: 1
World cleared, adding new...
Added model #0
Models in world: 1
World cleared, adding new...
Added model #0
*** Error in `/usr/bin/gzserver': free(): invalid next size (fast): 0x00007fff7829ef00 ***
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff99bf8700 (LWP 13213)]
0x00007ffff5e4b119 in ?? () from /usr/lib/x86_64-linux-gnu/libsdformat.so.2
That may or me not be an entirely different issue - if you want, I'll create another issue for it.
Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).
It looks like a transport problem according to the failure in thread 5. I'll cc @caguero since he's the transport expert.
Original comment by Elte Hupkes (Bitbucket: ElteHupkes).
Alright. FYI, the "clear" branch seems to be a different issue - indeed it doesn't seem to segfault if the deleted model doesn't have a sensor (an IMU sensor, at least). This is my workaround for now, I don't need the sensors for contact detection, so I remove them before inserting the model. Would you like me to report a different issue for this?
Original comment by Elte Hupkes (Bitbucket: ElteHupkes).
I'd like to add some info to this since I'm running into it again, this time without any hope for a workaround. The world plugin I'm using is really quite simple in that it just inserts and removes certain models based on some simulation state. I consistently get boost null pointers on Thread 1 on removing models using an entity_delete
request though. Relevant part of the gdb output:
Thread 1 (Thread 0x7ffff7f80880 (LWP 26363)):
#0 0x00007ffff49e7cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff49eb0d8 in __GI_abort () at abort.c:89
#2 0x00007ffff49e0b86 in __assert_fail_base (
fmt=0x7ffff4b31830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7ffff70afb21 "px != 0",
file=file@entry=0x7ffff70afef8 "/usr/include/boost/smart_ptr/shared_ptr.hpp", line=line@entry=653,
function=function@entry=0x7ffff70b2580 "typename boost::detail::sp_member_access<T>::type boost::shared_ptr<T>::operator->() const [with T = gazebo::physics::Entity; typename boost::detail::sp_member_access<T>::type = gazebo::physics::Entit"...) at assert.c:92
#3 0x00007ffff49e0c32 in __GI___assert_fail (assertion=0x7ffff70afb21 "px != 0",
file=0x7ffff70afef8 "/usr/include/boost/smart_ptr/shared_ptr.hpp", line=653,
function=0x7ffff70b2580 "typename boost::detail::sp_member_access<T>::type boost::shared_ptr<T>::operator->() const [with T = gazebo::physics::Entity; typename boost::detail::sp_member_access<T>::type = gazebo::physics::Entit"...) at assert.c:101
#4 0x00007ffff705c2dc in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.5
#5 0x00007ffff707393e in gazebo::sensors::ContactSensor::Fini() ()
from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.5
#6 0x00007ffff70a3ca6 in gazebo::sensors::SensorManager::SensorContainer::RemoveSensor(std::string const&) () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.5
#7 0x00007ffff70a4258 in gazebo::sensors::SensorManager::Update(bool) ()
from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.5
#8 0x000000000041583f in gazebo::Server::Run() ()
#9 0x000000000040dde9 in ?? ()
#10 0x00007ffff49d2ec5 in __libc_start_main (main=0x40dcf0, argc=3, argv=0x7fffffffd828,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd818)
at libc-start.c:287
#11 0x000000000040fb2e in _start ()
Again, this only seems to happen if the model has sensors - which makes sense given that the error here appears to originate from a ContactSensor
.
Curiously, this also only appears to happen when using gzserver
, I've yet to see it happen using the gazebo
command, though this might be a coincidence.
Original comment by Elte Hupkes (Bitbucket: ElteHupkes).
Problem's still there with Gazebo 6 by the way, and occurs whenever I repeatedly insert / remove a model through the messaging interface and the model has a contact sensor. Sometimes the error's a boost null pointer as above, sometimes it is a segmentation fault; I suspect sensors are deleted twice under certain circumstances.
Original comment by Elte Hupkes (Bitbucket: ElteHupkes).
Ok I'm pretty sure that this is what happens to trigger the crash on the sensor thread:
Link::Fini()
is called for all links, which in turn calls sensors::remove_sensor()
for each of its sensorsremoveSensors
property of the SensorManager
to be removed by the main sensor threadContactSensor::Fini()
, which tries to remove the contact filter from the physics manager. To do so, it fetches the scoped name of its parent entity with this->world->GetEntity(this->parentName)->GetScopedName()
. However, per the previous point, the world no longer knows about the entity that goes by this->parentName
, so the smart pointer returned by World::GetEntity()
is null
, and ->GetScopedName()
on a null pointer crashes the code.The only thing I haven't been able to figure out is why this doesn't happen when the GUI is enabled, i.e. it happens when running gzserver
only, but not when running gazebo
or when running gzserver
and gzclient
separately. Possibly the extra I/O overhead caused by communicating all the deletes to the client causes the thread to be preempted and the sensor thread to run before the entities are deleted from the world.
Original comment by Elte Hupkes (Bitbucket: ElteHupkes).
I can confirm the above is indeed the cause of the null pointer with the contact sensor, replacing the contact sensor with one that doesn't try to get its parent at this stage solves the issue. It can simply be worked around by storing the topic name the first time it is calculated and using that to unregister it.
I've also managed to isolate why the segfault occurs, which is a different issue though triggered through a similar mechanism. At the point of the segfault we have at thread 1:
Thread 1 (Thread 0x7ffff7f7d880 (LWP 5756)):
#0 0x00007ffff4da2088 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#1 0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#2 0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#3 0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#4 0x00007ffff4da1e34 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#5 0x00007ffff6303ef6 in gazebo::physics::Base::~Base() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#6 0x00007ffff6340ff9 in gazebo::physics::Link::~Link() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#7 0x00007ffff12eff09 in gazebo::physics::ODELink::~ODELink() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics_ode.so.6
#8 0x00007ffff5ff71ce in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#9 0x00007ffff601ae3a in gazebo::sensors::ImuSensor::~ImuSensor() () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#10 0x00007ffff601aed9 in gazebo::sensors::ImuSensor::~ImuSensor() () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#11 0x00007ffff5ff71ce in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#12 0x00007ffff603915e in gazebo::sensors::SensorManager::SensorContainer::RemoveSensor(std::string const&) () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#13 0x00007ffff6039688 in gazebo::sensors::SensorManager::Update(bool) () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#14 0x00007ffff7bb1037 in gazebo::Server::Run() () from /usr/lib/x86_64-linux-gnu/libgazebo.so.6
#15 0x0000000000402449 in ?? ()
#16 0x00007ffff68b7ec5 in __libc_start_main (main=0x402350, argc=2, argv=0x7fffffffd7f8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd7e8) at libc-start.c:287
#17 0x00000000004029b8 in _start ()
Meanwhile at thread 23:
Thread 23 (Thread 0x7fff9d1f7700 (LWP 5781)):
#0 0x00007ffff69150e7 in _int_free (av=0x7fff78000020, p=<optimized out>, have_lock=0) at malloc.c:3987
#1 0x00007ffff4da00c2 in sdf::Element::~Element() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#2 0x00007ffff4da0349 in sdf::Element::~Element() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#3 0x00007ffff4da1f01 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#4 0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#5 0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#6 0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#7 0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#8 0x00007ffff4da1e34 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#9 0x00007ffff4da1e34 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#10 0x00007ffff6303ef6 in gazebo::physics::Base::~Base() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#11 0x00007ffff6354e79 in gazebo::physics::Model::~Model() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#12 0x00007ffff62efe0e in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#13 0x00007ffff6388386 in gazebo::physics::World::RemoveModel(std::string const&) () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#14 0x00007ffff6388bbc in gazebo::physics::World::ProcessEntityMsgs() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#15 0x00007ffff6391190 in gazebo::physics::World::ProcessMessages() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#16 0x00007ffff6395198 in gazebo::physics::World::Step() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#17 0x00007ffff6395615 in gazebo::physics::World::RunLoop() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#18 0x00007ffff4b6ba4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#19 0x00007ffff5796182 in start_thread (arg=0x7fff9d1f7700) at pthread_create.c:312
#20 0x00007ffff699047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Both the main thread and the sensor thread call Element::Reset()
on the sensor element. This would be fine, if they didn't run consecutively, at some point during both the calls the smart pointers' reference counter of one of these elements goes to zero, and the Element::~Element()
is called before the call to Element::Reset()
on the different thread finishes. This call then tries to access the dataPtr
of the element, which was just deleted by Element::~Element()
, resulting in the segfault. I'm not sure what the best solution to this problem is, apart from having to make sure that no two threads are resetting the same element consecutively.. Perhaps Element::Reset()
should be made thread-safe as an issue of SDFormat
?
@scpeters this issue has become a mess because it's actually separate issues:
ContactSensor
as detailed in the previous commentI've little hope anything is going to be solved with the issue in this state. Is it a good idea if I create new bugreports for (1) and (2), leaving this issue for (3) until I have more information about it?
Original report (archived issue) by Elte Hupkes (Bitbucket: ElteHupkes).
I'm writing a plugin which inserts a model into a world, checks for contacts and then removes it again. This works fine until the
RemoveModel
step, which crashes the plugin with a NULL boost smart pointer. I've tried doing it several ways already (through a topic, throughworld_->Clear()
which would also be fine in my case), but every approach sooner or later crashes with either a threading error, a boost null pointer dereferencing or a segmentation fault. Probably I'm clearing the models at an unexpected moment, but I honestly have no idea how to work around it. It also seems that which approach crashes when depends on the complexity of the model.The simplest case, using
RemoveModel
, crashes always as long as the model has more than one link. I've managed to isolate it into a plugin that causes it:Just attach this plugin to any world, run it, and it will crash with:
I've been testing it with this empty zero gravity world:
Unfortunately I haven't yet been able to figure out why this happens. Is there anything about the way I'm deleting the model that is unsafe? If I do it with
world_->Clear()
at the same point it works fine for this example, but crashes with a segfault with another (the conditions of which I can't yet pinpoint, but I think it has something to do with a sensor updating after the model is deleted).If possible, could someone perhaps suggest a workaround, i.e. a way of deleting a model safely?