gazebosim / gazebo-classic

Gazebo classic. For the latest version, see https://github.com/gazebosim/gz-sim
http://classic.gazebosim.org/
Other
1.19k stars 480 forks source link

Removing model from plugin crashes with null pointer #1629

Open osrf-migration opened 9 years ago

osrf-migration commented 9 years ago

Original report (archived issue) by Elte Hupkes (Bitbucket: ElteHupkes).


I'm writing a plugin which inserts a model into a world, checks for contacts and then removes it again. This works fine until the RemoveModel step, which crashes the plugin with a NULL boost smart pointer. I've tried doing it several ways already (through a topic, through world_->Clear() which would also be fine in my case), but every approach sooner or later crashes with either a threading error, a boost null pointer dereferencing or a segmentation fault. Probably I'm clearing the models at an unexpected moment, but I honestly have no idea how to work around it. It also seems that which approach crashes when depends on the complexity of the model.

The simplest case, using RemoveModel, crashes always as long as the model has more than one link. I've managed to isolate it into a plugin that causes it:

#!c++

#include <gazebo/gazebo.hh>
#include <gazebo/physics/physics.hh>
#include <boost/thread/mutex.hpp>
#include <boost/lexical_cast.hpp>
#include <iostream>
#include <queue>

namespace gazebo
{
class BreakPlugin : public WorldPlugin {
public:
    BreakPlugin() : WorldPlugin() {}

    void Load(physics::WorldPtr _world, sdf::ElementPtr _sdf) {
        std::cout << "Plugin loaded." << std::endl;

        counter_ = 0;

        world_ = _world;

        // Pause the world
        world_->SetPaused(true);

        // Create a new transport node for advertising / subscribing
        node_.reset(new transport::Node());
        node_->Init();

        // Subscribe to model insertion
        modelSub_ = node_->Subscribe("~/model/info", &BreakPlugin::OnModel, this);

        // Subscribe to contacts
        contactsSub_ = node_->Subscribe("~/physics/contacts", &BreakPlugin::OnContacts, this);

        this->AddNew();
    }

    void AddNew() {
        // The sphere model from the world plugin example - remove works
        // fine with one link, crashes with two.
        sdf::SDF sphereSDF;
        sphereSDF.SetFromString(
                "<sdf version ='1.5'>\
                      <model name ='sphere'>\
                        <pose>1 0 0 0 0 0</pose>\
                        <link name ='link'>\
                          <pose>0 0 .5 0 0 0</pose>\
                          <collision name ='collision'>\
                            <geometry>\
                              <sphere><radius>0.5</radius></sphere>\
                            </geometry>\
                          </collision>\
                          <visual name ='visual'>\
                            <geometry>\
                              <sphere><radius>0.5</radius></sphere>\
                            </geometry>\
                          </visual>\
                        </link>\
                        <link name ='link2'>\
                          <pose>0 0 1.5 0 0 0</pose>\
                          <collision name ='collision'>\
                            <geometry>\
                              <sphere><radius>0.5</radius></sphere>\
                            </geometry>\
                          </collision>\
                          <visual name ='visual'>\
                            <geometry>\
                              <sphere><radius>0.5</radius></sphere>\
                            </geometry>\
                          </visual>\
                        </link>\
                      </model>\
                    </sdf>");

        sdf::ElementPtr model = sphereSDF.root->GetElement("model");
        std::string name = "unique_sphere_"+boost::lexical_cast<std::string>(counter_);
        model->GetAttribute("name")->SetFromString(name);
        world_->InsertModelSDF(sphereSDF);
        std::cout << "Added model #" << counter_ << std::endl;
    }

    void OnModel(ConstModelPtr & msg) {
        // Unpause the world for a step of processing
        world_->SetPaused(false);
        std::cout << "Models in world: " << world_->GetModelCount() << std::endl;
    }

    void OnContacts(ConstContactsPtr &msg) {
        // Handle contacts
        world_->SetPaused(true);
        std::string name = "unique_sphere_"+boost::lexical_cast<std::string>(counter_);

        // Crash will occur shortly after this - not removing the model works fine.
        world_->RemoveModel(name);

        // This output is still shown, so I'm assuming the crash occurs on another thread
        std::cout << "Model removed." << std::endl;
    }

private:
    // Store a pointer to the world
    physics::WorldPtr world_;

    // Request counter
    int counter_;

    // Transport node
    transport::NodePtr node_;

    // Subscribers / publishers
    transport::SubscriberPtr modelSub_;
    transport::SubscriberPtr contactsSub_;
};

GZ_REGISTER_WORLD_PLUGIN(BreakPlugin)

}

Just attach this plugin to any world, run it, and it will crash with:

/usr/include/boost/smart_ptr/shared_ptr.hpp:653: typename boost::detail::sp_member_access<T>::type boost::shared_ptr<T>::operator->() const [with T = gazebo::transport::Node; typename boost::detail::sp_member_access<T>::type = gazebo::transport::Node*]: Assertion `px != 0' failed.

I've been testing it with this empty zero gravity world:

<?xml version="1.0" ?>
<sdf version="1.5">
  <world name="default">
    <physics type="ode">
      <gravity>0 0 0</gravity>
    </physics>

  <plugin name="breakplugin" filename="libbreakplugin.so" />
  </world>
</sdf>

Unfortunately I haven't yet been able to figure out why this happens. Is there anything about the way I'm deleting the model that is unsafe? If I do it with world_->Clear() at the same point it works fine for this example, but crashes with a segfault with another (the conditions of which I can't yet pinpoint, but I think it has something to do with a sensor updating after the model is deleted).

If possible, could someone perhaps suggest a workaround, i.e. a way of deleting a model safely?

osrf-migration commented 9 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


Can you get a backtrace to see which thread is causing the failure?

$ gdb gzserver
...
(gdb) r your_world_name.world
...
failure
...
(gdb) thread apply all bt

If you make a test repository with the plugin and world file (similar to the one in #1624), I'll be more able to find time to test it.

osrf-migration commented 9 years ago

Original comment by Elte Hupkes (Bitbucket: ElteHupkes).


Ok, the repository can be found here. The easiest way to run, from the repo root:

mkdir build
cd build
cmake ../src
make
cd ../tools
./run

This starts a gdb session with the correct model / plugin paths loaded into environment variables. To run, do r testworld.sdf.

The repo has two branches:

Plugin loaded.
Added model #0
Models in world: 1
World cleared, adding new...
Added model #0
Models in world: 1
World cleared, adding new...
Added model #0
*** Error in `/usr/bin/gzserver': free(): invalid next size (fast): 0x00007fff7829ef00 ***

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff99bf8700 (LWP 13213)]
0x00007ffff5e4b119 in ?? () from /usr/lib/x86_64-linux-gnu/libsdformat.so.2

That may or me not be an entirely different issue - if you want, I'll create another issue for it.

osrf-migration commented 9 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


It looks like a transport problem according to the failure in thread 5. I'll cc @caguero since he's the transport expert.

osrf-migration commented 9 years ago

Original comment by Elte Hupkes (Bitbucket: ElteHupkes).


Alright. FYI, the "clear" branch seems to be a different issue - indeed it doesn't seem to segfault if the deleted model doesn't have a sensor (an IMU sensor, at least). This is my workaround for now, I don't need the sensors for contact detection, so I remove them before inserting the model. Would you like me to report a different issue for this?

osrf-migration commented 9 years ago

Original comment by Elte Hupkes (Bitbucket: ElteHupkes).


I'd like to add some info to this since I'm running into it again, this time without any hope for a workaround. The world plugin I'm using is really quite simple in that it just inserts and removes certain models based on some simulation state. I consistently get boost null pointers on Thread 1 on removing models using an entity_delete request though. Relevant part of the gdb output:

Thread 1 (Thread 0x7ffff7f80880 (LWP 26363)):
#0  0x00007ffff49e7cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff49eb0d8 in __GI_abort () at abort.c:89
#2  0x00007ffff49e0b86 in __assert_fail_base (
    fmt=0x7ffff4b31830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7ffff70afb21 "px != 0", 
    file=file@entry=0x7ffff70afef8 "/usr/include/boost/smart_ptr/shared_ptr.hpp", line=line@entry=653, 
    function=function@entry=0x7ffff70b2580 "typename boost::detail::sp_member_access<T>::type boost::shared_ptr<T>::operator->() const [with T = gazebo::physics::Entity; typename boost::detail::sp_member_access<T>::type = gazebo::physics::Entit"...) at assert.c:92
#3  0x00007ffff49e0c32 in __GI___assert_fail (assertion=0x7ffff70afb21 "px != 0", 
    file=0x7ffff70afef8 "/usr/include/boost/smart_ptr/shared_ptr.hpp", line=653, 
    function=0x7ffff70b2580 "typename boost::detail::sp_member_access<T>::type boost::shared_ptr<T>::operator->() const [with T = gazebo::physics::Entity; typename boost::detail::sp_member_access<T>::type = gazebo::physics::Entit"...) at assert.c:101
#4  0x00007ffff705c2dc in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.5
#5  0x00007ffff707393e in gazebo::sensors::ContactSensor::Fini() ()
   from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.5
#6  0x00007ffff70a3ca6 in gazebo::sensors::SensorManager::SensorContainer::RemoveSensor(std::string const&) () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.5
#7  0x00007ffff70a4258 in gazebo::sensors::SensorManager::Update(bool) ()
   from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.5
#8  0x000000000041583f in gazebo::Server::Run() ()
#9  0x000000000040dde9 in ?? ()
#10 0x00007ffff49d2ec5 in __libc_start_main (main=0x40dcf0, argc=3, argv=0x7fffffffd828, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd818)
    at libc-start.c:287
#11 0x000000000040fb2e in _start ()

Again, this only seems to happen if the model has sensors - which makes sense given that the error here appears to originate from a ContactSensor.

Curiously, this also only appears to happen when using gzserver, I've yet to see it happen using the gazebo command, though this might be a coincidence.

osrf-migration commented 9 years ago

Original comment by Elte Hupkes (Bitbucket: ElteHupkes).


Problem's still there with Gazebo 6 by the way, and occurs whenever I repeatedly insert / remove a model through the messaging interface and the model has a contact sensor. Sometimes the error's a boost null pointer as above, sometimes it is a segmentation fault; I suspect sensors are deleted twice under certain circumstances.

osrf-migration commented 9 years ago

Original comment by Elte Hupkes (Bitbucket: ElteHupkes).


Ok I'm pretty sure that this is what happens to trigger the crash on the sensor thread:

The only thing I haven't been able to figure out is why this doesn't happen when the GUI is enabled, i.e. it happens when running gzserver only, but not when running gazebo or when running gzserver and gzclient separately. Possibly the extra I/O overhead caused by communicating all the deletes to the client causes the thread to be preempted and the sensor thread to run before the entities are deleted from the world.

osrf-migration commented 9 years ago

Original comment by Elte Hupkes (Bitbucket: ElteHupkes).


I can confirm the above is indeed the cause of the null pointer with the contact sensor, replacing the contact sensor with one that doesn't try to get its parent at this stage solves the issue. It can simply be worked around by storing the topic name the first time it is calculated and using that to unregister it.

I've also managed to isolate why the segfault occurs, which is a different issue though triggered through a similar mechanism. At the point of the segfault we have at thread 1:

Thread 1 (Thread 0x7ffff7f7d880 (LWP 5756)):
#0  0x00007ffff4da2088 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#1  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#2  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#3  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#4  0x00007ffff4da1e34 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#5  0x00007ffff6303ef6 in gazebo::physics::Base::~Base() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#6  0x00007ffff6340ff9 in gazebo::physics::Link::~Link() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#7  0x00007ffff12eff09 in gazebo::physics::ODELink::~ODELink() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics_ode.so.6
#8  0x00007ffff5ff71ce in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#9  0x00007ffff601ae3a in gazebo::sensors::ImuSensor::~ImuSensor() () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#10 0x00007ffff601aed9 in gazebo::sensors::ImuSensor::~ImuSensor() () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#11 0x00007ffff5ff71ce in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#12 0x00007ffff603915e in gazebo::sensors::SensorManager::SensorContainer::RemoveSensor(std::string const&) () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#13 0x00007ffff6039688 in gazebo::sensors::SensorManager::Update(bool) () from /usr/lib/x86_64-linux-gnu/libgazebo_sensors.so.6
#14 0x00007ffff7bb1037 in gazebo::Server::Run() () from /usr/lib/x86_64-linux-gnu/libgazebo.so.6
#15 0x0000000000402449 in ?? ()
#16 0x00007ffff68b7ec5 in __libc_start_main (main=0x402350, argc=2, argv=0x7fffffffd7f8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd7e8) at libc-start.c:287
#17 0x00000000004029b8 in _start ()

Meanwhile at thread 23:

Thread 23 (Thread 0x7fff9d1f7700 (LWP 5781)):
#0  0x00007ffff69150e7 in _int_free (av=0x7fff78000020, p=<optimized out>, have_lock=0) at malloc.c:3987
#1  0x00007ffff4da00c2 in sdf::Element::~Element() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#2  0x00007ffff4da0349 in sdf::Element::~Element() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#3  0x00007ffff4da1f01 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#4  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#5  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#6  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#7  0x00007ffff4da1ecc in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#8  0x00007ffff4da1e34 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#9  0x00007ffff4da1e34 in sdf::Element::Reset() () from /usr/lib/x86_64-linux-gnu/libsdformat.so.3
#10 0x00007ffff6303ef6 in gazebo::physics::Base::~Base() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#11 0x00007ffff6354e79 in gazebo::physics::Model::~Model() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#12 0x00007ffff62efe0e in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#13 0x00007ffff6388386 in gazebo::physics::World::RemoveModel(std::string const&) () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#14 0x00007ffff6388bbc in gazebo::physics::World::ProcessEntityMsgs() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#15 0x00007ffff6391190 in gazebo::physics::World::ProcessMessages() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#16 0x00007ffff6395198 in gazebo::physics::World::Step() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#17 0x00007ffff6395615 in gazebo::physics::World::RunLoop() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.6
#18 0x00007ffff4b6ba4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#19 0x00007ffff5796182 in start_thread (arg=0x7fff9d1f7700) at pthread_create.c:312
#20 0x00007ffff699047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Both the main thread and the sensor thread call Element::Reset() on the sensor element. This would be fine, if they didn't run consecutively, at some point during both the calls the smart pointers' reference counter of one of these elements goes to zero, and the Element::~Element() is called before the call to Element::Reset() on the different thread finishes. This call then tries to access the dataPtr of the element, which was just deleted by Element::~Element(), resulting in the segfault. I'm not sure what the best solution to this problem is, apart from having to make sure that no two threads are resetting the same element consecutively.. Perhaps Element::Reset() should be made thread-safe as an issue of SDFormat?

@scpeters this issue has become a mess because it's actually separate issues:

  1. A bug in the ContactSensor as detailed in the previous comment
  2. The segfault from this issue
  3. The original issue with networking, which I haven't reproduced in a while, but might go hand in hand with one of the above

I've little hope anything is going to be solved with the issue in this state. Is it a good idea if I create new bugreports for (1) and (2), leaving this issue for (3) until I have more information about it?

osrf-migration commented 8 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).