Issue in thread safety after #1462

Aneoshun commented 4 years ago

Hi,

It seems that #1462 makes dart less thread-safe. I am currently unable to provide a lot of details at the moment (lack of time), but I thought I had to report this asap given that it has been merged to master recently.

Following the use case that I mentioned in #1460 I clone a lot of skeletons and then run simulations in parallel (in short, the for loop is replaced by a parallel_for from tbb).

This has been working for years and is still working with commit (3 commits ago) becbeada2a8ba94834e69ceb9f4efaa92381b756 But using the commit associated with #1462 leads to systematic segfaults happening at a random point in time.

Would you have an idea of what in the #1462 fix could be causing this? Otherwise, I will investigate this as soon as I have some time.

mxgrey commented 4 years ago

I wonder if maybe the signals aren't being cloned correctly. When a skeleton is cloned, all of its connections internally and externally are supposed to be completely independent of the original. If events that are happening in parallel threads are affecting each other because of the signals, then that suggests to me that the signals are being shallow copied instead of deep copied when a cloning happens.

One possible workaround you could try if you don't want to be blocked by this bug is to completely set up all of your simulation worlds before the parallel loops begin and keep all of them alive until all the threads are joined back together. I think that should prevent this bug from negatively impacting you until it can be fixed.

Aneoshun commented 4 years ago

@mxgrey Thanks for your reply. I share your intuition, as it is clear from the problem that I raised last week (#1460) that the signals are not independent between the clones and the original. i.e., making millions of clones makes the original to have millions of (mainly disconnected) connections. I thought this was made on purpose (even if I did not understand why), and thought that the most important thing was to clear the closed connections.

I am using tens of millions of cloned robots, so instantiating them all at the same time would not be feasible. An alternative would be to instantiate just enough clones for the ones that will concurrently used, but that won't be trivial to implement. For now, I simply pined the dart commit SHA we used to a previous commit. We are still facing #1460, but this becomes a problem only after cloning a couple of millions of robots. We can live with that for a short period of time.

mxgrey commented 4 years ago

If all the clones are identical and don't need to be modified, then I think a "clone pool" would be a good solution. Just have something like a std::vector<SkeletonPtr> to draw from, with a std::mutex protecting it. When a thread is finished, it locks the mutex and dumps its skeleton into the container instead of just discarding it. When a new thread is starting, it locks the mutex, and pulls a skeleton from the container (unless the container is empty, in which case the thread can clone from the original). If you need to make sure they start out with the same state, then you can copy the state of the original into the clone at the start of the thread. Besides being a workaround for this specific bug, you might also see some modest performance improvements by doing this.

costashatz commented 4 years ago

If all the clones are identical and don't need to be modified, then I think a "clone pool" would be a good solution. Just have something like a std::vector<SkeletonPtr> to draw from, with a std::mutex protecting it. When a thread is finished, it locks the mutex and dumps its skeleton into the container instead of just discarding it.

I am using a clone pool in all of my tests. This is working quite robustly and I do not have any big leaks. This is not very difficult to do. @Aneoshun I can share the code if you wish.

Besides being a workaround for this specific bug, you might also see some modest performance improvements by doing this.

Indeed this is faster than re-cloning every-time. I can confirm this in my tests. But of course, this specific bug should be addressed.

Aneoshun commented 4 years ago

Yes, I think this bug should be addressed, mainly to avoid future problems.

@costashatz your clone pool looks great. Could you please share your code? (you can send it by email if you prefer).

costashatz commented 4 years ago

@costashatz your clone pool looks great. Could you please share your code? (you can send it by email if you prefer).

Here's one way of doing the pool:

class MyClonePool {
public:
    static MyClonePool* instance()
    {
        static MyClonePool gdata;
        return &gdata;
    }

    MyClonePool(const MyClonePool&) = delete;
    void operator=(const MyClonePool&) = delete;

    dart::dynamics::SkeletonPtr get_skeleton()
    {
        std::lock_guard<std::mutex> lock(_skeleton_mutex);
        for (size_t i = 0; i < _num_skeletons; i++) {
            if (_free[i]) {
                _free[i] = false;
                return _skeletons[i];
            }
        }

        return nullptr;
    }

    void free_skeleton(const dart::dynamics::SkeletonPtr& skel)
    {
        std::lock_guard<std::mutex> lock(_skeleton_mutex);
        for (size_t i = 0; i < _num_skeletons; i++) {
            if (_skeletons[i] == skel) {
                _set_init_pos(skel);
                _free[i] = true;
                break;
            }
        }
    }

protected:
    std::vector<dart::dynamics::SkeletonPtr> _skeletons;
    std::vector<bool> _free;
    size_t _num_skeletons = 10; // set this number to your maximum number of concurrent skeletons (or slightly bigger to be sure! ;))
    std::mutex _skeleton_mutex;

    MyClonePool() {
        // load one skeleton from file and clone it _num_skeletons times
        // OR load _num_skeletons different skeletons from file
        // do not forget to fill _skeletons with your skeletons

        // Initialize all skeletons to free
        for (size_t i = 0; i < _num_skeletons; i++) {
            _free.push_back(true);
        }
    }

    void _set_init_pos(const dart::dynamics::SkeletonPtr& skel)
    {
        skel->resetPositions();
        skel->resetVelocities();
        skel->resetAccelerations();
        skel->clearExternalForces();

        // reset any altered properties (e.g., altered mass of a link) and set initial positions
    }
};

Then you use it like this:

void evaluation_function() {
    // Acquire a freed skeleton
    dart::dynamics::SkeletonPtr skel = nullptr;
    while (skel == nullptr) skel = MyClonePool::instance()->get_skeleton();

    // do your stuff with skel

   // Free your skeleton so that a new function can take it
   MyClonePool::instance()->free_skeleton(skel);
}

If you do not like singletons, you can define a global namespace and put your SkeletonPool object there (this is still a singleton but in a different way). If you do not like globals/singletons, you can create an object and pass it around your structures.

I hope this is helpful.

Aneoshun commented 4 years ago

Thanks a lot. I will use this while the clone function of the skeleton get fixed.

costashatz commented 3 years ago

@mxgrey @jslee02 any updates on this one? It's not good to have non-thread safe code hanging around..

costashatz commented 2 years ago

@mxgrey @jslee02 any updates?

dartsim / dart

Issue in thread safety after #1462 #1468