ethz-asl / ethzasl_icp_mapping

3D mapping tools for robotic applications
273 stars 156 forks source link

Missing join or detach on background map thread #64

Closed cedricpradalier closed 6 years ago

cedricpradalier commented 6 years ago

https://github.com/ethz-asl/ethzasl_icp_mapping/blob/ec96db832be5ad4cf95e2551ef33797903a39bbb/ethzasl_icp_mapper/src/mapper.cpp#L480

If this thread is not detached or joined (I added mapBuildingThread.detach() after creation), the mapper will die with a thread_resource_error exception after reaching the maximum number of threads in /proc/sys/kernel/threads-max. This corresponds to nearly 70'000 iterations on my system, but it crashes nonetheless.

With the detach suggestion above, the problem disappears.

HannesSommer commented 6 years ago

Thanks for sharing this solution. This is probably a duplicate of #61. Joining seems the preferred solution to me (it seems likely to me that neither data structures nor algorithm are proplery protected against multiple background usage; or @cedricpradalier , did you check that?),

Till somebody finds the time to do that we should consider detaching as an intermediate solution.

cedricpradalier commented 6 years ago

Yes, I suspect it is a duplicate of #61 although the latter was compiled with Open MP in libnabo, where as I explicitly disabled it so far.

As for the data protection, this algorithm is setting mapBuildingInProgress = true and it won't launch the thread so long as this is not reset to false, so joining is not really useful. It could be done in processNewMapIfAvailable which sets the flag back to false and could call a join. The result is the same.

To be completely safe though, mapBuildingInProgress = true should be moved up before starting the thread, there is a minor race condition there, independently of using join or detach.

HannesSommer commented 6 years ago

Thanks a lot for having a closer look! You've convinced me. There is no need for a join then.

I also agree with the little race condition. But reordering would probably not be enough, because setting to true would need to be atomic together with retrieving the old value in order to be truly safe (otherwise two threads could see the false at the same time and both set it to true afterwards) http://en.cppreference.com/w/cpp/atomic/atomic_flag/test_and_set would be perfect for the job, but we don't use c++11 here, yet :(.

However, I have quite some doubt that the rest of Mapper::processCloud can be safely called in parallel (e.g. because it seems to use a shared libpointmatcher ICP object, which I believe to be unprotected). Until it is we probably don't have to worry about this race condition.

Does somebody know about the design goals here? Was Mapper::processCloud intended to be called possibly in parallel? Is it usually?

There seems to be another related minor bug: the Mapper-destructor waits for the mapBuildingFuture in https://github.com/ethz-asl/ethzasl_icp_mapping/blob/877a67471ab0b0d91b6a40e923a11e25c5306259/ethzasl_icp_mapper/src/mapper.cpp#L276 iff mapBuildingInProgress. This looks to me as if setting it to true before or after creation just chooses whether here we get potential exception (wait on a non existing future) or a resource leak (if it does not wait). It seems to me there is quite some locks missing in this implementation.

cedricpradalier commented 6 years ago

The test and set is not really critical here because the only threads testing the value are either the thread creator or the protected thread. In the worst cast, what would happen is that the k-th mapBuildingThread will return while the (k+1)-th starts. There is no possible actions during these overlaps and no consequence. Also in terms of protection, the only thing these threads are doing is setMap(updateMap) and they are the only one doing it. I assume that whoever implemented these actions in background made sure that they were thread-safe...

HannesSommer commented 6 years ago

So you assume already that there can't be two creators (Mapper::processCloud) in parallel?

Otherwise, I cannot follow you, as the (design) purpose of the flag is to have no more than one mapBuildingThread at a time, which it cannot fulfill safely without atomic test and set.

Once we accept two creators it can also happen that there are two mapBuildingThread starting at almost the same time (created in parallel). And they share for instance the unprotected transformation object.

I would not at all be as optimistic concerning thread safety :).

But I also don't believe that Mapper::processCloud is actually called in parallel - I just don't know it.

cedricpradalier commented 6 years ago

In the context mapper and dynamic_mapper, the ROS spinner is a normal one that guarantees that callbacks are called sequentially. Mapper::processCloud is called only from the ROS callbacks, hence you cannot have two instances of Mapper::processCloud running concurrently.

So yes, as long as no ros::AsyncSpinner is used, we can safely assume that processCloud runs alone.

HannesSommer commented 6 years ago

Thanks. Then let's assume that for the moment (it would probably fail at other places otherwise).

But then I would actually oppose moving the = true up as in #65 (at least without further changes):

Given the assumption it does not help Mapper::processCloud. But it could cause trouble in other functions, such as: https://github.com/ethz-asl/ethzasl_icp_mapping/blob/877a67471ab0b0d91b6a40e923a11e25c5306259/ethzasl_icp_mapper/src/mapper.cpp#L556

or the ~Mapper.

This is because moving it up might cause it (more probable than before) to be executed before the assignment of the future in https://github.com/ethz-asl/ethzasl_icp_mapping/blob/ec96db832be5ad4cf95e2551ef33797903a39bbb/ethzasl_icp_mapper/src/mapper.cpp#L479 (through compiler optimization since both variables are not volatile!)

If the order gets reversed it can happen that the future gets waited for before its first assignment, which should throw an exception as far as I know.

Instead we have already good experience with the old position of the assignment (the additional detach does not change the concurrency after all).

cedricpradalier commented 6 years ago

OK, I updated the pull-request following your suggestion.

HannesSommer commented 6 years ago

Done with PR #65.