RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu
Other
3.35k stars 1.27k forks source link

tamsi usage not safe for thread-per-context parallel simulation #15674

Closed rpoyner-tri closed 3 years ago

rpoyner-tri commented 3 years ago

Noticed while working on https://github.com/RobotLocomotion/drake/pull/15644: Integration of TamsiSolver into MultibodyPlant is not thread-per-context safe.

For a quickly hacked up failing test, see here: https://github.com/rpoyner-tri/drake/tree/tamsi-thread-hazard Grab the branch and run it with:

$ bazel test --config=tsan --test_output=streamed //multibody/plant:multibody_plant_reaction_forces_test
rpoyner-tri commented 3 years ago

/cc @amcastro-tri @calderpg-tri @rcory beware of this. I hope to get to fixing it shortly.

amcastro-tri commented 3 years ago

Thanks @rpoyner-tri. I was considering having some sort of "Workspace" struct for the newer multibody::contact_solvers::internal::ContactSolvers that would help store the scratch space they need. We could then place the whole workspace in the cache. Wondering if that'd be a good solution.

rpoyner-tri commented 3 years ago

@amcastro-tri I'm pretty sure there are races in parameters as well. If you're a fan of big messy text logs I can attach a log of all the suspected-race stack traces here.

rpoyner-tri commented 3 years ago

tamsi-thread-hazard-tsan.log

rpoyner-tri commented 3 years ago

Scraping the above log has convinced me that pretty much every data member of TamsiSolver is vulnerable to races. Some thoughts, not necessarily mature yet:

I'm sure there are other possible plans.

Here's my scrape of the above log, indicating source lines of TamsiSolver mentioned in the log.

tamsi-thread-hazard-tsan-code-lines.log

amcastro-tri commented 3 years ago

@rpoyner-tri, taking a look at your log above in detail this is what I understand:

  1. Member variables of TamsiSolver definitely do lead to data races.
  2. Local variables within TamsiSolver's methods do not lead to data races (this is where my knowledge starts failing).

Is this understanding correct?

rpoyner-tri commented 3 years ago

@amcastro-tri correct. When thinking about threading and data races, a good place to start is call stacks. The call stack is the memory that contains all function arguments, and all local (non-static) function variables. Each thread has its own separate call stack (for everything called (recursively) from the thread's main function), so arguments and local variables (in a sane world -- no dangerous pointer games, say) can't create races.

The trouble starts with longer-lived data; memory owned by a process' main thread and exposed to other threads, or memory in heap blocks that are visible to multiple threads. Often, drake System classes are created by one thread and shared to other threads in the context-per-thread usage pattern. That's where races can come from; if we have member data of System classes that participate in value computations. It's perfectly fine for System classes to have member data that express structure; input/output ports, etc.

What is appealing about the drake context-per-thread usage pattern is that it is very clear how to make some data thread-safe: put it into the context. Of course contexts themselves are far from simple, but that is a topic for another day. :)

amcastro-tri commented 3 years ago

I love the explanation in terms of call stacks, very clear. Thanks!