Closed rpoyner-tri closed 3 years ago
/cc @amcastro-tri @calderpg-tri @rcory beware of this. I hope to get to fixing it shortly.
Thanks @rpoyner-tri. I was considering having some sort of "Workspace" struct for the newer multibody::contact_solvers::internal::ContactSolver
s that would help store the scratch space they need. We could then place the whole workspace in the cache. Wondering if that'd be a good solution.
@amcastro-tri I'm pretty sure there are races in parameters as well. If you're a fan of big messy text logs I can attach a log of all the suspected-race stack traces here.
Scraping the above log has convinced me that pretty much every data member of TamsiSolver is vulnerable to races. Some thoughts, not necessarily mature yet:
I'm sure there are other possible plans.
Here's my scrape of the above log, indicating source lines of TamsiSolver mentioned in the log.
@rpoyner-tri, taking a look at your log above in detail this is what I understand:
Is this understanding correct?
@amcastro-tri correct. When thinking about threading and data races, a good place to start is call stacks. The call stack is the memory that contains all function arguments, and all local (non-static) function variables. Each thread has its own separate call stack (for everything called (recursively) from the thread's main function), so arguments and local variables (in a sane world -- no dangerous pointer games, say) can't create races.
The trouble starts with longer-lived data; memory owned by a process' main thread and exposed to other threads, or memory in heap blocks that are visible to multiple threads. Often, drake System classes are created by one thread and shared to other threads in the context-per-thread usage pattern. That's where races can come from; if we have member data of System classes that participate in value computations. It's perfectly fine for System classes to have member data that express structure; input/output ports, etc.
What is appealing about the drake context-per-thread usage pattern is that it is very clear how to make some data thread-safe: put it into the context. Of course contexts themselves are far from simple, but that is a topic for another day. :)
I love the explanation in terms of call stacks, very clear. Thanks!
Noticed while working on https://github.com/RobotLocomotion/drake/pull/15644: Integration of TamsiSolver into MultibodyPlant is not thread-per-context safe.
For a quickly hacked up failing test, see here: https://github.com/rpoyner-tri/drake/tree/tamsi-thread-hazard Grab the branch and run it with: