dash-project / dash

DASH, the C++ Template Library for Distributed Data Structures with Support for Hierarchical Locality for HPC and Data-Driven Science
http://www.dash-project.org/

Improve thread-support of DART/DASH #118

Open devreal opened 7 years ago

devreal commented 7 years ago

The following features in DART and DASH should be improved to support thread-parallel access:

  1. Interface to control thread-aware initialization of the underlying communication library (mainly MPI)
  2. Communication operations (mainly put/get and atomic operations, collective operations will not be thread-safe due to restrictions in MPI)
  3. Global memory allocation (needs protection of internal shared data structures in DART)
  4. Team management may not need to be thread-safe, but we need to ensure it is not used in potentially thread-parallel code paths
  5. Improve documentation of which operations or operation combinations are thread-safe and which are not

See also #109.

fuchsto commented 7 years ago

I will take care of the global memory allocation and team management aspects. We should also think about meaningful tests and verification measures.

There is a light-weight, convenient unit test framework for thread-parallel test cases called partest from the siemens/embb ecosystem that I really like. I don't know if it is readily available for inclusion in a distribution, but it would be great if we could integrate it into our "private" CI at least.

Oh, and there is Divine for explicit state model checking, but I found it only practicable for very minimal, stripped-down use cases. Even then, verification usually takes the better part of a day.

(Then again, we have lots of CPU hours to spare on SuperMUC)

devreal commented 7 years ago

[UPDATE: Added third point]

I created a branch (bug-118-threadsafety) to deal with this and started investigating the road-blocks in the global memory allocation on the DART side. There are three issues right now:

  1. Segment management uses a global shared hash table to map segment IDs from DART gptrs to the segment information. There are three possible solutions:

    • Use lock-free lists inside the hash table buckets. This can be done using _Atomic in C11 but will cause some pain.
    • Use (spin-)locks to protect the global data-structure.
    • Since we only have 16-bit segment IDs and they are not reused, we could also pre-allocate information for all segments, which would be around 3 MByte with the current data structure. That is the easiest and fastest, but also least memory-efficient method ^^ (see the sketch after this list)
  2. The buddy allocator used for non-aligned allocations is not thread-safe. See #220

  3. There are some integral shared state variables as well. In the case of global memory, it's dart_memid. They can be handled through atomics. I will start implementing an abstraction for integral atomic operations after the holidays.
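
To make the trade-off of the third option concrete, here is a minimal C++ sketch (DART itself is plain C) of a pre-allocated segment table combined with an atomic counter for the shared integral state. segment_info_t, segment_table, next_segment_id and allocate_segment_id are made-up names, and the actual footprint depends on the real segment data structure:

```cpp
#include <array>
#include <atomic>
#include <cstdint>

// Hypothetical per-segment bookkeeping; the real DART struct is larger,
// which is where the ~3 MByte estimate above comes from.
struct segment_info_t {
  void         *base   = nullptr;
  std::uint64_t size   = 0;
  bool          in_use = false;
};

// One slot per possible 16-bit segment ID, allocated up front so lookups
// never touch a shared, growable hash table and need no locking.
static std::array<segment_info_t, 1 << 16> segment_table;

// Shared integral state such as dart_memid can become an atomic counter;
// fetch_add hands out unique segment IDs without a lock.
static std::atomic<std::uint16_t> next_segment_id{1};

std::uint16_t allocate_segment_id() {
  return next_segment_id.fetch_add(1, std::memory_order_relaxed);
}
```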

fuchsto commented 7 years ago

Fortunately, these issues are anything but exotic. These are classic parallel memory management problems. Not too long ago I was lucky enough to be involved in a project at Siemens CT, called EMBB, which addresses these challenges for embedded systems. It contains a very, very stable and portable atomic (CAS, FAA) interface for C99 and also provides lock-free and wait-free containers, but only for C++.

We don't have mandatory library dependencies and we should keep it that way, but I recommend having a look at EMBB's concepts and methods before working on thread-safe allocation.

@devreal Ah, I already sang the EMBB song to you when we discussed task models, I remember.

devreal commented 7 years ago

I started to work on making the local and global memory allocation in DART thread-safe. So far, I have used mutexes (pthread_mutex) to ensure mutual exclusion inside the buddy allocator and inside the segment data management. There is a new build variable that can be used to switch off threading support, which also eliminates the overhead from the mutexes. The increment/decrement of segment counters is done using atomics. We might want to provide non-atomic fall-backs in case the above-mentioned build variable is switched off.
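
For reference, the locking pattern looks roughly like the following C++ sketch; DART_ENABLE_THREADSUPPORT and the macro/function names are placeholders for the actual build variable and identifiers, not the real ones:

```cpp
#include <cstddef>
#include <pthread.h>

#ifdef DART_ENABLE_THREADSUPPORT
static pthread_mutex_t buddy_alloc_mutex = PTHREAD_MUTEX_INITIALIZER;
#  define DART_LOCK()   pthread_mutex_lock(&buddy_alloc_mutex)
#  define DART_UNLOCK() pthread_mutex_unlock(&buddy_alloc_mutex)
#else
// With threading support switched off, the lock/unlock calls compile away
// and there is no mutex overhead.
#  define DART_LOCK()   ((void)0)
#  define DART_UNLOCK() ((void)0)
#endif

void *buddy_allocate(std::size_t nbytes) {
  DART_LOCK();
  // ... walk and split the buddy free lists for nbytes bytes ...
  (void)nbytes;
  void *ptr = nullptr;
  DART_UNLOCK();
  return ptr;
}
```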

I also added dart_init_thread, which calls MPI_Init_thread and returns the provided threading support (reduced to either single or multiple).
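
Roughly, the reduction to the two levels looks like the sketch below (not the actual DART code; only MPI_Init_thread and the MPI constants are real names):

```cpp
#include <mpi.h>

enum class dart_thread_support { single, multiple };

// Request full thread support and reduce whatever MPI provides to either
// "single" or "multiple", as described above.
dart_thread_support init_thread_support(int *argc, char ***argv) {
  int provided = MPI_THREAD_SINGLE;
  MPI_Init_thread(argc, argv, MPI_THREAD_MULTIPLE, &provided);
  return (provided == MPI_THREAD_MULTIPLE) ? dart_thread_support::multiple
                                           : dart_thread_support::single;
}
```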

With this setup, we rely on the user to not use DART from within a threaded environment if the underlying MPI library does not provide threading support, i.e., there is no explicit exclusion inside DART communication routines.

Please have a look at https://github.com/dash-project/dash/compare/bug-118-threadsafety. I have not created a pull request because the documentation does not reflect the changes, yet.

devreal commented 7 years ago

I added documentation on the updated thread-safety restrictions for the DART part and fixed some issues. In the branch, dash::initialize now calls dart_init_thread if compiling with -DDASH_ENABLE_THREADING and provides the thread-support status through bool dash::is_threaded.
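
A usage sketch of how this is meant to be used from application code, assuming the names as written above (dash::is_threaded treated as a plain bool; dash::init and dash::finalize are the usual entry points):

```cpp
#include <libdash.h>
#include <thread>
#include <vector>

int main(int argc, char *argv[]) {
  // With -DDASH_ENABLE_THREADING, initialization goes through dart_init_thread.
  dash::init(&argc, &argv);

  if (dash::is_threaded) {
    // Only spawn threads issuing DASH/DART calls if MPI provides
    // MPI_THREAD_MULTIPLE.
    std::vector<std::thread> workers;
    // ... start threads that perform put/get on global memory ...
    for (auto &t : workers) {
      t.join();
    }
  }

  dash::finalize();
  return 0;
}
```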

I also added documentation for the methods in Init.cc.

From my point of view, this is all that is needed for DART (unless I missed a critical point, that is). As it stands right now, any DASH/DART operation that only involves communication is thread-safe. Local and global allocation in DART are thread-safe as well. I haven't had a closer look at the road blocks in DASH with respect to thread-safe global memory allocation but I suspect that the bookkeeping in the teams might be unsafe. @fuchsto do you want to look at that or should I have a closer look?

devreal commented 7 years ago

Some more thoughts I had today on thread-safe global memory allocation: (using this issue to document my thoughts)

The issue is more severe than I initially thought. While we can make all access to global data structures involved in global memory allocation thread-safe, team-aligned global memory allocation in the same team is inherently not thread-safe. The reason is that global memory allocation is a collective operation, which in turn is not thread-safe. At first glance, the MPI standard mandates that collective operations on the same communicator cannot be issued by multiple threads in parallel, because communication always happens between processes, not threads.

However, the problem exists even on a higher level: Assuming two threads allocate aligned global memory on the same team, the runtime system has no way of deciding which thread contributes to which allocation. So while we can ensure mutual exclusion to prevent data races on the lower levels, we cannot guarantee correctness. It will be up to the user of DASH to ensure that no two allocations on the same team happen in parallel. Similar restrictions apply to Team management and the majority of the dash::Algorithms, which use either temporary dash::Arrays or some form of collective operation (barriers, accumulate, etc.).
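
To illustrate the hazard, here is exactly what a user must not do: two threads performing collective, team-aligned allocation on the same team. Even with internal locking, the runtime cannot tell which thread's contribution on one unit belongs to which allocation on another unit, so the pairing can differ from unit to unit. (dash::Array and dash::Team::All() as in the current API; the rest is illustrative.)

```cpp
#include <libdash.h>
#include <thread>

void unsafe_concurrent_allocation() {
  auto &team = dash::Team::All();

  // Both constructors are collective on `team`; issuing them from two
  // threads in parallel leaves the matching of contributions undefined.
  std::thread t1([&team] { dash::Array<int> a(1000, team); });
  std::thread t2([&team] { dash::Array<int> b(2000, team); });

  t1.join();
  t2.join();
}
```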

In order to resolve this, we would need some form of asynchronous allocation process, in which allocations carry a globally unique identifier (most likely provided by the user) that is used to collect matching contributions on all units and perform the allocations sequentially. However, this can only be done reliably for allocations that are triggered by the user directly, not for temporary allocations in dash::Algorithm.
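
A purely hypothetical sketch of what such an id-tagged scheme could look like (none of these names exist in DASH/DART): requests are keyed by the user-provided tag and the underlying collective allocations are carried out one tag at a time, in tag order, so every unit performs them in the same sequence regardless of thread scheduling.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>

class tagged_team_allocator {
  std::mutex                           mtx_;
  std::map<std::uint64_t, std::size_t> requests_;  // tag -> requested bytes

public:
  // Called by any thread; the tag decides which allocation this request
  // belongs to, not the order in which threads happen to arrive.
  void request(std::uint64_t tag, std::size_t nbytes) {
    std::lock_guard<std::mutex> guard(mtx_);
    requests_[tag] = nbytes;
  }

  // Later, a single thread per unit drains the queue in ascending tag order,
  // so all units issue the matching collective allocations in the same order.
  void drain() {
    std::lock_guard<std::mutex> guard(mtx_);
    for (const auto &req : requests_) {
      // ... perform the collective team allocation of req.second bytes
      //     for tag req.first ...
      (void)req;
    }
    requests_.clear();
  }
};
```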

Maybe someone has a better thought on this :)

fuchsto commented 7 years ago

Hm, actually, multi-threaded global allocation is not supported intentionally because allocation and thread-safety are conceptually unrelated (as in: orthogonal semantics). It's exactly like in the STL. Your requirements are perfectly sound. They don't concern allocators, though. I suggest we schedule a Skype conference call with @rkowalewski to clarify this.

devreal commented 7 years ago

Mhh, I think the semantics here are different. The documentation at https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_concurrency.html reads:

In the terms of the 2011 C++ standard a thread-safe program is one which does not perform any conflicting non-atomic operations on memory locations and so does not contain any data races.

In particular, this means that you can, for example, call std::find from multiple threads on the same or different containers. In DASH, you cannot call dash::find_if from multiple threads in the same team, as this would cause a data race on some internal state (the team) that is hidden from the user.
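
For comparison, this is what the STL guarantee quoted above allows: concurrent calls to std::find on the same container are fine because they only perform non-conflicting reads, whereas a concurrent dash::find_if on the same team would touch hidden, mutable team state.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

int main() {
  const std::vector<int> data{1, 2, 3, 4, 5};

  // Both threads only read `data`, so there is no data race.
  std::thread t1([&data] { (void)std::find(data.begin(), data.end(), 3); });
  std::thread t2([&data] { (void)std::find(data.begin(), data.end(), 5); });

  t1.join();
  t2.join();
  return 0;
}
```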

The text at the link above goes on to read:

The user code must guard against concurrent function calls which access any particular library object's state when one or more of those accesses modifies the state. An object will be modified by invoking a non-const member function on it or passing it as a non-const argument to a library function.

We do not pass the team to the DASH algorithms so the user is not aware of the underlying race condition.

But yes, let's schedule a telco on this. I'm in my office all week. Just shoot me an email with the times that work for you.

fuchsto commented 7 years ago

Related: Execution policies as discussed in #104