chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.77k stars 417 forks source link

[GSoC 2019] Distributed Non-Blocking Algorithms and Data Structures #13690

Closed dgarvit closed 5 years ago

dgarvit commented 5 years ago

As part of Google Summer of Code 2019, both @LouisJenkinsCS and I have laid down the infrastructure for creating non-blocking algorithms and data structures in both shared and distributed memory. This issue is intended to discuss the API and user-facing semantics of said infrastructure, as well as discuss our decision choices and answer any questions that come our way. This effort has spawned three individual contributions to chapel: 1) LocalAtomics, a user-facing abstraction for performing atomics on arbitrary unmanaged objects, including objects which happen to be remote; 2) EpochManager and DistributedEpochManager, a scalable shared-memory and distributed-memory implementation of epoch-based reclamation that enable concurrent-safe memory reclamation of objects arbitrary unmanaged objects; 3) LockFreeQueue, a shared-memory implementation of Michael and Scott's Queue using the EpochManager which demonstrates its usage and acts as its own stand-alone data structure contribution.

LocalAtomics

The LocalAtomics module is a tested package that provides the ability to perform atomic operations on arbitrary unmanaged objects. To circumvent the infamous ABA problem, we provide an ABA wrapper that contains both the object as well as an ABA counter; all atomic operations on an ABA counter is performed via Intel's CMPXCHG16B assembly instruction. This is used to implement the EpochManager and DistributedEpochManager, and so we provide as a user-facing abstraction.

Example Usage

// Creating a local atomic object;
var atomicObj : LocalAtomicObject(unmanaged object);
var x = new unmanaged object():
var y = new unmanaged object();

// Simple interface
atomicObj.write(x);
var _x = x.read();
assert(_x == x);
x.compareExchange(_x, y);
var z = x.exchange(nil);

// Nearly identical ABA interface
atomicObject.write(x);
var w = atomicObject.readABA();
atomicObject.writeABA(y);
atomicObject.writeABA(x);
// Will fail despite object being identical
assert(atomicObject.compareExchange(w, y) == false); 

Design Decision: Why just unmanaged?

The unmanaged type, unlike shared and owned, are represented as either a narrow pointer (64-bit virtual address) or a wide pointer (128-bit struct consisting of 64-bit virtual address, a 32-bit locale id, and a 32-bit sub-locale id). This is necessary for pointer compression which represents objects as just 64-bit objects.

Epoch Manager

Epoch-Based Memory Reclamation, when decoupled with Read-Copy-Update, is a very powerful and the current state-of-the-art in research literature data structures, and is comparable to other styles of memory reclamation. We chose epoch-based emory reclamation as it offers the ability to both denote lifetimes via 'epochs', in which unlike hazard pointers or shared atomic reference counting, it is not restricted to controlling the lifetime of a single object at a time. In Epoch-Based Reclamation, tasks enter a 'read-side critical section' and enter the current epoch e; objects deleted during e defer deletion in a 'limbo list' assocaited with e, and all objects in said limbo list will not be reclaimed until the current epoch has been advanced to e + 2. The current epoch cannot be advanced from e to e + 1 until there are no tasks in e - 1, ensuring that no task has access to objects deleted in e - 1. The disadvantage is that there is potentially unbounded memory that cannot be reclaimed if a task remains in an epoch for an excess amount of time; this is easily possible in Chapel if the system is oversubscribed and if there is no communication (implicit task yields) or explicit task yielding.

An epoch manager is intended to be restricted to a specific data structure, for example a non-blocking data structure. We offer both a shared-memory optimized variant, EpochManager, as well as a global-view distributed variant DistributedEpochManager that makes use of privatization. Both variants require tasks to 'register' to acquire an owned token, which will 'unregister' the task after the owned token gets reclaimed. The owned token keeps track of the task's current epoch, which can be used to pin to enter a read-side critical section, or to unpin to exit a read-side critical section. Once an object has been logically removed from a data structure, it can then be safely deferred for reclamation via invocation of delete_obj. The 'logical' removal step is important, as it is not thread-safe for multiple threads to call delete_obj on the same object. The delete_obj treats all objects as unmanaged object and it is safe to mix-and-match different types of objects.

Example Usage

class A {}
class B {}
// Create epoch managers
var manager = new EpochManager();
var distManager = new DistributedEpochManager();

// Register
manager.register();
distManager.register();

// Pin
var tok = manager.pin();
var distTok = distManager.pin();

// Delete Object (any type of object)
tok.delete_obj(new unmanaged A());
distTok.delete_obj(new managed A());
tok.delete_obj(new unmanaged B());
distTok.delete_obj(new unmanaged B());

// Unpin 
tok.unpin();
distTok.unpin();

// Forall Loops and Registration
forall x in shmemX with (var tok = tok.register(), var ops : int) {
   tok.pin();
   tok.delete_obj(new unmanaged object());
   tok.unpin();
   ops += 1;
   if ops % 1024 == 0 do tok.try_reclaim();
}
forall x in distX with (var distTok = distTok.register(), var ops : int) {
   // Note: No communication thanks to privatization
   // Also 'global-view distributed data structure'
   distTok.pin();
   distTok.delete_obj(new unmanaged object());
   distTok.unpin();
   ops += 1;
   if ops % 1024 == 0 do distTok.try_reclaim();
}

// Due to privatization and record-wrapping, you _must_ explicit destroy DistributedEpochManager
distManager.destroy();

LockFreeQueue

LockFreeQueue is an implementation of Michael & Scott's Queue in Chapel that makes use of the EpochManager internally. To show the performance benefits of the EpochManaged, we present a small table that shows performance comparisons to the the "Epoch Managed MS-Queue", what is being presented here as a contribution; to the 'Two-Lock' variant of the Michael & Scott Queue that uses Test-And-Set loop and one which uses a Test-And-Test-And-Set loop, both being variants of spinlocks, and finally one recycling memory and using the ABA wrapper introduced in LocalAtomics module.

Test Time
Epoch Managed MS-Queue 37.7992s
Two-Lock (TATAS) MS-Queue 69.8408s
Two-Lock (TAS) MS-Queue 148.661s
Recycled (ABA) MS-Queue 154.375s

The module will automatically try to reclaim after every so often, but also exposes the ability to explicit trigger a 'garbage collection' and forward the epoch via invocation of try_reclaim.

Example Usage

var lfq = new LockFreeQueue(int);
forall i in 1..1024 with (var tok = lfq.getToken()) {
   lfq.enqueue(i, tok);
}
forall i in 1..1024 with (var tok = lfq.getToken()) {
   var (hasElt, elt) = lfq.dequeue(tok);
}
lfq.try_reclaim();

Appendix

ABA Problem

The 'ABA' problem occurs when a task T1 reads the value A from location L, another task T2 writes B to L, and another task T3 writes the value A to L; once T1 checks to see if L has changed, it will incorrectly assume that it has not. To make this more concrete, think of A and B both as a node in a linked list; T1 reads A, T2 allocates a new node B and writes it to L and deletes A, and T3 allocates a new node which just so happens to be the same piece of memory that A had before and writes it to L. Atomic operations such the compareExchange will succeed despite the fact that the nodes are not the same as it will perform the operation based on the virtual address.

LouisJenkinsCS commented 5 years ago

@mppf @ben-albrecht

mppf commented 5 years ago

The owned token keeps track of the task's current epoch, which can be used to pin to enter a read-side critical section, or to unpin to exit a read-side critical section.

I think another way to put this is that any allocated object won't be freed while the epoch manager is pined.

delete_obj

We use camelCase for methods so it should be called deleteObj but I think a better name would be deferDelete.

LouisJenkinsCS commented 5 years ago

I should note that the reason that DistributedEpochManager is privatized is because each locale has its own set of 'limbo lists' that it defers deleted objects to, a cached 'local epoch' to eliminate communication necessary for when a task joins the current epoch, and its own pool of tokens that it manages for registered tasks. Everything here is local, and believe me when I say it is necessary; DistributedEpochManager cannot and will not be made a normal class.

Edit: Performance of DistributedEpochManager reclaiming mixture of remote objects in distributed context

Locales Time (seconds)
1 6.0201
2 10.0265
4 5.68051
8 2.7396
16 1.68223
32 0.762418
64 0.411001

Microbenchmark

    use Time;
    config const numObjects = 16 * 1024 * 1024;
    var objsDom = {0..#numObjects} dmapped Cyclic(startIdx=0);
    var objs : [objsDom] unmanaged object();
    var manager = new DistributedEpochManager();
    forall obj in objs with (var rng = new RandomStream(int)) {
      on Locales[abs(rng.getNext()) % numLocales] do obj = new unmanaged object();
    }
    var timer = new Timer();
    timer.start();
    forall obj in objs with (var tok = manager.register()) {
      tok.pin();
      tok.delete_obj(obj);
      tok.unpin();
    }
    writeln("Done delete_obj");
    manager.clear();
    timer.stop();
    writeln("Time: ", timer.elapsed());
    timer.clear();
    manager.destroy();
  }
}
LouisJenkinsCS commented 5 years ago

After speaking with @mppf, I can agree to emulating owned for the DistributedEpochManager record-wrapper; having an optional isOwned field that the user can opt into (or opt out of), can remove the need to call .destroy() for most use-cases. To ensure that remote-value forwarding nor other implicit copies result in the copies having isOwned=true, we can possibly implement...

1) chpl_serialize and chpl_deserialize to ensure remote-value forwarding always sets isOwned=false. 2) init= and = to ensure that the RHS instances have isOwned=false.

@dgarvit

gbtitus commented 5 years ago

I like the idea. I don't think the documentation should mention CMPXCHG16B, though, or any other low-level implementation details. Documenting implementations at that level leads to one of two forms of sorrow: someone depends on the documentation and thus the implementation and then (1) the implementation changes (as all eventually do) the dependent code breaks or (2) the dependent code turns out to be "important" and the need to keep it working prevents changing the implementation even though it would be nice to be able to do that. So just say that you've solved the ABA problem by doing atomic operations on object+counter pairs, without saying exactly how.

This also brings up: what's the story for ARM? Are you going to just document that this code can't function properly on CPU architectures that don't have a 128-bit compare-exchange capability? Or were you planning on working around that? If not, it might be wise to add some (param-based) code to the implementation so that it refuses to run on such processors.

(Aside: It's not actually "Intel's CMPXCHG16B instruction, by the way. In one of the more embarrassing episodes of computer history, Intel dropped the ball with respect to their very own ISA back in the 1990s and it was AMD that produced the original x86_64 spec. It may have been Intel that produced the first hardware that implemented the instruction, though; glancing through Wikipedia that part wasn't clear.)

LouisJenkinsCS commented 5 years ago

That's a very good point @gbtitus, I didn't consider portability a problem since I assumed cmpxchg16b or something like it was universally available on newer systems. The ABA wrapper is only used currently in the epoch manager data structures but in the future we could possibly use locks when we cannot rely on it (conditionally of course). The epoch manager would be a solution to the ABA problem by itself (preventing reclamation and hence reuse of memory); ABA is provided more as a building block. By the way, how would you check if CMPXCHG16B is unavailable anyway in Chapel? I don't remember there being an environment variable for it.

Interesting tidbit about CMPXCHG16B not being strictly Intel's.

mppf commented 5 years ago

ARM 8.1 has 128-bit CAS https://en.wikichip.org/wiki/arm/armv8.1

gbtitus commented 5 years ago

By the way, how would you check if CMPXCHG16B is unavailable anyway in Chapel? I don't remember there being an environment variable for it.

There isn't one, no. It would have to be an indirect check such as CHPL_TARGET_ARCH=='x86_64'.

dgarvit commented 5 years ago

Given the PR has been merged, can we close this issue?