chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

proposal: language support for critical sections and aggregation #12306

Open mppf opened 5 years ago

mppf commented 5 years ago

This issue asks whether we should add language support for critical sections and whether such support could enable smooth expression of certain aggregation patterns.

What do other languages use for critical-section (or similar) syntax?

Python

Python supports this pattern with context managers and the `with` statement (see e.g. https://en.wikibooks.org/wiki/Python_Programming/Context_Managers ).

What might this look like in Chapel? (Supposing there is a lock library - see #9881).

var myLock: Lock;
with myLock {
  // lock is taken on entry to this block
  criticalSection();
  // lock is automatically released on exit from this block
  // even if exiting the block due to an error exit
}

C++ / RAII

Now, as noted in #10341, we could already write this using an RAII data type:

var myLock: Lock;
{
  var locked = new HeldLock(myLock); // initialization takes lock
  criticalSection();
  // `locked` deinit releases lock automatically on exit from this block
  // even if exiting the block due to an error exit
}

However, this pattern seems a bit awkward in this case because the name of the variable `locked` is irrelevant (as is the fact that it's a variable at all).

Java

In Java, critical sections are easy to write with the synchronized block. Here's what that might look like in Chapel:

synchronized(myLock) { // note, in Java you could pass an arbitrary object there, not just a lock
  // lock taken on entry to block
  criticalSection();
  // lock released on exit
}

Further motivation

Why am I interested in this question?

Let's consider the three cases below.

Suppose we had syntax like sync on <object> { body } that would create the appropriate synchronization on entry to the block, run the body (moving execution to the locale of the object, if it is remote), and release the synchronization on exit.

The implementation of sync on would be similar to the Python with context manager in that it calls methods on the noted object in order to create the synchronization. For now, we'll suppose that it calls syncOnEnter and syncOnExit methods.

Critical section with lock object

var myLock: Lock;
sync on myLock {
  // lock taken on entry to block
  criticalSection();
  // lock released on exit
}

We could consider allowing this to be written sync myLock { } in the event that the programmer did not want to move execution when myLock is remote (which would be an unusual idiom).

The Lock type would simply support a syncOnEnter method that takes the lock and a syncOnExit method that releases it. The compiler would arrange to call these on entry to and exit from the block.
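For concreteness, here is a minimal sketch of what such a Lock type might look like, built as a spinlock on an atomic bool (an assumption, since the lock library of #9881 does not exist yet):

record Lock {
  var isSet: atomic bool;

  proc ref syncOnEnter() {
    // spin until the transition false -> true succeeds
    while isSet.testAndSet() { }
  }

  proc ref syncOnExit() {
    // release the lock
    isSet.clear();
  }
}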

For example, the compiler could translate it into:

var myLock: Lock;
{
  myLock.syncOnEnter();
  defer { myLock.syncOnExit(); }
  criticalSection();
}

Populating a distributed set in parallel

What about the case of populating a distributed set in parallel? This is the case that https://chapel-lang.org/CHIUW/2016/Ferguson-slides.pdf studied, but the problem with the work described there is that it wasn't general enough: it involved writing handlers for the specific case of distributed set construction. Can we do better with some higher-level language ideas?

Here is a sketch of what the code would look like:

var Pairs: domain( (int, int) ) dmapped Hashed(idxType=(int, int), locking=false);
forall (i,j) in AvailablePairs {
  sync on Pairs[(i,j)] {
    Pairs += (i,j);
  }
}

Here the idea is that the sync on would move execution to the locale owning the hashtable row for (i,j) and call the domain's syncOnEnter/syncOnExit methods with (i,j) as the key on entry to and exit from the block.

Additionally, since the sync on is the last statement in the forall, it can be done in an unordered manner, following #12150, so the (i,j) elements to be added can be collected and distributed in bulk.

What would the implementation look like?

First, note that the Pairs HashedDomain would include syncOnEnter() and syncOnExit() methods that accept a key (so that it can lock per hashtable row if desired). The compiler would arrange to pass (i,j) as the argument to these in this particular case. It could thus translate the loop into:

var Pairs: domain( (int, int) ) dmapped Hashed(idxType=(int, int), locking=false);
forall (i,j) in AvailablePairs {
  {
    Pairs.syncOnEnter( (i,j) );
    defer { Pairs.syncOnExit( (i,j) ); }
    Pairs += (i,j);
  }
}
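For reference, here is a sketch of what those keyed methods might look like, assuming the hashed domain keeps one spinlock per hashtable row (the rowLocks field and rowFor helper are hypothetical names):

proc HashedDom.syncOnEnter(key) {
  const row = rowFor(key);              // hash the key to its row
  while rowLocks[row].testAndSet() { }  // spin until the row lock is held
}

proc HashedDom.syncOnExit(key) {
  rowLocks[rowFor(key)].clear();        // release the row lock
}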

But the real fun begins if the last-statement-in-forall unordered optimization fires (#12150 / #12269). In that event, the compiler and module code would aggregate by translating it into:

var Pairs: domain( (int, int) ) dmapped Hashed(idxType=(int, int), locking=false);
forall (i,j) in AvailablePairs {
  bundle = create-on-statement-bundle(syncOnFn, Pairs, i, j);

  enqueueInTaskLocalDestinationBuffer(Pairs, (i,j), syncOnFn, bundle);
  // at end of task, flushTaskLocalDestinationBuffer
}
// this is an outlined function created by the compiler
proc syncOnFn(Pairs, i, j) {
  Pairs += (i,j); // no locking needed here
}

// sketch of buffering - hopefully most of this is module code
proc enqueueInTaskLocalDestinationBuffer(dstObject, dstKey, fn, bundle) {
  dstLocale = dstObject.syncOnReceiver(dstKey);
  enqueue (dstObject, dstKey, fn, bundle) into this task's buffer for dstLocale
}

// sketch of how the flushing would work - hopefully most of this is module code
proc flushTaskLocalDestinationBuffer() {
  // sort destination buffer by target locale and then by target function
  coforall targetLocale in destinationLocales {
    on targetLocale {
      GET the bundles from the destination buffer
      for each chunk with the same dstObject, dstKey {
        dstObject.syncOnEnter(dstKey);
        for (dstObject, dstKey, fn, bundle) in chunk {
          fn(unwrapped bundle-args);
        }
        dstObject.syncOnExit(dstKey);
      }
    }
  }
}

If this strategy is successful then sync on becomes common usage for updating a domain or array, and I think it would make sense to completely remove the locking argument from DefaultAssociative and HashedDom/Arr. That solves the problem of needing to know not to take the lock inside of Pairs. (However, we might nonetheless need a transitional strategy.)

Updating a distributed histogram

var DistributedIds: domain( int ) dmapped Hashed(idxType=int);
var Counts:[DistributedIds] int;

forall i in AvailableIds {
  sync on Counts[i] {
    DistributedIds += i;
    Counts[i] += 1;
  }
}

Here the idea is that the sync on would move execution to the locale owning Counts[i] and call the syncOnEnter/syncOnExit methods with i as the key, so that the domain addition and the count increment happen under the same per-row lock.

As above, since the sync on is the last statement in the forall, it can be done in an unordered manner, so the histogram updates can be aggregated and sent out in bulk.

This case is similar to the above (with slight differences in what arguments are bundled into the syncOnFn). The syncOnFn would consist of:

proc syncOnFn(DistributedIds, Counts, i) {
  DistributedIds += i; // no locking needed here
  Counts[i] += 1; // or here
}

Other potential examples

mppf commented 4 years ago

Another related idea is to extend on MyDataStructure to run, say, proc DataStructure.on(...); it would then be relatively straightforward to support such patterns.
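Purely as an illustrative sketch of that idea, assuming blocks could be passed as first-class procedures and that localeFor is a method the data structure provides (both are assumptions), on Pairs[(i,j)] { body } might dispatch to something like:

// hypothetical method invoked by `on Pairs[(i,j)] { body }`
proc DataStructure.on(key, body) {
  on this.localeFor(key) {         // move execution to the owner of `key`
    this.syncOnEnter(key);
    defer { this.syncOnExit(key); }
    body();
  }
}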

mppf commented 3 years ago

@dlongnecke-cray - it would be interesting to include some ideas from this issue as we develop a menu of choices to make when designing a context manager.

vasslitvinov commented 2 years ago

Here is a way to provide this feature using a manage statement, as proposed by @mppf, using the distributed histogram as an example. First off, if the user tries to take a ref of the histogram element, it will be disallowed because the data structure does not support it:

forall idx in inputs {
  Counts[idx] += 1;   // error: use 'manage' to update elements
}

The user needs to use this style:

forall idx in inputs {
  manage Counts(idx) as element {
    element += 1;
  }
}

The compiler queries Counts for the availability of a designated API method. If one is available, it converts the forall loop to aggregation style:

forall idx in inputs with 
  (var agg = Counts.createAggregator(update=lambda(idx,element){element += 1;})) 
{
  agg.add(idx);
}

The aggregator will batch up and send out the updates by the time its task completes, at the latest. The destination locale will perform the updates in batch by applying the updating lambda to each element. If a given idx is not yet in the index set of Counts, it will add that index and default-initialize the corresponding element before applying the lambda to it.
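For concreteness, here is a minimal sketch of what such a per-task aggregator might look like. All names here (the record, applyUpdate, the buffering policy) are hypothetical, and a real aggregator would keep one buffer per destination locale and ship each batch with a single message rather than element-wise:

use List;

record UpdateAggregator {
  var pending: list(int);  // indices buffered by this task
  const cap = 1024;        // flush threshold

  proc ref add(idx: int) {
    pending.pushBack(idx);
    if pending.size >= cap then flush();
  }

  proc ref flush() {
    // a real implementation would sort by owning locale and ship each
    // batch in bulk; applyUpdate stands in for "add the index if absent,
    // then apply the updating lambda to its element"
    for idx in pending do
      Counts.applyUpdate(idx);
    pending.clear();
  }

  proc deinit() {
    flush();  // push out any remaining updates when the task completes
  }
}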

One proposal is to make the pattern manage Counts(idx) as element { ... do something with element ... } be the standard way to operate on an element of a parallel, possibly distributed, data structure. If it occurs at a location other than the last statement of a forall or task-parallel construct, it should add/update the element atomically without aggregation.

A variation on this proposal is to allow a data structure to delay the propagation of updates to remote locales -- and perhaps even to other tasks on the same locale -- until a memory fence is executed in accordance with Chapel's memory consistency model. This would be part of the semantics of the data structure. To implement these semantics, a manage Counts(idx) statement would cause the compiler to obtain a handle from Counts so that a designated API method is invoked on this handle when the handle's task performs a memory fence. This handle can be created at the start of the task or at the declaration of Counts, whichever comes later. In this case the manage statement can perform aggregation as well, except this time, instead of the updating lambda, it will store the new value for each "managed" index.

[Correction to the previous paragraph 3/24] The complexity of obtaining a handle may not be needed. A notification of a memory fence could go directly to Counts via an API method.

vasslitvinov commented 2 years ago

An alternative way to perform bulk updates is to use a designated method passing it a lambda:

Counts.bulkUpdate(inputs, lambda(idx,element){element += 1;});
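As a sketch of how such a bulkUpdate might be implemented inside the data structure (HashedArr, ownerOf, addIfAbsent, and localElement are all hypothetical names), each locale could walk the inputs and apply the updating lambda to the indices it owns:

proc HashedArr.bulkUpdate(inputs, updater) {
  coforall loc in Locales do on loc {
    for idx in inputs do
      if this.ownerOf(idx) == loc {
        this.addIfAbsent(idx);                 // default-initialize if new
        updater(idx, this.localElement(idx));  // element updated in place
      }
  }
}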

The advantages of these approaches are:

mppf commented 2 years ago

A couple of other points:

A variation on this proposal is to allow a data structure to delay the propagation of updates to remote locales

I call this "data structure consistency". Another element of it is that I expect that if you, say, add x to a distributed set, and then query from that same task if x is in the set, it would say "yes". But the set element might belong on a remote locale and it would just check the queued-up operations to know that x was added. In this way it is similar to --cache-remote.
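To illustrate the intended semantics (MyDistSet and its methods are hypothetical):

MyDistSet.add(x);               // may only be queued on this task so far
assert(MyDistSet.contains(x));  // must still answer true: contains()
                                // consults this task's queued operations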

In any case, I think that the understandability of the programs is still improved by having it be clear when the task of updating an element is completed. In particular, if we imagine a "data structure consistency" approach that allows references, one could write

ref x = MyDistributedMap[i];
fence();
x += 1;

which I think is a challenging case to handle correctly in the implementation - meaning I think getting all of this right would be easier if that pattern is simply not possible.

vasslitvinov commented 2 years ago

If we go with the sync syntax, how does it know that (i,j) is the index that the lock is being held for in the body of the sync? For example:

sync on Pairs(i,j) {
  Pairs += (i,j);
  Pairs.add(i,j);
  Pairs(i,j) = k; // if this were meaningful
  Pairs.remove(i,j);
  Pairs.doWhoKnowsWhat(a,i,j,b);
}

My proposal: