chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

proposal: language support for critical sections and aggregation #12306

Open mppf opened 5 years ago

mppf commented 5 years ago

This issue asks whether we should add language support for critical sections and whether such support could enable smooth expression of certain aggregation patterns.

What do other languages use for critical-section (or similar) syntax?

Python

Python supports this pattern with context managers and the `with` statement (see e.g. https://en.wikibooks.org/wiki/Python_Programming/Context_Managers ).

What might this look like in Chapel? (Supposing there is a lock library - see #9881).

var myLock: Lock;
with myLock {
  // lock is taken on entry to this block
  criticalSection();
  // lock is automatically released on exit from this block
  // even if exiting the block due to an error exit
}

C++ / RAII

Now, as noted in #10341, we could already write this using an RAII data type:

var myLock: Lock;
{
  var locked = new HeldLock(myLock); // initialization takes lock
  criticalSection();
  // `locked` deinit releases lock automatically on exit from this block
  // even if exiting the block due to an error exit
}

However, this pattern seems a bit awkward in this case because the name of the variable `locked` is irrelevant (as is the fact that it's a variable at all).

Java

In Java, critical sections are easy to write with the synchronized block. Here's what that might look like in Chapel:

synchronized(myLock) { // note, in Java you could pass an arbitrary object there, not just a lock
  // lock taken on entry to block
  criticalSection();
  // lock released on exit
}

Further motivation

Why am I interested in this question?

Let's consider the three cases below.

Suppose we had syntax like sync on <object> { body } that would create the appropriate synchronization on entry to the block, run the body (moving execution to the locale of the object, if it is remote), and release the synchronization on exit.

The implementation of sync on would be similar to the Python with context manager in that it calls methods on the noted object in order to create the synchronization. For now, we'll suppose that it calls syncOnEnter and syncOnExit methods.

Critical section with lock object

var myLock: Lock;
sync on myLock {
  // lock taken on entry to block
  criticalSection();
  // lock released on exit
}

We could consider allowing this to be written sync myLock { } in the event that the programmer did not want to move execution when myLock is remote (which would be an unusual idiom).

The Lock type would simply support a syncOnEnter method that takes the lock and a syncOnExit method that releases it. The compiler would arrange to call these on entry to and exit from the block.
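For concreteness, here is a minimal sketch of what such a Lock type might look like, built as a spinlock on an atomic bool (an assumption, since the lock library of #9881 does not exist yet):

record Lock {
  var isSet: atomic bool;

  proc ref syncOnEnter() {
    // spin until the transition false -> true succeeds
    while isSet.testAndSet() { }
  }

  proc ref syncOnExit() {
    // release the lock
    isSet.clear();
  }
}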

For example, the compiler could translate it into:

var myLock: Lock;
{
  myLock.syncOnEnter();
  defer { myLock.syncOnExit(); }
  criticalSection();
}

Populating a distributed set in parallel

What about the case of populating a distributed set in parallel? This is the case that https://chapel-lang.org/CHIUW/2016/Ferguson-slides.pdf studied, but the problem with the work described there is that it wasn't general enough: it involved writing handlers for the specific case of distributed set construction. Can we do better with some higher-level language ideas?

Here is a sketch of what the code would look like:

var Pairs: domain( (int, int) ) dmapped Hashed(idxType=(int, int), locking=false);
forall (i,j) in AvailablePairs {
  sync on Pairs[(i,j)] {
    Pairs += (i,j);
  }
}

Here the idea is that the sync on would move execution to the locale owning the hashtable row for (i,j) and call the domain's syncOnEnter/syncOnExit methods with (i,j) as the key on entry to and exit from the block.

Additionally, since the sync on is the last statement in the forall, it can be done in an unordered manner, following #12150, so the (i,j) elements to be added can be collected and distributed in bulk.

What would the implementation look like?

First, note that the Pairs HashedDomain would include syncOnEnter() and syncOnExit() methods that accept a key (so that it can lock per hashtable row if desired). The compiler would arrange to pass (i,j) as the argument to these in this particular case. It could thus translate the loop into:

var Pairs: domain( (int, int) ) dmapped Hashed(idxType=(int, int), locking=false);
forall (i,j) in AvailablePairs {
  {
    Pairs.syncOnEnter( (i,j) );
    defer { Pairs.syncOnExit( (i,j) ); }
    Pairs += (i,j);
  }
}
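For reference, here is a sketch of what those keyed methods might look like, assuming the hashed domain keeps one spinlock per hashtable row (the rowLocks field and rowFor helper are hypothetical names):

proc HashedDom.syncOnEnter(key) {
  const row = rowFor(key);              // hash the key to its row
  while rowLocks[row].testAndSet() { }  // spin until the row lock is held
}

proc HashedDom.syncOnExit(key) {
  rowLocks[rowFor(key)].clear();        // release the row lock
}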

But the real fun begins if the last-statement-in-forall unordered optimization fires (#12150 / #12269). In that event, the compiler and module code would aggregate by translating it into:

var Pairs: domain( (int, int) ) dmapped Hashed(idxType=(int, int), locking=false);
forall (i,j) in AvailablePairs {
  bundle = create-on-statement-bundle(syncOnFn, Pairs, i, j);

  enqueueInTaskLocalDestinationBuffer(Pairs, (i,j), syncOnFn, bundle);
  // at end of task, flushTaskLocalDestinationBuffer
}
// this is an outlined function created by the compiler
proc syncOnFn(Pairs, i, j) {
  Pairs += (i,j); // no locking needed here
}

// sketch of buffering - hopefully most of this is module code
proc enqueueInTaskLocalDestinationBuffer(dstObject, dstKey, fn, bundle) {
  dstLocale = dstObject.syncOnReceiver(dstKey);
  enqueue (dstObject, dstKey, fn, bundle) into this task's buffer for dstLocale
}

// sketch of how the flushing would work - hopefully most of this is module code
proc flushTaskLocalDestinationBuffer() {
  // sort destination buffer by target locale and then by target function
  coforall targetLocale in destinationLocales {
    on targetLocale {
      GET the bundles from the destination buffer
      for each chunk with the same dstObject, dstKey {
        dstObject.syncOnEnter(dstKey);
        for (dstObject, dstKey, fn, bundle) in chunk {
          fn(unwrapped bundle-args);
        }
        dstObject.syncOnExit(dstKey);
      }
    }
  }
}

If this strategy is successful then sync on becomes common usage for updating a domain or array, and I think it would make sense to completely remove the locking argument from DefaultAssociative and HashedDom/Arr. That solves the problem of needing to know not to take the lock inside of Pairs. (However, we might nonetheless need a transitional strategy.)

Updating a distributed histogram

var DistributedIds: domain( int ) dmapped Hashed(idxType=int);
var Counts:[DistributedIds] int;

forall i in AvailableIds {
  sync on Counts[i] {
    DistributedIds += i;
    Counts[i] += 1;
  }
}

Here the idea is that the sync on would move execution to the locale owning Counts[i] and call the syncOnEnter/syncOnExit methods with i as the key, so that the domain addition and the count increment happen under the same per-row lock.

As above, since the sync on is the last statement in the forall, it can be done in an unordered manner, so the histogram updates can be aggregated and sent out in bulk.

This case is similar to the above (with slight differences in what arguments are bundled into the syncOnFn). The syncOnFn would consist of:

proc syncOnFn(DistributedIds, Counts, i) {
  DistributedIds += i; // no locking needed here
  Counts[i] += 1; // or here
}

Other potential examples

mppf commented 4 years ago

Another related idea is to extend on MyDataStructure to run, say, proc DataStructure.on(...); it would then be relatively straightforward to support such patterns.
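Purely as an illustrative sketch of that idea, assuming blocks could be passed as first-class procedures and that localeFor is a method the data structure provides (both are assumptions), on Pairs[(i,j)] { body } might dispatch to something like:

// hypothetical method invoked by `on Pairs[(i,j)] { body }`
proc DataStructure.on(key, body) {
  on this.localeFor(key) {         // move execution to the owner of `key`
    this.syncOnEnter(key);
    defer { this.syncOnExit(key); }
    body();
  }
}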

mppf commented 3 years ago

@dlongnecke-cray - it would be interesting to include some ideas from this issue as we develop a menu of choices to make when designing a context manager.

vasslitvinov commented 2 years ago

Here is a way to provide this feature using a manage statement, as proposed by @mppf, using the distributed histogram as an example. First off, if the user tries to take a ref of the histogram element, it will be disallowed because the data structure does not support it:

forall idx in inputs {
  Counts[idx] += 1;   // error: use 'manage' to update elements
}

The user needs to use this style:

forall idx in inputs {
  manage Counts(idx) as element {
    element += 1;
  }
}

The compiler queries Counts for the availability of a designated API method. If one is available, it converts the forall loop to aggregation style:

forall idx in inputs with 
  (var agg = Counts.createAggregator(update=lambda(idx,element){element += 1;})) 
{
  agg.add(idx);
}

The aggregator will batch up and send out the updates by the time its task completes, at the latest. The destination locale will perform the updates in batch by applying the updating lambda to each element. If a given idx is not yet in the index set of Counts, it will add that index and default-initialize the corresponding element before applying the lambda to it.
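For concreteness, here is a minimal sketch of what such a per-task aggregator might look like. All names here (the record, applyUpdate, the buffering policy) are hypothetical, and a real aggregator would keep one buffer per destination locale and ship each batch with a single message rather than element-wise:

use List;

record UpdateAggregator {
  var pending: list(int);  // indices buffered by this task
  const cap = 1024;        // flush threshold

  proc ref add(idx: int) {
    pending.pushBack(idx);
    if pending.size >= cap then flush();
  }

  proc ref flush() {
    // a real implementation would sort by owning locale and ship each
    // batch in bulk; applyUpdate stands in for "add the index if absent,
    // then apply the updating lambda to its element"
    for idx in pending do
      Counts.applyUpdate(idx);
    pending.clear();
  }

  proc deinit() {
    flush();  // push out any remaining updates when the task completes
  }
}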

One proposal is to make the pattern manage Counts(idx) as element { ... do something with element ... } be the standard way to operate on an element of a parallel, possibly distributed, data structure. If it occurs at a location other than the last statement of a forall or task-parallel construct, it should add/update the element atomically without aggregation.

A variation on this proposal is to allow a data structure to delay the propagation of updates to remote locales -- and perhaps even to other tasks on the same locale -- until a memory fence is executed in accordance with Chapel's memory consistency model. This would be part of the semantics of the data structure. To implement these semantics, a manage Counts(idx) statement would cause the compiler to obtain a handle from Counts so that a designated API method is invoked on this handle when the handle's task performs a memory fence. This handle can be created at the start of the task or at the declaration of Counts, whichever comes later. In this case the manage statement can perform aggregation as well, except this time, instead of the updating lambda, it will store the new value for each "managed" index.

[Correction to the previous paragraph 3/24] The complexity of obtaining a handle may not be needed. A notification of a memory fence could go directly to Counts via an API method.

vasslitvinov commented 2 years ago

An alternative way to perform bulk updates is to use a designated method passing it a lambda:

Counts.bulkUpdate(inputs, lambda(idx,element){element += 1;});
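As a sketch of how such a bulkUpdate might be implemented inside the data structure (HashedArr, ownerOf, addIfAbsent, and localElement are all hypothetical names), each locale could walk the inputs and apply the updating lambda to the indices it owns:

proc HashedArr.bulkUpdate(inputs, updater) {
  coforall loc in Locales do on loc {
    for idx in inputs do
      if this.ownerOf(idx) == loc {
        this.addIfAbsent(idx);                 // default-initialize if new
        updater(idx, this.localElement(idx));  // element updated in place
      }
  }
}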

The advantages of these approaches are:

mppf commented 2 years ago

A couple of other points:

A variation on this proposal is to allow a data structure to delay the propagation of updates to remote locales

I call this "data structure consistency". Another element of it is that I expect that if you, say, add x to a distributed set, and then query from that same task if x is in the set, it would say "yes". But the set element might belong on a remote locale and it would just check the queued-up operations to know that x was added. In this way it is similar to --cache-remote.
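To illustrate the intended semantics (MyDistSet and its methods are hypothetical):

MyDistSet.add(x);               // may only be queued on this task so far
assert(MyDistSet.contains(x));  // must still answer true: contains()
                                // consults this task's queued operations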

In any case, I think that the understandability of the programs is still improved by having it be clear when the task of updating an element is completed. In particular, if we imagine a "data structure consistency" approach that allows references, one could write

ref x = MyDistributedMap[i];
fence();
x += 1;

which I think is a challenging case to handle correctly in the implementation - meaning I think getting all of this right would be easier if that pattern is simply not possible.

vasslitvinov commented 2 years ago

If we go with the sync syntax, how does it know that (i,j) is the index that the lock is being held for in the body of the sync? For example:

sync on Pairs(i,j) {
  Pairs += (i,j);
  Pairs.add(i,j);
  Pairs(i,j) = k; // if this were meaningful
  Pairs.remove(i,j);
  Pairs.doWhoKnowsWhat(a,i,j,b);
}

My proposal: