chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Enable SPMD programming within a forall loop #14405

Open mppf opened 4 years ago

mppf commented 4 years ago

This issue proposes that the Chapel language / module code be adjusted to allow for certain SPMD idioms within forall loops. This proposal is intended to meet both the needs of programmers wishing to work in a more SPMD style and also programmers wishing to use Chapel to generate GPU kernel code.

See also (to do with GPU support):

Motivation: What is wrong with SPMD programming in Chapel today?

Chapel already enables SPMD programming using lower-level constructs, for example:

const numTasksTotal = numLocales * tasksPerLocale;
coforall loc in Locales {
  on loc {
    coforall taskId in 0..#tasksPerLocale {
      const rank = here.id * tasksPerLocale + taskId;
      // SPMD kernel here
    }
  }
}

The main problem with this idiom is that it does not mix well with other Chapel features. In particular, forall loops are an important feature for parallel iteration, where the iteration strategy might be defined by an array. A forall loop will do the right thing whether the array being iterated over is Cyclic or Block distributed, as sketched below.
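
For instance, here is a minimal sketch of that point (a hedged example assuming the standard BlockDist and CyclicDist modules; the problem size is arbitrary):

use BlockDist, CyclicDist;

const D = {1..1000};
var BlockArr:  [D dmapped Block(boundingBox=D)] uint;
var CyclicArr: [D dmapped Cyclic(startIdx=D.low)] uint;

// the same forall does the right thing for either distribution:
forall x in BlockArr do x += 1;   // iterations run where the Block distribution places them
forall x in CyclicArr do x += 1;  // iterations run where the Cyclic distribution places them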

Since the existing SPMD idiom does not mix well with forall, programmers wishing to use an SPMD mental model (even momentarily) need to write their entire loop nests to use the coforall/on pattern and cannot use forall loops that create work across all of the locales.

Sometimes an SPMD formulation is actually an optimization relative to a higher-level, purely data-parallel way of expressing an algorithm. Why should programmers trying to add such optimizations have to give up the benefits of a forall, such as following the data distribution of an existing parallel iterator or data structure?

Example - Removing Zero Elements in an Array

Suppose that a Chapel programmer has a distributed array. Some of the elements are equal to zero and some are not. The programmer wishes to create a distributed array that stores only the nonzero elements.

This can be done with data-parallel scans over boolean arrays. It can also be done in an SPMD fashion, if the programmer can assume that the array is Block distributed.

An example of a data parallel version of code along these lines is uniqueFromSorted in Arkouda: https://github.com/mhmerrill/arkouda/blob/c1ae2a200c0089a188fd570d6d1aa0d320527c7c/src/Unique.chpl#L485

The data parallel version is something along the lines of the following (note: this is an untested sketch):

// Suppose Input is a distributed array of uints with 0s to remove

// 1 for each nonzero element, 0 otherwise
var truth: [Input.domain] int = (Input != 0):int;
// inclusive +scan to compute each nonzero element's destination position...
var iv: [Input.domain] int = (+ scan truth);
// compute how many nonzero elements there are
const n = iv[Input.domain.high];
// ...then shift to an exclusive scan so the positions start from 0
iv -= truth;
var segs: [0..#n] int;    // could/should be block distributed
var Output: [0..#n] uint; // could/should be block distributed
// Store the nonzero element positions in segs (could use an aggregator)
forall i in Input.domain {
  if Input[i] != 0 {
    segs[iv[i]] = i;
  }
}
// Pull out the elements at the positions stored in segs
[i in segs.domain] Output[i] = Input[segs[i]];

The SPMD version would be along the lines of the following; note that, compared with the data parallel version, it uses coforall/on throughout instead of forall.

// Suppose Input is a distributed array of uints with 0s to remove

// Step 1: Count the number of nonzeros on each task
const numTasksTotal = numLocales * tasksPerLocale;
var numNonzeroPerTask: [0..#numTasksTotal] int; // could/should be Block distributed
coforall (loc, locid) in zip(Locales, 0..) {
  on loc {
    coforall t in 0..#tasksPerLocale {
      var mychunk = computeBlock(Input.domain, locid*tasksPerLocale + t);
      var count = 0;
      for i in mychunk {
        if Input[i] != 0 then count += 1;
      }
      numNonzeroPerTask[locid*tasksPerLocale + t] = count;
    }
  }
}

// Step 2: Create the output and compute the starting offsets for each task within it
var StartOffsets = + scan numNonzeroPerTask;
var n = StartOffsets[numTasksTotal-1];
// shift the inclusive scan to an exclusive one so the offsets start from 0
StartOffsets -= numNonzeroPerTask;
var Output: [0..#n] uint; // could/should be block distributed
coforall (loc, locid) in zip(Locales, 0..) {
  on loc {
    coforall t in 0..#tasksPerLocale {
      var mychunk = computeBlock(Input.domain, locid*tasksPerLocale + t);
      var outputIndex = StartOffsets[locid*tasksPerLocale + t];
      for i in mychunk {
        if Input[i] != 0 {
          Output[outputIndex] = Input[i];
          outputIndex += 1;
        }
      }
    }
  }
}

Proposal

The straw-person proposal here is that, in implementing a forall loop, the compiler and module code would arrange to store, in task-local storage, the locale ID and task ID of each task created for the loop. By "task ID" we mean just the 0-based index among the tasks created, similar to taskId in coforall taskId in 0..#tasksPerLocale. Code inside a forall loop could then call a function to retrieve these IDs and to compute a global task ID for the loop. It is important to this idea that a later forall over the same data create the same mapping of iterations to global task IDs.
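
As a rough sketch of what such a function might compute (the names here are hypothetical placeholders, not an existing API):

config const tasksPerLocale = here.maxTaskPar;

// hypothetical placeholder: the real implementation would read the locale ID and
// 0-based per-locale task ID recorded in task-local storage when the forall task
// was created
proc currentForallTaskInfo(): (int, int) {
  return (here.id, 0);
}

proc globalTaskId(): int {
  const (locId, localTaskId) = currentForallTaskInfo();
  // dense, 0-based global ID, analogous to the coforall/on idiom's rank computation
  return locId * tasksPerLocale + localTaskId;
}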

This allows for a relatively simple, efficient, and adaptable solution to the problem of removing zero elements in an array. For example:

// Suppose Input is a distributed array of uints with 0s to remove

// Step 1: Count the number of nonzeros on each task
const numTasksTotal = numLocales * tasksPerLocale;
var numNonzeroPerTask: [0..#numTasksTotal] int; // could/should be Block distributed
forall x in Input with (var myGlobalTaskId = globalTaskId()) {
  if x != 0 {
    numNonzeroPerTask[myGlobalTaskId] += 1;
  }
}

// Step 2: Create the output and compute the starting offsets for each task within it
var StartOffsets = + scan numNonzeroPerTask;
var n = StartOffsets[numTasksTotal-1];
// shift the inclusive scan to an exclusive one so the offsets start from 0
StartOffsets -= numNonzeroPerTask;
var Output: [0..#n] uint;
forall x in Input with (var outputIndex = StartOffsets[globalTaskId()]) {
  if x != 0 {
    Output[outputIndex] = x;
    outputIndex += 1; // TODO: something more complicated if Input is Cyclic?
  }
}

By enabling the forall loop body to use "rank" information, this proposal supports SPMD-style optimizations without requiring the programmer to completely move away from forall.

Barriers

A related idea to enable SPMD programming within a forall loop is to allow barriers in the forall loop body. These barriers can only be implemented efficiently if the rank of tasks within the forall loop is known.
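
For reference, this is how a barrier pairs with the explicit coforall/on idiom today, sketched with the Barriers module as it existed around the time of this issue (the phase bodies are placeholders):

use Barriers;

const numTasksTotal = numLocales * tasksPerLocale;
var b = new Barrier(numTasksTotal);
coforall loc in Locales do on loc {
  coforall taskId in 0..#tasksPerLocale {
    // ... phase 1 of the SPMD kernel ...
    b.barrier();  // wait until all numTasksTotal tasks reach this point
    // ... phase 2 of the SPMD kernel ...
  }
}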

gbtitus commented 4 years ago

I gave a thumbs-up, but I would have liked to tilt that thumb just off vertical. For one thing, the function name globalTaskId() seems presumptive, implying that no other kind of global task ID than this one will ever be wanted. I'm not convinced that's true. Maybe SPMDTaskId() instead? Or create an SPMD standard module with a taskId() function and put other things such as an SPMD-specialized barrier in it in the future?

Is an on-stmt that crosses node boundaries allowed within an SPMD loop nest? If so, what does SPMDTaskId() return if called from within such an on-stmt? (I think we want it to return something whose node part is that of the original SPMD task, not the one the on-stmt went to.)

How will this adapt to more complicated node architectures? Currently, with the numa locale model, a forall-stmt turns into an outer coforall-stmt across nodes, a middle coforall-stmt across NUMA sublocales, and an inner coforall-stmt across PUs. Future architectures may reach even more levels, as hardware architects deal with scale-related fanin/fanout problems as they have in the past, by introducing hierarchy. What about heterogeneity, as in mixed CPU+GPU compute nodes? That would seem to lead to "holes" in the numbering scheme. Do those holes have implications for how these IDs are used? (I don't think there are any serious ones, but I could be wrong.) Extending the design to these more complicated node architectures seems possible, maybe even easy, but it would be good to sketch out our expectations.

mppf commented 4 years ago

@gbtitus - thanks for your thoughts!

For one thing, the function name globalTaskId() seems presumptive

At this point I just want to discuss the general direction rather than the exact names or terminology. After all, it is a straw-person proposal rather than a fully developed one. But anyway, I'm not particularly attached to the name or scoping as proposed, and something like SPMD.taskID() seems like a strict improvement.

Is an on-stmt that crosses node boundaries allowed within an SPMD loop nest? If so, what does SPMDTaskId() return if called from within such an on-stmt? (I think we want it to return something whose node part is that of the original SPMD task, not the one the on-stmt went to.)

I agree.

That would seem to lead to "holes" in the numbering scheme

I think there are harder problems with hierarchical locales / heterogeneity, but I would expect that the ID space has no holes in it. That is, if the iterator implementing the forall is creating 2 tasks for sockets (say) and, within each of those, 4 tasks for cores, then we should just have tasks numbered 0..7 for the "global task ID" part.

To extend to more complex things I'd expect we'd add other things that can be queried - e.g. "My Socket ID", "My Core ID", "My Vector Lane ID". These all would refer to numberings from the forall loop's perspective (and not necessarily match any hardware numbering).

vasslitvinov commented 4 years ago

Here are a couple of requirements that come to my mind.

vasslitvinov commented 4 years ago

P.S. I think that the "locale ID" for SPMD support is specific to the parallel iterator and is data-dependent, so it is possibly/likely different from here.id.
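
For instance (a hedged sketch using BlockDist's targetLocales argument), an array distributed over a subset of the locales gives the parallel iterator a "locale 0" whose here.id is not 0:

use BlockDist;

// skip locale 0, so the iterator's first target locale has here.id == 1
const targetLocs = Locales[1..];
const D = {1..1000} dmapped Block(boundingBox={1..1000}, targetLocales=targetLocs);
var A: [D] uint;
// a forall over A (or D) places its top-level tasks on Locales[1..] only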

gbtitus commented 4 years ago

... if the iterator implementing the forall is creating 2 tasks for sockets (say) and, within each of those, 4 tasks for cores, then we should just have tasks numbered 0..7 for the "global task ID" part.

Yeah, that makes more sense than using hardware numbering. So for the heterogeneous case, if we had a 2-wide cobegin across the CPU and GPU of a node and then within that, a 4-wide (say) coforall across the PUs of the CPU and a 64-wide coforall across the streams of the GPU, then we'd have tasks numbered 0..67 for the on-node part of the ID.
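
A hedged sketch of that numbering; the only point here is the on-node ID arithmetic:

// kernel() is a hypothetical placeholder for per-task SPMD work
proc kernel(onNodeId: int) { }

cobegin {
  coforall pu in 0..#4 do        // CPU tasks get on-node IDs 0..3
    kernel(pu);
  coforall stream in 0..#64 do   // GPU tasks get on-node IDs 4..67
    kernel(4 + stream);
}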

To extend to more complex things I'd expect we'd add other things that can be queried - e.g. "My Socket ID", "My Core ID", "My Vector Lane ID". These all would refer to numberings from the forall loop's perspective (and not necessarily match any hardware numbering).

I don't think we can create a hole-free numbering in a heterogeneous situation by looking only at "my" values. But I also don't think holes are necessarily a bad thing, depending on the use model for these IDs. In particular, if a task only ever needs to refer to its own element of any array indexed by these IDs and nobody else's then holes are no big deal, just a small amount of wasted space.

mppf commented 4 years ago

Do we want the locale/task ID to be bound to a given forall statement? I.e. are they available within a function that may or may not be invoked from the forall body? (other than passing them into that function explicitly)

I think so, yes

If we want consistent locale/task IDs across multiple forall - when those foralls are over the same data structure / iterator - then they probably should be guided by the parallel iterator and its structure of task creation. Analogously to how forall intents work.

I think that would be a reasonable approach.

Do we want a single namespace for all task IDs on a given locale? ex. uints, consecutive or not? Or do we want the ID to be hierarchical, reflecting "how we got here", i.e. the structure of subtask creation? ex. "we are in task 3 of 5, created by task 2 of 7, created on locale 2 of 3". Need to allow for the case where the total number of subtasks that will be created by a given task is unknown when the first few subtasks are being created, i.e. "task 2 of <unknown>".

I'd like a single namespace for all task IDs to be available; however, I imagine that it would be computed from other information. I think other kinds of information would be good to enable, but there are challenges in full generality.

We have considered in the past the notion of "sublocales". For example, a sublocale would represent each subtask or each memory. This makes me think of leveraging here in some way. Perhaps a method on here. Or something fancier?

We could consider here.numTasksThisForall() or something, but this seems to have more to do with the "naming" of it than anything else.

cassella commented 4 years ago

It is important to this idea that a later forall over the same data create the same mapping of iterations to global task IDs.

Is there any concern about cases where the two foralls might create different numbers of tasks? (E.g., because there's some other task running that might exit or create more tasks in the meantime.) If that's considered user error, is there some way the runtime could detect it and report the problem? (All I can think of is that the first forall "returns" a token that you have to pass to the later ones, which the runtime can use to check. Two unrelated pairs of foralls could correctly have different numbers of tasks, e.g. if they're over arrays with different numbers of locales.)

vasslitvinov commented 4 years ago

Nice catch @cassella - the number of tasks may differ because of concurrent tasks, and also because it can be data-dependent, ex. it may change if the underlying domain got resized between the foralls.

cassella commented 4 years ago

We could consider here.numTasksThisForall() or something,

What would that return if there are nested forall loops?

cassella commented 4 years ago

forall x in Input with (var myGlobalTaskId = globalTaskId()) {

Hmm. Maybe abstract away from a taskid that's guaranteed to match the tasks created. You want an index into your numNonzeroPerTask[] array that no other task is using, and in ideal conditions each task gets one id. And you want each iterand to be iterated over with the same id in each forall loop.

My thought is to create something like a, err, "iteration group ID" (IGID). The space of IDs would correspond to the set of task ID's you'd get under ideal conditions -- no other tasks executing, and each index in Input would have the association with the IGID that it would get in those conditions. In less than ideal conditions, the forall loop would create fewer tasks, and so each task's iterations would include more than one IGID. But each Input index would be iterated over with the same IGID, no matter how many tasks you end up with.

In order to size (and distribute) numNonzeroPerTask, you'd have some way of querying the thing being iterated over:

const IGIDSpace = getIGIDSpace(Input);  // a domain of IGIDs
var numNonzeroPerTask: [IGIDSpace] int;
forall (x,igid) in zip(Input, IGIDSpace.IGIDDistributor()) {
  if x != 0 {
    numNonzeroPerTask[igid] += 1;
  }
}

...

forall (x,igid) in zip(Input, IGIDSpace.IGIDDistributor()) {
  var outputIndex = StartOffsets[igid]; // Can't rely on "task"-level behavior any more
  if x != 0 {
    Output[outputIndex] = x;
    StartOffsets[igid] += 1;
  }
}

I don't think the follower iterator is in a position to do everything I'm relying on it doing. The mapping from an index in Input to its igid has to be fixed. So it really needs the leader to not change tasks in the middle of one igid section.

Maybe the IGIDSpace object becomes what you iterate over. Its leader would call the Input leader, intercept each tuple of indices yielded, and rewrite them so that each igid is owned by a single task. (Assertion: if the Input leader gives two tasks elements with the same igid, all elements of that igid may be assigned to either of the tasks, correctness-wise.)

I don't think this requires that the IGIDSpace leader know how many tasks the Input leader will create. The only time the IGIDSpace leader would need to know that is when the Input leader has split a section of igids. In which case there must be another task that has the other portion. The IGIDSpace leader could just know that if the indices it gets includes the first element of an igid section, it owns the whole section (and all other tasks sharing the section know they own none of it), and rewrite its tuple that way.

This does enforce a significant load imbalance if there is one other running task the whole time -- N blocks of work will be divided lumpily among N-1 tasks, so one task will have 2 blocks and the rest 1. Maybe the initial getIGIDSpace() partitions the igids according to how they'd be divided in the current state of affairs, not the ideal state. So if the external tasks remain fixed through both loops, everything divides up naturally.

[edit 11/15 oops, finish this sentence:] Or maybe it creates a lot more igids than natural parallelism would allow, so that such variations lead to less lumpy divisions when the amount of parallelism locally doesn't evenly divide the number of igids.

I don't think this would cope very well with Input.domain being resized, or with being invoked over a dynamic() sort of iter.

mppf commented 4 years ago

the number of tasks may differ because of concurrent tasks,

@ronawho and I talked about simply disabling the data-parallel-loop-depends-on-running-tasks behavior for these loops. Other than that, I don't see any way that foralls over the same iterators could produce different numbers of tasks.

I think the "igids" idea above, which I'd call "notional tasks" or something like that, is interesting. Just to say it in different words, we could have the "id within forall" concept sometimes have multiple ids that actually run on the same task, in the event that helps solve other problems (such as varying the actual number of tasks based on how many tasks are running).

mppf commented 4 years ago

the number of tasks may differ because of concurrent tasks,

simply disabling the data-parallel-loop-depends-on-running-tasks behavior for these loops. Other than that, I don't see any way that foralls over the same iterators could produce different numbers of tasks.

I think this strategy of limiting the number of tasks by the number of running tasks would look very different if we had work stealing of not-yet-started tasks (https://github.com/Cray/chapel-private/issues/1331 is the TODO issue about work-stealing). If we had work stealing, I don't think we'd need to use the number of running tasks at all.