chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.79k stars 421 forks source link

forall task IDs #16405

Open mppf opened 4 years ago

mppf commented 4 years ago

This issue proposes a mechanism whereby the locale, task, and vector lane position within a forall loop can be queried within that loop or within a follower iterator implementing that loop. The main question here is whether or not this general direction is worth pursuing.

This issue builds upon #14405, CHIP 17, and CHIP 22 and addresses a feature request from this email

Background

Generally speaking a standalone iterator might look like this:

iter myiter(... tag=standalone...) {
  coforall loc in targetLocales {
    on loc {
      const numChunks = _computeNumChunks(...);
      coforall t in 1..numChunks {
        const part = computePartforChunk(...);
        for i in part {
          yield i;
         }
      }
    }
}

This iterator has 3 portions:

  1. Divide work among locales (coforall loc on loc)
  2. Divide work among tasks (coforall t in 1..numChunks)
  3. Indicate the work per task (for i in part)

We are supposing that the 3rd portion also can control vectorization (but see #16404).

In order to enable a SIMT programming model, we would like code invoking this forall to be able to create a variable per-vector-lane and to perform reductions across vector lanes. It is already possible to create a per-task variable:

forall i in myiter() with (var taskLocalVar=0) { ... }

Or to reduce across the tasks:

forall i in myiter() with (+ reduce x) { ... }

Issue ##16403 proposes that these mechanisms be extended to create per-vector-lane variables and per-task variables.

Proposal

In order to support SIMT programming or SPMD programming it is also necessary to allow the author of a forall loop to get task/vector IDs. #14405 talks about one use case for that and CHIP 17 has other examples. In GPU programming this kind of functionality is necessary to get the threads in a warp to combine efforts in order to move data around efficiently. For example, to perform matrix transpose, one might want to copy a block of a matrix to nearby memory; then barrier; and then have the threads collectively store the transpose into the correct memory location.

Note that #16406 discusses the barrier specifically.

Specifically, this issue proposes that each forall loop create and track several numbers in task-local storage:

In addition, forall loops can store details of the iteration space in task-local storage.

These numbers will make sense for data-parallel forall loops - not all forall loops will have meaningful values for these -just as they won't all vectorize.

There are quite a few different possible constructions for parallel iterators. Tasks might be created in sublocales. Perhaps no locales are iterated over. Perhaps tasks are created during iteration.

This iterator that creates tasks in a nested way:

  coforall t1 in chunk1 {
     coforall t2 in chunk2 {
     }
  }

This one creates tasks for sublocales:

  coforall loc in targetLocales {
     on loc {
       coforall subloc in loc.getChildren() {
         on subloc {
            .... yield ...;
         }
      }
   }

At this point we have 2 options:

  1. Create task ids for each level of coforall in which the tasks are created. This has the drawback of creating arbitrary heirarchy in the task ID information that might be difficult for the author of a forall to predict. For example, when using a different locale model, an iterator might create more levels of tasks. The forall might expect something in particular for the type information of the task IDs that does not apply in that configuration.
  2. Flatten the task ids in some manner. This has the drawback of potentially limiting what the user can do in terms of working across different levels of the hierarchy. However at least it is more possible to create code that will work on other configurations. The question is, how many levels of IDs do we consider?

I'm pretty sure we need to know the current "Vector lane" or, in GPU terms, the current "thread within the block".

Matrix Transpose Example

Here is an example of a matrix transpose (from https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/ ).

// matrix transpose setup
const Dom = {1..n, 1..n};
var Input:[Dom] int;
var Output:[Dom] int;
param t = 32; // tile size dimension
// naive matrix transpose
forall (i, j) in Dom {
  Output[j,i] = Input[i,j];
}

This naive matrix transpose won't have good performance on the GPU due to order of memory operations. What if we want to copy the matrix to some scratch storage first in order to do a tiled transpose?

// tiled matrix transpose
forall (i, j) in {1..n by t, 1..n by t} {
  var tile:[0..#t, 0..#t] int;
  tile = Input[i..#t, j..#t];
  Output[j..#t, i..#t] = tile;
}

But, that has the drawback that the forall loop needed to be adjusted. Can we use an SPMD/SIMT idea to perform the tiled copy without adjusting the bounds of the forall? Because each loop iteration can still be responsible for copying one element to the transpose destination, right? (Also, one might want to have a forall over the elements doing multiple things; like perhaps computing on the result of the transpose after it is done).

// tiled matrix transpose
forall (i, j) in Dom {
  const (idX, idY) = Forall.getCurrentVectorLane();
  const (nX, nY) =  = Forall.getNumberOfVectorLanes();
  const tileStartX = i - idX;
  const tileStartY = j - idY;
  "GPU block shared" var tile:[0..#nX, 0..#nY] int; // notional, but somehow, declare one of these per GPU block
  // store into the tile; note that j represents the fastest
  // changing part of the address so we want those reads
  // to be in the least dimension
  tile[idX, idY] = Input[i,j];
  // barrier all vector lanes
  Forall.barrierVectorLanes();
  // store the transpose of the tile
  // but do so in a way where the fastest changing part of the
  // address is the fastest changing part of the loop
  // (so the second dimension of the access is determined by idY)
  Output[tileStartY + idX, tileStartX + idY] = tile[idX, idY];
}

This example only used the last level of the IDs (for vector lanes / GPU block information). Could we do a similar thing for a distributed matrix transpose?

// tiled matrix transpose assuming Dom is distributed
forall (i, j) in Dom {
  const (idX, idY) = Forall.getTaskWithinLocale();
  const (nX, nY) =  = Forall.getNumberOfTasksWithinLocale();
  const tileStartX = i - idX;
  const tileStartY = j - idY;
  "locale shared" var tile:[0..#nX, 0..#nY] int; // notional, but somehow, declare one of these per locale
  // just one task within the locale does a bulk GET for the data,
  // instead of doing GETs for each element
  if  (idX, idY) == (0,0) {
    // do a GET to read into the tile
    tile = Input[tileStartX..#nX, tileStartY..#nY];
    // store the transpose of the tile
    Output[j..#t, i..#t] = tile;
  }
  // barrier the tasks within the locale to make sure the GETs are done
  Forall.barrierTasksWithinLocale();
  // Compute further on the output
}

Open Questions

bradcray commented 3 years ago

While I understand the desire for these types of values, these types of discussions always worry me a bit in that (a) they seem to assume or impose some sort of structure on parallel iterators or at least "well-behaved" (by some definition) parallel iterators; (b) as the issue points out, there may be varying levels of depth within a loop or that one might care about.

At a very, very high level, where my head goes here is that for cases that want to know such "task i of n", "locale j of p" values, is there a way that they could opt into getting that information from the loop itself via a utility iterator. Such that rather than having some built-in identifier that I'd refer to to make such queries (like Forall above), I'd have to opt into it by using such a utility iterator that would yield those values back as part of its interface, along with the original index. So where I might have written forall (i,j) in Dom, if I wanted to know the task / locale mapping I might say something like forall ((i,j), (task, taskSpace), (loc, locSpace)) in queryLocTaskIter(Dom), where taskSpace might be a range from 0..<numTasks and locSpace might be something like a distributed domain's 2D targetLocales domain (so task would be an integer from taskSpace and loc would be a 2D index from locSpace).

I think this has some pretty significant challenges (namely, how do we implement queryLocTaskIter() such that it interacts with the default iterator of Dom?), but those challenges seem similar to the ones we'd have to face in any scheme. Meanwhile, I like that (1) it makes it clearer that these values are being generated/yielded by the iterator and (2) that it makes the invocation of the utility iterator itself something of an opt-in for what the loop expects to depend upon these values (rather than relying on compiler inspection of the loop body to determine the same thing). The main thing I'm not crazy about from a user/usability perspective is that it yields taskSpace and locSpace on each iteration even though they'll be invariant for the whole loop (typically, at least... I suppose if we had a dynamic iterator that added or removed tasks as it ran, taskSpace might change over time...?).

mppf commented 3 years ago

At a very, very high level, where my head goes here is that for cases that want to know such "task i of n", "locale j of p" values, is there a way that they could opt into getting that information from the loop itself via a utility iterator.

In my view, the main drawback of that approach is that it's required to modify the forall loop in order to gather the information (rather than, say, being able to find the information from a called function).

Anyway if we did go in this direction it would make more sense to me to have the iterator wrapper yield some sort of object bundling up the information. (In some proposals I have called this loopConfiguration but here it would be slightly different because it would need to include the current task number - which the other proposals don't necessarily include). That way, it's easy to pass to a subroutine if that is desired.

mppf commented 3 years ago

Just noting that there is some philosophical similarity between this idea and the idea of iterators with "shape" that are currently used to infer the domain in things like var a = myIter() and to convert to an array in parallel.

e-kayrakli commented 3 years ago

@bradcray

At a very, very high level, where my head goes here is that for cases that want to know such "task i of n", "locale j of p" values, is there a way that they could opt into getting that information from the loop itself via a utility iterator. Such that rather than having some built-in identifier that I'd refer to to make such queries (like Forall above), I'd have to opt into it by using such a utility iterator that would yield those values back as part of its interface, along with the original index.

What would you think if we were not using a built-in but a user variable that's used as an intent:

forall (i,j) in Dom with (var forallMagic = new ForallConfig()) { /* bla */ }

where the compiler can initialize forallMagic. Arguably it is still more compiler-driven than what you're suggesting, but maybe less opaque compared to a built-in (what's the builtin going to look like if I had nested foralls and wanted to know about the outer forall from the innermost block)?

A more sci-fi flavor of that could use the config keyword:

forall (i,j) in Dom with (config forallMagic) { /* bla */ }
bradcray commented 3 years ago

At a high level, using an intent-based approach passes the sniff test for me, like the utility iterator approach, in that it relies less on implicit / not visible in the sources variables and links the query to the loop structure being used.

e-kayrakli commented 3 years ago

@mppf -- in your original post and subsequent comments, you sound like you don't like changing the forall "header". Is there any reason for that from performance/functionality or is it more driven by personal taste?

For example, in the tiled matrix transpose example, the first code that doesn't have any of the features that you propose:

// tiled matrix transpose
forall (i, j) in {1..n by t, 1..n by t} {
  var tile:[0..#t, 0..#t] int;
  tile = Input[i..#t, j..#t];
  Output[j..#t, i..#t] = tile;
}

feels pretty clean to me even though you'd have to change the forall iterand. This arguably has a lot more overhead because of slicing. Are you implying that we won't be able to get rid of them and a "lower-level" approach as you propose here would be necessary?

mppf commented 3 years ago

@e-kayrakli - it has to do with the notional idea in Chapel that we can have domain scientists writing "science" and then performance tuning people adjusting how this code runs. In particular we have the idea today that we can write forall loops and then later on adjust how they are running by changing distributions. But the trouble with changing the "forall header" say to encode the tile size prevents us from making use of distribution-like ideas. For example, if the matrix transpose were distributed, we'd want to run the forall across a bunch of locales, but now in the tiled version it is over a new domain. And, now the fact that we needed this tiling at all is encoded in what I would imagine is the "science" part of the program where I think it belongs in the "performance / machine tuning / cs" part

I see this as related to #14405 in that the reason I wanted SPMD programming + forall is so that I can keep a certain amount of flexibility (forall can adapt to different distributions etc) while enabling more optimal expression of certain algorithms (like, removing zeros on that issue).

e-kayrakli commented 3 years ago

@e-kayrakli - it has to do with the notional idea in Chapel that we can have domain scientists writing "science" and then performance tuning people adjusting how this code runs. In particular we have the idea today that we can write forall loops and then later on adjust how they are running by changing distributions. But the trouble with changing the "forall header" say to encode the tile size prevents us from making use of distribution-like ideas. For example, if the matrix transpose were distributed, we'd want to run the forall across a bunch of locales, but now in the tiled version it is over a new domain. And, now the fact that we needed this tiling at all is encoded in what I would imagine is the "science" part of the program where I think it belongs in the "performance / machine tuning / cs" part

I see. That's a fair point. But in this particular example we can think of the by operator being applicable to domains and use that to alleviate some of the issues that you mention, right? So the forall header could look like forall (i,j) in Dom by t or Dom by (t,t)? While saying that I recognize that your high-level concern can still be true in other cases and cannot be addressed by other language changes like this.

mppf commented 3 years ago

Yeah. Adding by for domains has other issues but I think it could apply to this specific case. However it still has the (IMO undesireable) property that the forall loop is indicating the tile size and the body needs to know this tile size. Without things like task IDs/ ability to query block sizes / etc -- the forall loop needs to be modified in the header and the loop body in a coordinated manner to work.

e-kayrakli commented 2 years ago

Based on the discussion above, I think we can introduce some sort of a "context" variable. My current thoughts are below. There's some more thinking to be done, but I wanted to check if I am going in the wrong direction.

A high-level look at what we need

forall (i,j) in Dom do
  <body>

is typically equivalent to some combination of coforall/on/foreach/for statements:

coforall locDom in locDoms do
  on locDom do
    coforall t in tasks do
      foreach (i,j) in myChunk do
        yield (i,j)  // put <body> here

What we need here is a way for <body> to make use of the context of the statements in the outer scope of the loop nest. This context can help answer queries like

Sidebar about GPU block size etc The flow of information I describe above is 1-way -- outer loops' context needs to be trickled down to the inner ones. But when it comes to GPU execution, we would also like to give user the control over the block/grid size of kernel launches. That would mean having some context information from the user's loop being trickled up to the outer ones.

A bottom-up approach

What if there were a special context object that has special semantics when being passed to coforalls and other constructs? Think of this type as having a unique default task intent. This context object can be represented with some sort of a linked structure and can potentially look like:

class Context {
  var inductionVar;
  var iterand;
  var localeSpecificConfig: LocaleSpecConfig;

  var outer: Context?;
  var inner: Context?;
}

record LocaleSpecConfig {  // A GPU specific thing. We can imagine this being a union
                           // where for GPU locales we have GPU-specific info like this
                           // otherwise, they could be vectorization-specific info
  var blockDim: 3*int;
  var blockIdx: 3*int;
  var gridDim: 3*int;
  var gridIdx: 3*int;
}

Going back to the initial example, the sketch could look something like this:

forall ((i,j), context) in Dom do
  <body>

is typically equivalent to some combination of coforall/on/foreach/for statements:

var context: Context;
coforall locDom in locDoms do
  on locDom do
    coforall t in tasks do
      foreach (i,j) in myChunk do
        yield ((i,j), context)  // `context` is incrementally built as it passes coforall/foreach boundaries

Where the implications are:

Sugaring it up

Iterator implementation

I think the iterator side of the story does not need much sugaring up. Parallel iterators are advanced concepts and the example right above is not too bad for me. We can think of making things more explicit:

var context: Context;
coforall locDom in locDoms with (var context = context.newLevel()) do
  on locDom do
    coforall t in tasks (var context = context.newLevel()) do
      foreach (i,j) in myChunk (var context = context.newLevel()) do
        yield ((i,j), context)  // put <body> here

We can also create a new config intent (or context intent, but I don't like that too much).

Things could be even more explicit and custom. I think we can give the iterator implementor as much power as we can at the cost of increased verbosity. There could be a multiresolution-y spectrum there, as well.

User code

I don't like

forall ((i,j), context) in Dom do
  <body>

I think context should be obtained through different means. Some options that I can think of:

forall (i,j) in Dom with (config context) do   // where we grab the context implicitly from the yielded thing
forall (i,j) in Dom as myForall do   // we name the loop `myForall`. Under the hood `myForall` is the context
label myForall forall (i,j) in Dom do   // put more meaning onto the `label` statement as we have today
forall (i,j) as myContext in Dom do  // we name the induction variable. A better fit for what we mean. Not a great look&feel

More power for SPMD programming using foralls

forall (i,j) in Dom as myForall {
  var locArr: [{1..n} dmapped myForall.localeContext] int; // a "collective"/shared allocation amongst
                                                           // tasks of this locale

}
forall (i,j) in Dom as myForall {
  // ....

  myForall.localeContext.barrier();   // barrier only local tasks

  // ....

}

I think if we have more clear direction for the items here, we can come up with forall based implementations of tiled matrix multiply, 2D stencils with bulk halo transfer etc.


"Live" discussion updates

I'll add some ideas that came up in the following discussion here to avoid them being lost in the potential wall of text.

mppf commented 2 years ago

The overall direction here does align with my thinking (although I wonder if the interaction with iterators is fundamental or an implementation strategy).

record LocaleSpecConfig

I'm not opposed to creating room for locale-specific things here but I don't think we need it just for block and grid sizes.

But when it comes to GPU execution, we would also like to give user the control over the block/grid size of kernel launches. That would mean having some context information from the user's loop being trickled up to the outer ones.

I think of it more as information from the loop running a parallel iterator is passed in to the iterator.

forall (i,j) in Dom with (config context) do   // where we grab the context implicitly from the yielded thing
forall (i,j) in Dom as myForall do   // we name the loop `myForall`. Under the hood `myForall` is the context
label myForall forall (i,j) in Dom do   // put more meaning onto the `label` statement as we have today
forall (i,j) as myContext in Dom do  // we name the induction variable. A better fit for what we mean. Not a great look&feel

Of these, forall (i,j) in Dom with (config context) do is my favorite, but I'm not sure I love any of them. The label one is a bit weird but I like that it makes it clear we are naming the loop rather than the domain.

For giving the user control of the grid/block size for a particular forall in order to support GPU execution - we will need this syntactical mechanism to support not only naming the loop information but also providing input to it. We could do that with a separate mechanism (e.g. forall (i,j) in Dom.iterate(gridSize=...) with (config context) do) or we could try to bring them together (e.g. forall (i,j) in Dom with (config context(gridSize=...)) do

e-kayrakli commented 2 years ago

The overall direction here does align with my thinking (although I wonder if the interaction with iterators is fundamental or an implementation strategy).

Yeah, I see what you are saying. I was thinking of this feature as having two interfaces: one facing the application implementer, another facing the iterator implementer. I personally found that it helps creating a full-er framework of the feature mentally.

I'm not opposed to creating room for locale-specific things here but I don't think we need it just for block and grid sizes.

How come? There needs to be some sort of information contained in the Context object. And that information somehow needs to be portable. Were you imagining we could make the Context object ask here internally to get that information somehow?

stonea commented 2 years ago

Could we just add the context information implicitly to here (or some other global) or is the concern that this would be ambiguous in regards to things like nested loops? Either way I like the idea of having some kind of object (whether its something explicit on the loop header or not) that carries information the thread\block info.

I think context should be obtained through different means. Some options that I can think of:

forall (i,j) in Dom with (config context) do   // where we grab the context implicitly from the yielded thing
forall (i,j) in Dom as myForall do   // we name the loop `myForall`. Under the hood `myForall` is the context
label myForall forall (i,j) in Dom do   // put more meaning onto the `label` statement as we have today
forall (i,j) as myContext in Dom do  // we name the induction variable. A better fit for what we mean. Not a great look&feel

I feel like there's got to be some better syntax than any of those you proposed but I'm struggling to come up with it (sorry, I know that's not a very constructive comment). The problem I have with them is it seems ambigious what the new syntax refers to.

We want to make it clear that this context object is some kind of configuration being applied to (or coming out of) the forall itself (as opposed to an operator that's getting applied to Dom for instance).

So for the first and second one I might mistakingly think that the 'with' or 'as' keyword is doing some kind of modification or applying some kind of operation to Dom.

For the fourth one I might think as is applying some kind of tuple-deconstruction to myContext.

Though if you ask around maybe people don't mentally mis-parse this as much as I'm imagining.

The third syntax seems less ambiguous to me but my assumption about a label is that it's a name that would later be supplied to a goto, break, or continue statement.

If I had to pick between your proposed syntaxes I think I'd go with the first one.

forall (i,j) in Dom with (config context) do

Though I think I'd stay away from the word config (you're not using the variable context as an input item to configure the loop right rather its something the iterator is populating and we're using in the body?

Just to brainstorm, what if we had some more general mechanism to iterators that cold "yield" optional named results, for example within the iterator if we were on the GPU locale model we might have:

yield (i,j) with loopInfo = somePrimitiveThatPopulatesTheContextInfo()

Then the user would do something like the following to capture and bind the optional named param:

forall (i,j) in Dom with ctx as loopInfo { ... A[i] = ctx.threadIdxX; ... }


I guess to make this all the more complicated we also want to think of what this will look like when meshed together with whatever syntax we develop for specifying block size (see https://github.com/chapel-lang/chapel/issues/18532)

Seems like what we need is a general way to supply input parameters to iterators and gathering extra optional outputs.

Meshing things together maybe we'd end up with something like:

forall (i,j) in Dom with blockSize(5,5), ctx as loopInfo { ... A[i] = ctx.threadIdxX; ... }


Edit: I see now that you wrote

The flow of information I describe above is 1-way -- outer loops' context needs to be trickled down to the inner ones. But when it comes to GPU execution, we would also like to give user the control over the block/grid size of kernel launches. That would mean having some context information from the user's loop being trickled up to the outer ones.

Maybe I'd need to see an example to understand what this might look like. Were you thinking of something like:

context.blockSize = (5,5)
forall (i,j) in Dom with (config context) do   
e-kayrakli commented 2 years ago

Could we just add the context information implicitly to here (or some other global) or is the concern that this would be ambiguous in regards to things like nested loops? Either way I like the idea of having some kind of object (whether its something explicit on the loop header or not) that carries information the thread\block info.

Yes, potential confusion with nested loops is a part of it. Another part is forall task IDs and relevant queries (eg, what is my left neighbor?) feels tightly connected to how the loop is chunking up the work rather than the locale on which a particular task is running. So, here will have some part to play in this, IMO, but it shouldn't be the main concept that we are leaning on. And probably that part that's played by here will be in the iterator implementation rather than the application code.

I feel like there's got to be some better syntax than any of those you proposed but I'm struggling to come up with it (sorry, I know that's not a very constructive comment). The problem I have with them is it seems ambigious what the new syntax refers to.

We want to make it clear that this context object is some kind of configuration being applied to (or coming out of) the forall itself (as opposed to an operator that's getting applied to Dom for instance).

So for the first and second one I might mistakingly think that the 'with' or 'as' keyword is doing some kind of modification or applying some kind of operation to Dom.

For the fourth one I might think as is applying some kind of tuple-deconstruction to myContext.

I agree with everything here. Though note that the with clause is just relying on the existing syntax, which is arguably prone to the same misinterpretation. So, if the user is familiar with the with clause today, they would intuitively see that that clause is not about the iterand but about the loop as a whole.

The third syntax seems less ambiguous to me but my assumption about a label is that it's a name that would later be supplied to a goto, break, or continue statement.

Yep, that's the use for labels today. Philosophically it is similar though. When I say break myForall; I am referring to the forall as if it was an object. myForall.myTaskID feels OK with that mindset. To be clear that option is still not my favorite.

Just to brainstorm, what if we had some more general mechanism to iterators that cold "yield" optional named results, for example within the iterator if we were on the GPU locale model we might have:

yield (i,j) with loopInfo = somePrimitiveThatPopulatesTheContextInfo()

Oh, I actually really like the yield ... with ... statement. It is much clearer than an arbitrary bundling as I was originally proposing. At first, I would probably disallow assigning within the with clause and require that loopInfo is created before the yield ... with ... statement, so it is like yield (i,j) with loopInfo;. Reads pretty clean to me.

If we want to be fully consistent, the receiving forall could look like:

forall (i,j) with myContext in Dom do

So, the iterator's yield (i,j) with loopInfo and forall (i,j) with myContext looks similar. With a narrow perspective, I like this, as well. But when I think about the with clauses as we have today, things get a bit murky. Is it too ugly to have two withs in one loop "header"? The answer is probably "yes"..

Maybe I'd need to see an example to understand what this might look like. Were you thinking of something like:

context.blockSize = (5,5)
forall (i,j) in Dom with (config context) do 

I don't really know at this point. Thinking out loud, if I were to type something, it'd be probably like

forall (i,j) in Dom with (config context = new Context(blockSize=(5,5))) do

I imagine Context objects to be like Russian dolls. The outermost scope will create a context, and while it is being passed to an inner loop, inner loop's context will be incorporated in the outer's context. And in the body of the inner loop the context is this whole nested thing that represents the loop nest.

So, if you have

forall (i,j) in Dom with (config context) do

The compiler will create a default context for this particular loop and add it to the overall context magically. That context will be referred to as context inside the loop body.

On the other hand:

forall (i,j) in Dom with (config context = new Context(blockSize=(5,5))) do

Will create a custom context with the custom block size, and the compiler will again meld this inner context into the overall context and make it available through context in the loop body. Is it too much magic?


Another syntax alternative:

forall as myForall (i, j) in Dom do

I don't like that it breaks the plain English reading "for all i, j", though.

mppf commented 2 years ago

I'm not opposed to creating room for locale-specific things here but I don't think we need it just for block and grid sizes.

How come? There needs to be some sort of information contained in the Context object. And that information somehow needs to be portable. Were you imagining we could make the Context object ask here internally to get that information somehow?

Because I don't think of block and grid sizes as GPU specific but rather decisions inherent also to vectorization.

mppf commented 2 years ago

Could we just add the context information implicitly to here (or some other global) or is the concern that this would be ambiguous in regards to things like nested loops? Either way I like the idea of having some kind of object (whether its something explicit on the loop header or not) that carries information the thread\block info.

Yes, potential confusion with nested loops is a part of it. Another part is forall task IDs and relevant queries (eg, what is my left neighbor?) feels tightly connected to how the loop is chunking up the work rather than the locale on which a particular task is running. So, here will have some part to play in this, IMO, but it shouldn't be the main concept that we are leaning on. And probably that part that's played by here will be in the iterator implementation rather than the application code.

What if we imagine adding a new thing like here but with a different name that is for loops? For example, we might call it loop or currentLoop or group or something. Or Forall as the original post in this issue proposes. I think we need to keep this option open in our discussion - at least I haven't been convinced yet that we're ready to rule it out.

label myForall forall (i,j) in Dom do // put more meaning onto the label statement as we have today

Looking back at this one, I am not so sure this is an acceptable strategy. Why? I see our goal as having a SIMT model available to programmers as a lower level option. That means that if there is a forall loop creating vectorization or GPU tasks, if there is a function call inside it, within that function, I expect to be able to access the task IDs.

However it would be OK with me if you have to pass some sort of object to the called function. But passing a label as an argument just seems outright wrong.

IMO the contenders are:

I think it's good to continue brainstorming but I also think it would help to start listing the Pros/Cons of these in order to try to choose.

mppf commented 2 years ago

Also besides the more syntactical level @e-kayrakli I'm not sure if you've tried to answer the questions in the original post about which task IDs we want to have. It sounds like, from your implementation idea, you are thinking that we will necessarily have a nesting of task IDs = the number of coforall / foreach loops in the iterator. But since we can transform/normalize them maybe you are not proposing that.

In particular, what do you think about these options from the OP?

At this point we have 2 options:

  1. Create task ids for each level of coforall in which the tasks are created. This has the drawback of creating arbitrary heirarchy in the task ID information that might be difficult for the author of a forall to predict. For example, when using a different locale model, an iterator might create more levels of tasks. The forall might expect something in particular for the type information of the task IDs that does not apply in that configuration.
  2. Flatten the task ids in some manner. This has the drawback of potentially limiting what the user can do in terms of working across different levels of the hierarchy. However at least it is more possible to create code that will work on other configurations. The question is, how many levels of IDs do we consider?
e-kayrakli commented 2 years ago

What if we imagine adding a new thing like here but with a different name that is for loops? For example, we might call it loop or currentLoop or group or something. Or Forall as the original post in this issue proposes. I think we need to keep this option open in our discussion - at least I haven't been convinced yet that we're ready to rule it out.

I didn't mean to rule it out if that was the impression. It is still not my favorite though.

Confusion with nested loops is one thing, but I am ready to set it aside as it is not the most common case. I don't like a symbol referring to a construct in the code. here doesn't represent anything that you can point to in the code and in that sense it is a more abstract context. I think referring to something that I can literally see in the code must be done more explicitly.

label myForall forall (i,j) in Dom do // put more meaning onto the label statement as we have today

Looking back at this one, I am not so sure this is an acceptable strategy. Why? I see our goal as having a SIMT model available to programmers as a lower level option. That means that if there is a forall loop creating vectorization or GPU tasks, if there is a function call inside it, within that function, I expect to be able to access the task IDs.

However it would be OK with me if you have to pass some sort of object to the called function. But passing a label as an argument just seems outright wrong.

That's a good point.

IMO the contenders are:

  • think of it as extra information yielded by the iterator (which also maybe connects to Brad's early comment about utility iterators) (yield (i,j) with loopInfo etc)
  • some sort of loop intent (e.g. with (config))

To be clear, these two are different parts of the design, right? The first is about the iterator syntax, the second is about the forall syntax.

I think it's good to continue brainstorming but I also think it would help to start listing the Pros/Cons of these in order to try to choose.

I will do that soon. I like to be more confident about what's going to happen under the hood, before trying to come to a consensus on how we'll represent that in the language.


Also besides the more syntactical level @e-kayrakli I'm not sure if you've tried to answer the questions in the original post about which task IDs we want to have. It sounds like, from your implementation idea, you are thinking that we will necessarily have a nesting of task IDs = the number of coforall / foreach loops in the iterator. But since we can transform/normalize them maybe you are not proposing that.

In particular, what do you think about these options from the OP?

At this point we have 2 options:

  1. Create task ids for each level of coforall in which the tasks are created. This has the drawback of creating arbitrary heirarchy in the task ID information that might be difficult for the author of a forall to predict. For example, when using a different locale model, an iterator might create more levels of tasks. The forall might expect something in particular for the type information of the task IDs that does not apply in that configuration.
  2. Flatten the task ids in some manner. This has the drawback of potentially limiting what the user can do in terms of working across different levels of the hierarchy. However at least it is more possible to create code that will work on other configurations. The question is, how many levels of IDs do we consider?

@mppf - Ah sorry about not responding to those, though I prepared my design while trying to mentally answer those questions. So they were definitely useful and not ignored.

Although my ideas are not super-concrete at this point, I am proposing nested IDs, so option 1. I understand that the nesting can change depending on the configuration. But I think that's a matter of documentation. If an iterator yields a context along with actual items, it should describe what that context looks like. We can think of making a Context superclass that iterators can inherit from as they wish. Those inherited Contexts can provide portability functions that can handle flattening for example. Things can fall out even nicely if we think of Context as not a superclass but an interface.

More philosophically, I think the design here should be more flexible and less compiler-driven than automatically flattening IDs would imply. I think I've said this elsewhere, but this might increase the work for the iterator implementor, but I don't see that as a big problem. I don't think we'll force every parallel iterator to yield a context, nor every forall to capture those contexts.

e-kayrakli commented 2 years ago

On the GPU meeting we discussed a little bit about this design. One realization was that the context object needs to be a single type at every level. Which I believe is doable. But I think we'll have different types based on the locale model. So, if you are on CPU locale model, the context object doesn't need the capabilities to store blockSize for example. However, if you have GPU locale model, then it needs to be able to carry that information, although it will not necessarily store anything meaningful. Because not every context in GPU locale model execution is GPU context.

Maybe we can implement context objects as part of locale models and very similar to them. There can be a BaseContext, which is inherited by the locale models to create their own execution context. The BaseContext object can provide the portability of context objects in that setup.

mppf commented 2 years ago

One realization was that the context object needs to be a single type at every level.

The reason for this is that an iterator can have runtime choices / conditionals inside of it that choose among several loop nests with different numbers of nested loops. In fact the way that GPU execution of a forall loop works, with a conditional like "is here a GPU sublocale", might already run into this issue.

e-kayrakli commented 2 years ago

In fact the way that GPU execution of a forall loop works, with a conditional like "is here a GPU sublocale", might already run into this issue.

Currently it is based on task's sublocale ID. So we don't have an issue there, I think.

mppf commented 2 years ago

I was mainly just saying that we do have a runtime conditional selecting between different implementations of the forall