Open mppf opened 4 years ago
While I understand the desire for these types of values, these types of discussions always worry me a bit in that (a) they seem to assume or impose some sort of structure on parallel iterators or at least "well-behaved" (by some definition) parallel iterators; (b) as the issue points out, there may be varying levels of depth within a loop or that one might care about.
At a very, very high level, where my head goes here is that for cases that want to know such "task i of n", "locale j of p" values, is there a way that they could opt into getting that information from the loop itself via a utility iterator. Such that rather than having some built-in identifier that I'd refer to to make such queries (like Forall
above), I'd have to opt into it by using such a utility iterator that would yield those values back as part of its interface, along with the original index. So where I might have written forall (i,j) in Dom
, if I wanted to know the task / locale mapping I might say something like forall ((i,j), (task, taskSpace), (loc, locSpace)) in queryLocTaskIter(Dom)
, where taskSpace
might be a range from 0..<numTasks
and locSpace
might be something like a distributed domain's 2D targetLocales domain (so task
would be an integer from taskSpace and loc
would be a 2D index from locSpace
).
I think this has some pretty significant challenges (namely, how do we implement queryLocTaskIter()
such that it interacts with the default iterator of Dom
?), but those challenges seem similar to the ones we'd have to face in any scheme. Meanwhile, I like that (1) it makes it clearer that these values are being generated/yielded by the iterator and (2) that it makes the invocation of the utility iterator itself something of an opt-in for what the loop expects to depend upon these values (rather than relying on compiler inspection of the loop body to determine the same thing). The main thing I'm not crazy about from a user/usability perspective is that it yields taskSpace
and locSpace
on each iteration even though they'll be invariant for the whole loop (typically, at least... I suppose if we had a dynamic iterator that added or removed tasks as it ran, taskSpace might change over time...?).
At a very, very high level, where my head goes here is that for cases that want to know such "task i of n", "locale j of p" values, is there a way that they could opt into getting that information from the loop itself via a utility iterator.
In my view, the main drawback of that approach is that it's required to modify the forall
loop in order to gather the information (rather than, say, being able to find the information from a called function).
Anyway if we did go in this direction it would make more sense to me to have the iterator wrapper yield some sort of object bundling up the information. (In some proposals I have called this loopConfiguration
but here it would be slightly different because it would need to include the current task number - which the other proposals don't necessarily include). That way, it's easy to pass to a subroutine if that is desired.
Just noting that there is some philosophical similarity between this idea and the idea of iterators with "shape" that are currently used to infer the domain in things like var a = myIter()
and to convert to an array in parallel.
@bradcray
At a very, very high level, where my head goes here is that for cases that want to know such "task i of n", "locale j of p" values, is there a way that they could opt into getting that information from the loop itself via a utility iterator. Such that rather than having some built-in identifier that I'd refer to to make such queries (like Forall above), I'd have to opt into it by using such a utility iterator that would yield those values back as part of its interface, along with the original index.
What would you think if we were not using a built-in but a user variable that's used as an intent:
forall (i,j) in Dom with (var forallMagic = new ForallConfig()) { /* bla */ }
where the compiler can initialize forallMagic
. Arguably it is still more compiler-driven than what you're suggesting, but maybe less opaque compared to a built-in (what's the builtin going to look like if I had nested foralls and wanted to know about the outer forall from the innermost block)?
A more sci-fi flavor of that could use the config
keyword:
forall (i,j) in Dom with (config forallMagic) { /* bla */ }
At a high level, using an intent-based approach passes the sniff test for me, like the utility iterator approach, in that it relies less on implicit / not visible in the sources variables and links the query to the loop structure being used.
@mppf -- in your original post and subsequent comments, you sound like you don't like changing the forall "header". Is there any reason for that from performance/functionality or is it more driven by personal taste?
For example, in the tiled matrix transpose example, the first code that doesn't have any of the features that you propose:
// tiled matrix transpose
forall (i, j) in {1..n by t, 1..n by t} {
var tile:[0..#t, 0..#t] int;
tile = Input[i..#t, j..#t];
Output[j..#t, i..#t] = tile;
}
feels pretty clean to me even though you'd have to change the forall iterand. This arguably has a lot more overhead because of slicing. Are you implying that we won't be able to get rid of them and a "lower-level" approach as you propose here would be necessary?
@e-kayrakli - it has to do with the notional idea in Chapel that we can have domain scientists writing "science" and then performance tuning people adjusting how this code runs. In particular we have the idea today that we can write forall
loops and then later on adjust how they are running by changing distributions. But the trouble with changing the "forall header" say to encode the tile size prevents us from making use of distribution-like ideas. For example, if the matrix transpose were distributed, we'd want to run the forall
across a bunch of locales, but now in the tiled version it is over a new domain. And, now the fact that we needed this tiling at all is encoded in what I would imagine is the "science" part of the program where I think it belongs in the "performance / machine tuning / cs" part
I see this as related to #14405 in that the reason I wanted SPMD programming + forall is so that I can keep a certain amount of flexibility (forall can adapt to different distributions etc) while enabling more optimal expression of certain algorithms (like, removing zeros on that issue).
@e-kayrakli - it has to do with the notional idea in Chapel that we can have domain scientists writing "science" and then performance tuning people adjusting how this code runs. In particular we have the idea today that we can write forall loops and then later on adjust how they are running by changing distributions. But the trouble with changing the "forall header" say to encode the tile size prevents us from making use of distribution-like ideas. For example, if the matrix transpose were distributed, we'd want to run the forall across a bunch of locales, but now in the tiled version it is over a new domain. And, now the fact that we needed this tiling at all is encoded in what I would imagine is the "science" part of the program where I think it belongs in the "performance / machine tuning / cs" part
I see. That's a fair point. But in this particular example we can think of the by
operator being applicable to domains and use that to alleviate some of the issues that you mention, right? So the forall header could look like forall (i,j) in Dom by t
or Dom by (t,t)
? While saying that I recognize that your high-level concern can still be true in other cases and cannot be addressed by other language changes like this.
Yeah. Adding by
for domains has other issues but I think it could apply to this specific case. However it still has the (IMO undesireable) property that the forall
loop is indicating the tile size and the body needs to know this tile size. Without things like task IDs/ ability to query block sizes / etc -- the forall loop needs to be modified in the header and the loop body in a coordinated manner to work.
Based on the discussion above, I think we can introduce some sort of a "context" variable. My current thoughts are below. There's some more thinking to be done, but I wanted to check if I am going in the wrong direction.
forall (i,j) in Dom do
<body>
is typically equivalent to some combination of coforall/on/foreach/for statements:
coforall locDom in locDoms do
on locDom do
coforall t in tasks do
foreach (i,j) in myChunk do
yield (i,j) // put <body> here
What we need here is a way for <body>
to make use of the context of the statements in the outer scope of the loop nest. This context can help answer queries like
here
in the iteration space of this loop?here
, where is the task that executes this chunk of iteration?What if there were a special context object that has special semantics when being passed to coforalls and other constructs? Think of this type as having a unique default task intent. This context object can be represented with some sort of a linked structure and can potentially look like:
class Context {
var inductionVar;
var iterand;
var localeSpecificConfig: LocaleSpecConfig;
var outer: Context?;
var inner: Context?;
}
record LocaleSpecConfig { // A GPU specific thing. We can imagine this being a union
// where for GPU locales we have GPU-specific info like this
// otherwise, they could be vectorization-specific info
var blockDim: 3*int;
var blockIdx: 3*int;
var gridDim: 3*int;
var gridIdx: 3*int;
}
Going back to the initial example, the sketch could look something like this:
forall ((i,j), context) in Dom do
<body>
is typically equivalent to some combination of coforall/on/foreach/for statements:
var context: Context;
coforall locDom in locDoms do
on locDom do
coforall t in tasks do
foreach (i,j) in myChunk do
yield ((i,j), context) // `context` is incrementally built as it passes coforall/foreach boundaries
Where the implications are:
I think the iterator side of the story does not need much sugaring up. Parallel iterators are advanced concepts and the example right above is not too bad for me. We can think of making things more explicit:
var context: Context;
coforall locDom in locDoms with (var context = context.newLevel()) do
on locDom do
coforall t in tasks (var context = context.newLevel()) do
foreach (i,j) in myChunk (var context = context.newLevel()) do
yield ((i,j), context) // put <body> here
We can also create a new config
intent (or context
intent, but I don't like that too much).
Things could be even more explicit and custom. I think we can give the iterator implementor as much power as we can at the cost of increased verbosity. There could be a multiresolution-y spectrum there, as well.
I don't like
forall ((i,j), context) in Dom do
<body>
I think context
should be obtained through different means. Some options that I can think of:
forall (i,j) in Dom with (config context) do // where we grab the context implicitly from the yielded thing
forall (i,j) in Dom as myForall do // we name the loop `myForall`. Under the hood `myForall` is the context
label myForall forall (i,j) in Dom do // put more meaning onto the `label` statement as we have today
forall (i,j) as myContext in Dom do // we name the induction variable. A better fit for what we mean. Not a great look&feel
Can we make the context objects more powerful to support queries like
myContext.outer.outer
, but then it may get ambiguous if you have deeper nests of coforallsCan we make allocations based on context? e.g.,
forall (i,j) in Dom as myForall {
var locArr: [{1..n} dmapped myForall.localeContext] int; // a "collective"/shared allocation amongst
// tasks of this locale
}
forall (i,j) in Dom as myForall {
// ....
myForall.localeContext.barrier(); // barrier only local tasks
// ....
}
I think if we have more clear direction for the items here, we can come up with forall
based implementations of tiled matrix multiply, 2D stencils with bulk halo transfer etc.
I'll add some ideas that came up in the following discussion here to avoid them being lost in the potential wall of text.
forall (i,j) in Dom.iterate(gridSize=...) with (config context) do
forall (i,j) in Dom with (config context(gridSize=...) do
yield (i,j) with loopInfo = somePrimitiveThatPopulatesTheContextInfo()
(@stonea)
forall (i,j) with loopInfo
as a corollary (@e-kayrakli)The overall direction here does align with my thinking (although I wonder if the interaction with iterators is fundamental or an implementation strategy).
record LocaleSpecConfig
I'm not opposed to creating room for locale-specific things here but I don't think we need it just for block and grid sizes.
But when it comes to GPU execution, we would also like to give user the control over the block/grid size of kernel launches. That would mean having some context information from the user's loop being trickled up to the outer ones.
I think of it more as information from the loop running a parallel iterator is passed in to the iterator.
forall (i,j) in Dom with (config context) do // where we grab the context implicitly from the yielded thing
forall (i,j) in Dom as myForall do // we name the loop `myForall`. Under the hood `myForall` is the context
label myForall forall (i,j) in Dom do // put more meaning onto the `label` statement as we have today
forall (i,j) as myContext in Dom do // we name the induction variable. A better fit for what we mean. Not a great look&feel
Of these, forall (i,j) in Dom with (config context) do
is my favorite, but I'm not sure I love any of them. The label one is a bit weird but I like that it makes it clear we are naming the loop rather than the domain.
For giving the user control of the grid/block size for a particular forall in order to support GPU execution - we will need this syntactical mechanism to support not only naming the loop information but also providing input to it. We could do that with a separate mechanism (e.g. forall (i,j) in Dom.iterate(gridSize=...) with (config context) do
) or we could try to bring them together (e.g. forall (i,j) in Dom with (config context(gridSize=...)) do
The overall direction here does align with my thinking (although I wonder if the interaction with iterators is fundamental or an implementation strategy).
Yeah, I see what you are saying. I was thinking of this feature as having two interfaces: one facing the application implementer, another facing the iterator implementer. I personally found that it helps creating a full-er framework of the feature mentally.
I'm not opposed to creating room for locale-specific things here but I don't think we need it just for block and grid sizes.
How come? There needs to be some sort of information contained in the Context
object. And that information somehow needs to be portable. Were you imagining we could make the Context
object ask here
internally to get that information somehow?
Could we just add the context information implicitly to here
(or some other global) or is the concern that this would be ambiguous in regards to things like nested loops? Either way I like the idea of having some kind of object (whether its something explicit on the loop header or not) that carries information the thread\block info.
I think context should be obtained through different means. Some options that I can think of:
forall (i,j) in Dom with (config context) do // where we grab the context implicitly from the yielded thing
forall (i,j) in Dom as myForall do // we name the loop `myForall`. Under the hood `myForall` is the context
label myForall forall (i,j) in Dom do // put more meaning onto the `label` statement as we have today
forall (i,j) as myContext in Dom do // we name the induction variable. A better fit for what we mean. Not a great look&feel
I feel like there's got to be some better syntax than any of those you proposed but I'm struggling to come up with it (sorry, I know that's not a very constructive comment). The problem I have with them is it seems ambigious what the new syntax refers to.
We want to make it clear that this context object is some kind of configuration being applied to (or coming out of) the forall
itself (as opposed to an operator that's getting applied to Dom
for instance).
So for the first and second one I might mistakingly think that the 'with' or 'as' keyword is doing some kind of modification or applying some kind of operation to Dom
.
For the fourth one I might think as
is applying some kind of tuple-deconstruction to myContext.
Though if you ask around maybe people don't mentally mis-parse this as much as I'm imagining.
The third syntax seems less ambiguous to me but my assumption about a label is that it's a name that would later be supplied to a goto
, break
, or continue
statement.
If I had to pick between your proposed syntaxes I think I'd go with the first one.
forall (i,j) in Dom with (config context) do
Though I think I'd stay away from the word config
(you're not using the variable context as an input item to configure the loop right rather its something the iterator is populating and we're using in the body?
Just to brainstorm, what if we had some more general mechanism to iterators that cold "yield" optional named results, for example within the iterator if we were on the GPU locale model we might have:
yield (i,j) with loopInfo = somePrimitiveThatPopulatesTheContextInfo()
Then the user would do something like the following to capture and bind the optional named param:
forall (i,j) in Dom with ctx as loopInfo { ... A[i] = ctx.threadIdxX; ... }
I guess to make this all the more complicated we also want to think of what this will look like when meshed together with whatever syntax we develop for specifying block size (see https://github.com/chapel-lang/chapel/issues/18532)
Seems like what we need is a general way to supply input parameters to iterators and gathering extra optional outputs.
Meshing things together maybe we'd end up with something like:
forall (i,j) in Dom with blockSize(5,5), ctx as loopInfo { ... A[i] = ctx.threadIdxX; ... }
Edit: I see now that you wrote
The flow of information I describe above is 1-way -- outer loops' context needs to be trickled down to the inner ones. But when it comes to GPU execution, we would also like to give user the control over the block/grid size of kernel launches. That would mean having some context information from the user's loop being trickled up to the outer ones.
Maybe I'd need to see an example to understand what this might look like. Were you thinking of something like:
context.blockSize = (5,5)
forall (i,j) in Dom with (config context) do
Could we just add the context information implicitly to here (or some other global) or is the concern that this would be ambiguous in regards to things like nested loops? Either way I like the idea of having some kind of object (whether its something explicit on the loop header or not) that carries information the thread\block info.
Yes, potential confusion with nested loops is a part of it. Another part is forall task IDs and relevant queries (eg, what is my left neighbor?) feels tightly connected to how the loop is chunking up the work rather than the locale on which a particular task is running. So, here
will have some part to play in this, IMO, but it shouldn't be the main concept that we are leaning on. And probably that part that's played by here
will be in the iterator implementation rather than the application code.
I feel like there's got to be some better syntax than any of those you proposed but I'm struggling to come up with it (sorry, I know that's not a very constructive comment). The problem I have with them is it seems ambigious what the new syntax refers to.
We want to make it clear that this context object is some kind of configuration being applied to (or coming out of) the forall itself (as opposed to an operator that's getting applied to Dom for instance).
So for the first and second one I might mistakingly think that the 'with' or 'as' keyword is doing some kind of modification or applying some kind of operation to Dom.
For the fourth one I might think as is applying some kind of tuple-deconstruction to myContext.
I agree with everything here. Though note that the with
clause is just relying on the existing syntax, which is arguably prone to the same misinterpretation. So, if the user is familiar with the with
clause today, they would intuitively see that that clause is not about the iterand but about the loop as a whole.
The third syntax seems less ambiguous to me but my assumption about a label is that it's a name that would later be supplied to a goto, break, or continue statement.
Yep, that's the use for labels today. Philosophically it is similar though. When I say break myForall;
I am referring to the forall as if it was an object. myForall.myTaskID
feels OK with that mindset. To be clear that option is still not my favorite.
Just to brainstorm, what if we had some more general mechanism to iterators that cold "yield" optional named results, for example within the iterator if we were on the GPU locale model we might have:
yield (i,j) with loopInfo = somePrimitiveThatPopulatesTheContextInfo()
Oh, I actually really like the yield ... with ...
statement. It is much clearer than an arbitrary bundling as I was originally proposing. At first, I would probably disallow assigning within the with
clause and require that loopInfo
is created before the yield ... with ...
statement, so it is like yield (i,j) with loopInfo;
. Reads pretty clean to me.
If we want to be fully consistent, the receiving forall could look like:
forall (i,j) with myContext in Dom do
So, the iterator's yield (i,j) with loopInfo
and forall (i,j) with myContext
looks similar. With a narrow perspective, I like this, as well. But when I think about the with
clauses as we have today, things get a bit murky. Is it too ugly to have two with
s in one loop "header"? The answer is probably "yes"..
Maybe I'd need to see an example to understand what this might look like. Were you thinking of something like:
context.blockSize = (5,5) forall (i,j) in Dom with (config context) do
I don't really know at this point. Thinking out loud, if I were to type something, it'd be probably like
forall (i,j) in Dom with (config context = new Context(blockSize=(5,5))) do
I imagine Context
objects to be like Russian dolls. The outermost scope will create a context, and while it is being passed to an inner loop, inner loop's context will be incorporated in the outer's context. And in the body of the inner loop the context is this whole nested thing that represents the loop nest.
So, if you have
forall (i,j) in Dom with (config context) do
The compiler will create a default context for this particular loop and add it to the overall context magically. That context will be referred to as context
inside the loop body.
On the other hand:
forall (i,j) in Dom with (config context = new Context(blockSize=(5,5))) do
Will create a custom context with the custom block size, and the compiler will again meld this inner context into the overall context and make it available through context
in the loop body. Is it too much magic?
forall as myForall (i, j) in Dom do
I don't like that it breaks the plain English reading "for all i, j", though.
I'm not opposed to creating room for locale-specific things here but I don't think we need it just for block and grid sizes.
How come? There needs to be some sort of information contained in the
Context
object. And that information somehow needs to be portable. Were you imagining we could make theContext
object askhere
internally to get that information somehow?
Because I don't think of block and grid sizes as GPU specific but rather decisions inherent also to vectorization.
Could we just add the context information implicitly to here (or some other global) or is the concern that this would be ambiguous in regards to things like nested loops? Either way I like the idea of having some kind of object (whether its something explicit on the loop header or not) that carries information the thread\block info.
Yes, potential confusion with nested loops is a part of it. Another part is forall task IDs and relevant queries (eg, what is my left neighbor?) feels tightly connected to how the loop is chunking up the work rather than the locale on which a particular task is running. So,
here
will have some part to play in this, IMO, but it shouldn't be the main concept that we are leaning on. And probably that part that's played byhere
will be in the iterator implementation rather than the application code.
What if we imagine adding a new thing like here
but with a different name that is for loops? For example, we might call it loop
or currentLoop
or group
or something. Or Forall
as the original post in this issue proposes. I think we need to keep this option open in our discussion - at least I haven't been convinced yet that we're ready to rule it out.
label myForall forall (i,j) in Dom do // put more meaning onto the
label
statement as we have today
Looking back at this one, I am not so sure this is an acceptable strategy. Why? I see our goal as having a SIMT model available to programmers as a lower level option. That means that if there is a forall loop creating vectorization or GPU tasks, if there is a function call inside it, within that function, I expect to be able to access the task IDs.
However it would be OK with me if you have to pass some sort of object to the called function. But passing a label as an argument just seems outright wrong.
IMO the contenders are:
yield (i,j) with loopInfo
etc)with (config)
)here
but about loop executiongetCurrentVectorLane
, getTaskWithinLocale
, etc, that have special behaviorI think it's good to continue brainstorming but I also think it would help to start listing the Pros/Cons of these in order to try to choose.
Also besides the more syntactical level @e-kayrakli I'm not sure if you've tried to answer the questions in the original post about which task IDs we want to have. It sounds like, from your implementation idea, you are thinking that we will necessarily have a nesting of task IDs = the number of coforall / foreach loops in the iterator. But since we can transform/normalize them maybe you are not proposing that.
In particular, what do you think about these options from the OP?
At this point we have 2 options:
- Create task ids for each level of coforall in which the tasks are created. This has the drawback of creating arbitrary heirarchy in the task ID information that might be difficult for the author of a
forall
to predict. For example, when using a different locale model, an iterator might create more levels of tasks. Theforall
might expect something in particular for the type information of the task IDs that does not apply in that configuration.- Flatten the task ids in some manner. This has the drawback of potentially limiting what the user can do in terms of working across different levels of the hierarchy. However at least it is more possible to create code that will work on other configurations. The question is, how many levels of IDs do we consider?
What if we imagine adding a new thing like here but with a different name that is for loops? For example, we might call it loop or currentLoop or group or something. Or Forall as the original post in this issue proposes. I think we need to keep this option open in our discussion - at least I haven't been convinced yet that we're ready to rule it out.
I didn't mean to rule it out if that was the impression. It is still not my favorite though.
Confusion with nested loops is one thing, but I am ready to set it aside as it is not the most common case. I don't like a symbol referring to a construct in the code. here
doesn't represent anything that you can point to in the code and in that sense it is a more abstract context. I think referring to something that I can literally see in the code must be done more explicitly.
label myForall forall (i,j) in Dom do // put more meaning onto the label statement as we have today
Looking back at this one, I am not so sure this is an acceptable strategy. Why? I see our goal as having a SIMT model available to programmers as a lower level option. That means that if there is a forall loop creating vectorization or GPU tasks, if there is a function call inside it, within that function, I expect to be able to access the task IDs.
However it would be OK with me if you have to pass some sort of object to the called function. But passing a label as an argument just seems outright wrong.
That's a good point.
IMO the contenders are:
- think of it as extra information yielded by the iterator (which also maybe connects to Brad's early comment about utility iterators) (yield (i,j) with loopInfo etc)
- some sort of loop intent (e.g. with (config))
To be clear, these two are different parts of the design, right? The first is about the iterator syntax, the second is about the forall
syntax.
I think it's good to continue brainstorming but I also think it would help to start listing the Pros/Cons of these in order to try to choose.
I will do that soon. I like to be more confident about what's going to happen under the hood, before trying to come to a consensus on how we'll represent that in the language.
Also besides the more syntactical level @e-kayrakli I'm not sure if you've tried to answer the questions in the original post about which task IDs we want to have. It sounds like, from your implementation idea, you are thinking that we will necessarily have a nesting of task IDs = the number of coforall / foreach loops in the iterator. But since we can transform/normalize them maybe you are not proposing that.
In particular, what do you think about these options from the OP?
At this point we have 2 options:
- Create task ids for each level of coforall in which the tasks are created. This has the drawback of creating arbitrary heirarchy in the task ID information that might be difficult for the author of a
forall
to predict. For example, when using a different locale model, an iterator might create more levels of tasks. Theforall
might expect something in particular for the type information of the task IDs that does not apply in that configuration.- Flatten the task ids in some manner. This has the drawback of potentially limiting what the user can do in terms of working across different levels of the hierarchy. However at least it is more possible to create code that will work on other configurations. The question is, how many levels of IDs do we consider?
@mppf - Ah sorry about not responding to those, though I prepared my design while trying to mentally answer those questions. So they were definitely useful and not ignored.
Although my ideas are not super-concrete at this point, I am proposing nested IDs, so option 1. I understand that the nesting can change depending on the configuration. But I think that's a matter of documentation. If an iterator yields a context along with actual items, it should describe what that context looks like. We can think of making a Context superclass that iterators can inherit from as they wish. Those inherited Contexts can provide portability functions that can handle flattening for example. Things can fall out even nicely if we think of Context as not a superclass but an interface.
More philosophically, I think the design here should be more flexible and less compiler-driven than automatically flattening IDs would imply. I think I've said this elsewhere, but this might increase the work for the iterator implementor, but I don't see that as a big problem. I don't think we'll force every parallel iterator to yield a context, nor every forall to capture those contexts.
On the GPU meeting we discussed a little bit about this design. One realization was that the context object needs to be a single type at every level. Which I believe is doable. But I think we'll have different types based on the locale model. So, if you are on CPU locale model, the context object doesn't need the capabilities to store blockSize for example. However, if you have GPU locale model, then it needs to be able to carry that information, although it will not necessarily store anything meaningful. Because not every context in GPU locale model execution is GPU context.
Maybe we can implement context objects as part of locale models and very similar to them. There can be a BaseContext
, which is inherited by the locale models to create their own execution context. The BaseContext
object can provide the portability of context objects in that setup.
One realization was that the context object needs to be a single type at every level.
The reason for this is that an iterator can have runtime choices / conditionals inside of it that choose among several loop nests with different numbers of nested loops. In fact the way that GPU execution of a forall loop works, with a conditional like "is here
a GPU sublocale", might already run into this issue.
In fact the way that GPU execution of a forall loop works, with a conditional like "is here a GPU sublocale", might already run into this issue.
Currently it is based on task's sublocale ID. So we don't have an issue there, I think.
I was mainly just saying that we do have a runtime conditional selecting between different implementations of the forall
This issue proposes a mechanism whereby the locale, task, and vector lane position within a
forall
loop can be queried within that loop or within a follower iterator implementing that loop. The main question here is whether or not this general direction is worth pursuing.This issue builds upon #14405, CHIP 17, and CHIP 22 and addresses a feature request from this email
Background
Generally speaking a standalone iterator might look like this:
This iterator has 3 portions:
coforall loc
on loc
)coforall t in 1..numChunks
)for i in part
)We are supposing that the 3rd portion also can control vectorization (but see #16404).
In order to enable a SIMT programming model, we would like code invoking this
forall
to be able to create a variable per-vector-lane and to perform reductions across vector lanes. It is already possible to create a per-task variable:Or to reduce across the tasks:
Issue ##16403 proposes that these mechanisms be extended to create per-vector-lane variables and per-task variables.
Proposal
In order to support SIMT programming or SPMD programming it is also necessary to allow the author of a
forall
loop to get task/vector IDs. #14405 talks about one use case for that and CHIP 17 has other examples. In GPU programming this kind of functionality is necessary to get the threads in a warp to combine efforts in order to move data around efficiently. For example, to perform matrix transpose, one might want to copy a block of a matrix to nearby memory; then barrier; and then have the threads collectively store the transpose into the correct memory location.Note that #16406 discusses the barrier specifically.
Specifically, this issue proposes that each
forall
loop create and track several numbers in task-local storage:In addition,
forall
loops can store details of the iteration space in task-local storage.These numbers will make sense for data-parallel forall loops - not all
forall
loops will have meaningful values for these -just as they won't all vectorize.There are quite a few different possible constructions for parallel iterators. Tasks might be created in sublocales. Perhaps no locales are iterated over. Perhaps tasks are created during iteration.
This iterator that creates tasks in a nested way:
This one creates tasks for sublocales:
At this point we have 2 options:
forall
to predict. For example, when using a different locale model, an iterator might create more levels of tasks. Theforall
might expect something in particular for the type information of the task IDs that does not apply in that configuration.I'm pretty sure we need to know the current "Vector lane" or, in GPU terms, the current "thread within the block".
Matrix Transpose Example
Here is an example of a matrix transpose (from https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/ ).
This naive matrix transpose won't have good performance on the GPU due to order of memory operations. What if we want to copy the matrix to some scratch storage first in order to do a tiled transpose?
But, that has the drawback that the
forall
loop needed to be adjusted. Can we use an SPMD/SIMT idea to perform the tiled copy without adjusting the bounds of theforall
? Because each loop iteration can still be responsible for copying one element to the transpose destination, right? (Also, one might want to have aforall
over the elements doing multiple things; like perhaps computing on the result of the transpose after it is done).This example only used the last level of the IDs (for vector lanes / GPU block information). Could we do a similar thing for a distributed matrix transpose?
Open Questions
forall
use this mechanism and get repeatable allocation of iterations to IDs? That is important for the case in #14405. This might be easier if we add work stealing (https://github.com/Cray/chapel-private/issues/1331 is the TODO issue for that) since we wouldn't need to spawn a number of tasks dependent on the number currently running - and would generally create more tasks than cores to enable stealing.