chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.79k stars 420 forks source link

how to mark loops as order independent without adding qthreads-tasks #16404

Closed mppf closed 3 years ago

mppf commented 4 years ago

Split off from a discussion in #7761 driven by #16152.

A forall loop is order independent (by definition, according to the spec). But the iterator implementing a forall statement isn't necessarily order independent. For example, a follower iterator might include a cursor pattern (this comes up today for follower iterators in the Random module):

// imagine serial/standalone/leader iterators too
iter myiter(param tag: iterKind, followThis) ref
where tag == iterKind.follower {
  var cursor: real;
  for i in _value.these(tag, followThis) {
    cursor = nextcursor(cursor);
     yield (i, cursor);
  }
}

Order independence is important in order to vectorize and is related to how we imagine we will support Chapel-native GPU code generation.

So, the question is, how can order independence (or not) be indicated in a follower iterator?

See #16403 strategy for vectorization and GPU support for some design choices leading here.

This issue proposes a new way to indicate that a loop is order independent (besides vectorizeOnly) without creating qthreads-style tasks. This will be suitable to use in follower iterators but could also be used elsewhere. It is important that it works with forall-intents (in, var, and reduce) so that, for example, follower iterators can explicitly handle vectorization scenarios.

This issue will discuss several alternatives for exact syntax for this new strategy.

Design Direction

for loops today

for loops are serial in that they don't create any qthreads-style tasks. However, vectorization might occur as on optimization. In other words, vectorization will occur only when the compiler can prove it is safe to do so (i.e. it won't change the correctness of the program).

A for loop will use a serial iterator.

forall loops today

A forall loops always indicates that the loop body could be run in parallel - but isn't necessarily run in parallel. So, use of the forall keyword indicates that the loop is both order independent and serializable. (See e.g. https://chapel-lang.org/docs/master/language/spec/data-parallelism.html#execution-and-serializability). Vectorization on the inner loops will be hinted to LLVM if the iterators implementing the forall are order independent (after PR #16152).

A forall loop will use a standalone or leader/follower iterators - depending on if zippering occurs.

Follower iterators and vectorization

So, back to the question faced at the outset about order independence in follower iterators. One viewpoint is that a follower iterator should be using forall if it is writing an order independent loop. The problem with doing that is that forall generally means creating qthreads-style tasks (although this is up to the parallel iterator invoked). However, these follower iterators could use a forall that does not create qthreads-style tasks.

However, no matter how an order-independent loop is expressed in the follower, it is important that the compiler can inline this pattern.

vectorizeOnly

Currently we have vectorizeOnly as a way to request vectorization. This currently works with for and forall loops. However, in the context of considering vector lanes to be tasks, this is an odd state to be in:

Could vectorizeOnly be used to mark order-independent loops in follower iterators? Not as vectorizeOnly currently functions:

This issue proposes that vectorizeOnly be deprecated and code using it instead would use a new mechanism.

supporting in var and reduce intents

The no-qthreads order-independent loop syntax needs to support also similar features to forall intents. The reason for this is that the order-independent loop can be vectorized and in that event has the potential to have parallel lock-step tasks implementing it. In order to have those tasks communicate safely in the common cases, they will need reduce intents and also general features along the lines of iteration numbering in #16405.

which iterator is invoked?

Some brainstorming ideas in this direction brought up in #7761 include forall(serial) and adding foreach as a means to do indicate order independence without creating qthreads-style tasks. But, these directions have a hidden challenge. In particular - which iterator would these call? Would we need to have a different leader/follower/standalone for the no-tasks-but-vectorize case?

It would be simplest, in terms of compiler implementation, if the new no-qthreads order-independent loop mechanism called the serial iterator. However, it might be more intuitive if they call a parallel iterator and squash the parallelism.

An additional wrinkle is that a follower iterator can't generally use coforall or forall in a yielding loop (or else we get "invalid use of parallel construct in serial iterator"). The compilation errors for this case would need to be adjusted.

Note that even if the serial iterator is called in this order-independent case - the serial iterator can still adapt to vectorization by itself using the new mechanism along with reduce intents (say).

A further option would be for the new no-qthreads order-independent loop mechanism to call a new iterator overload, if available. Adding such a feature in the future would not be breaking as long as it continues to fall back on whatever we start with. (E.g., if we start with the new mechanism calling the serial iterator; it will still do that as long as the new iterator overload is not present, so won't be backwards-breaking).

Proposals

Proposal 1a: forall and loop configuration

This proposal is to add functionality to the forall loop to avoid creating any qthreads-style tasks and then use this functionality with forall to indicate an order-independent loop in a follower iterator. This approach makes it clear that these order-independent loops can use forall intents to create create per-vector-lane variables or handle reductions across vector lanes.

This proposal adds an idea of loop configuration to a forall loop. The loop configuration will be available to both iterator code being called as well as to the loop body. The iterators being called will need to have an argument for the loop configuration.

This issue further proposes that the forall ... with syntax be able to supply the loop configuration. For a first straw-person example:

forall x in myObject with(orderIndependentOnly=true) { }

What is the Type of a Loop Configuration?

Note that the loop configuration needs to be something that one iterator can pass to another iterator (as it is very common for iterators to call each other).

If #16388 were addressed, we could consider using a tuple with named elements for the loop configuration. In the meantime this issue will propose that the loop configuration is something record-like (but that is not a normal record because of the way the compiler needs to work with it).

It might also be useful to be able to store a loop configuration in a variable in the event that there became quite a few settings. E.g.

var loopConfig = new loopConfig(orderIndependentOnly=true);
forall x in myObject with(loopConfig) { }

However this is not necessary to support at first, as long as the loop configuration can be passed between iterators.

What information is in the Loop Configuration?

I expect that the loop configuration will include the following information:

How does an iterator implementation use the loop configuration?

A follower iterator wanting to indicate it is order independent would use the forall loop with the loop configuration that prevents launching qthreads-style tasks:

proc myiter(... tag=follower, configuration:loopConfiguration, followThis) {
  forall i in 1..n with (orderIndependentOnly=true) {
     yield f(i);
  }
}

The compiler would arrange for forall i in 1..n with (orderIndependentOnly=true) to call the range serial iterator.

A leader iterator would respond to the loop configuration when choosing the number of tasks to create. Instead of this:

proc myiter(... tag=leader) {
  const numChunks = _computeNumChunks(len);
  coforall chunk in 0..#numChunks {
    ...
  }
}

it would be this:

proc myiter(... tag=leader, configuration:loopConfiguration,) {
  const numChunks = _computeNumChunks(configuration, len);
  coforall chunk in 0..#numChunks {
    ...
  }
}

and _computeNumChunks would access a target number of tasks to create from within the loop configuration.

Why not use for loops to express vectorization?

For loops don't support the forall-intents like reduce, in, and var. These will be important for writing vectorized code. Additionally, as we expose SIMT features (see SIMT and SIMD above), these will only be available within forall loops.

Why not an annotation?

One might imagine that instead of writing

forall x in myObject with(vectorizeOnly=true) { }

one could write

@vectorizeOnly
forall x in myObject { }

While that would be OK for things like flags, other loop configuration - such as the number of qthreads style tasks to create - will depend on runtime values. It would be strange to have something like this:

proc f(nTasks: int) {
  @coforallTasks(nTasks)
  forall x in myObject { }
}

Proposal 1b: forall(configuration)

This proposal is the same as the previous proposal but has a syntactic difference.

To write an order-independent loop one would write

var loopConfig = new loopConfig(orderIndependentOnly=true);
forall(loopConfig) x in 1..10

or the shortcut syntax

forall(orderIndependentOnly=true) x in 1..10

Proposal 2: foreach

This proposal adds a new kind of loop, a foreach loop, that is order independent but does not create multiple qthreads-style tasks. The foreach loop supports similar intents to forall loops.

This proposal is inspired by this comment: https://github.com/chapel-lang/chapel/issues/7761#issuecomment-690670109

For example, this would be an order-independent loop without creating qthreads-tasks:

foreach x in myObject { }

How does a follower iterator use the new feature?

A follower iterator wanting to indicate it is order independent would use the foreach loop:

proc myiter(... tag=follower, followThis) {
  foreach i in 1..n {
     yield f(i);
  }
}

The compiler would arrange for foreach i in 1..n to call the range serial iterator.

forall-like intents

It would be possible to use something like forall intents with foreach, e.g.:

// does a vectorization-safe reduction
foreach i in 1..10 with (+ reduce x) { ... }

// copies `a` to a local variable for each vector lane
foreach i in 1..10 with (in a) { ... }

// creates a `b` variable for each vector lane
foreach i in 1..10 with (var b: int) { ... }

Proposal 3: orderIndependent wrapper

For example, this would be an order-independent loop without creating qthreads-tasks:

forall x in orderIndependent(1..10)

Proposal 4: forall(serial)

For example, this would be an order-independent loop without creating qthreads-tasks:

forall(serial) x in 1..10

Note this proposal might not sit well with some answers to how serial impacts vectorization from issue #16403. (Here we would want forall(serial) to vectorize). Additionally it does not make so much sense if we consider vectorization to be creating tasks.

bradcray commented 3 years ago

I really like foreach as a midpoint between for (I'm completely serial) and forall (I'm order independent and potentially/likely use multiple tasks).

bradcray commented 3 years ago

Talking to Michael today, I found myself explaining why I like foreach in more detail, and thought I'd capture that here.

I'll start with the main downsides I see with foreach, which I'd describe as:

I definitely agree that these are concerns, but I don't view them as show-stoppers compared to its benefits. Specifically:

Another proposal that I didn't find too off-putting was Greg's proposal of for unordered i in someIterator() because I think it reads fairly cleanly and has the advantage of being a bit more self-descriptive. However, I don't like it quite as much because:

e-kayrakli commented 3 years ago

I started off thinking that we should avoid adding a new loop flavor, and proposed that we should do something similar to vectorizeOnly as we have it today. But after some discussions I am leaning towards a scheme like the following, which I think was mostly proposed by Greg(?)

for ordered   i in myIterator()       // the loop body is written in a way
                                      // that it must be executed in order, so
                                      // the iterator must have "ordered" tag

// As an alternative to foreach:               
for unordered i in myIterator()       // the loop body is written in a way
                                      // that it can be executed out of order,

for           i in myIterator()       // defaults to ordered

// As an alternative to foriter:
forall ordered   i in myIterator()          // the iterator must be a parallel
                                            // one but must also have "ordered"
                                            // tag

forall unordered i in myIterator()          // the iterator must be a parallel
                                            // one but it doesn't have to have an
                                            // ordered tag

forall         i in myIterator()            // defaults to unordered

I don't feel too entrenched in this, and Brad's arguments for/against sound reasonable to me. My biggest disagreements are:

I feel as though for unordered is virtually equivalent to foreach (in terms of the syntactic placement of markers) while still resulting in a fairly different loop implementation and optimization opportunity compared to for, so I think it tries to make two fairly distinct things look more similar to one another than we should. Maybe put another way, I think of foreach or "unordered one-task loops" as being more similar to forall than to for semantically, so I'd like the concept to look more distinct from for than for unordered does.

If anything, I see the "unordered one-task loops" more similar to for rather than forall. In my view, vectorization is more of a sequential optimization rather than a parallel pne. I may be in the minority in thinking this, but that's why I think the main concept we discuss is more for-like than forall-like. Another more practical way to look at this is if we were to have foreach loop today, we'd be changing bunch of fors in our code to be foreach, and not touch foralls, right?

It might be argued that for unordered i ... looks too syntactically parallel to for param i ... given that unordered is describing something about the loop's nature while param is describing something about i's declaration (which happens to affect the loop's implementation strategy). Of these four bullets, I don't think this one is a major blocker, but we discussed it, so I wanted to capture it.

I am not sure if a follow this, but one can (should?) safely interpret for unordered idx in myIterator() as this loop iterates over "unordered indices". So, the way I see this bullet, it is for for unordered and not against.

From another perspective, one can interpret param in for param idx as something that describes something about the loop's nature (that it is unrolled and not executed as a loop at runtime).

I haven't thought about this parallel between param and unordered until I read your comment, but my reflex was "oh that's great!" :)

mppf commented 3 years ago

for ordered i in myIterator()

I think this is just the same as regular for and so I don't see any need for it to behave differently in any way.

I do like the way that for unordered / forall ordered kindof treats order-independence as its own property (it's "orthogonal"). It doesn't particularly bother me that forall ordered might invoke a different iterator from forall or that for unordered is connecting the idea to for. I mean, foreach and foriter are keywords that connects the idea to for (in a single-word way).

I think my biggest concern with for unordered / forall ordered is that unordered / ordered are more likely to be used as variable/argument names than foreach or foriter are. (@bradcray already brought this up). Related to this is the way that unordered is used as a key term for some manual communication optimization today (e.g. unorderedCopy). We'd have to decide if we want to use the same term for both things (but could rename the package module if needed).

Also I think it's useful here to show a table inspired by @bradcray's post in https://github.com/Cray/chapel-private/issues/1393#issuecomment-738940736

can create tasks? order-dependent order-independent
no for foreach / for unordered
yes forall ordered / foriter usual forall
e-kayrakli commented 3 years ago

I think my biggest concern with for unordered / forall ordered is that unordered / ordered are more likely to be used as variable/argument names than foreach or foriter are. (@bradcray already brought this up). Related to this is the way that unordered is used as a key term for some manual communication optimization today (e.g. unorderedCopy). We'd have to decide if we want to use the same term for both things (but could rename the package module if needed).

I do believe this is a valid concern, but not a big one for me personally. Mainly because I can't imagine any user being frustrated with these two being keywords. In the worst case, they'll search and see what they are used for and go "huh, interesting, so let me rename my variable to inOrder/outOrder etc"

  1. In a performance-oriented language it wouldn't be too surprising to have ordered and unordered as reserved keywords.

  2. For things like unorderedCopy, I don't see this as that big of an issue either. Because "unordered" is a meaningful English word as opposed to, say, "forall". So, it doesn't feel too much like it was unnecessarily overloaded. Also, the meaning of "unordered" in for unordered and unorderedCopy does not feel wildly different. Related is the orderedSet module, but we have had some chats to rename it to sortedSet.

  3. Probably not enough data size, but upon a quick grepping in our tests, Arkouda and CHAMPS, I don't see the word "ordered" or "unordered" used as identifiers, except for some underedCopy and related unit tests.

bradcray commented 3 years ago

The following comment is arguably more related to https://github.com/Cray/chapel-private/issues/1393 than it is to this issue, but since foriter has been mentioned here and this conversation is lively today...

Currently, my main hesitation about options like for unordered or forall ordered relates to why I ended up proposing foriter as a strawman in this comment (for those who can see it). Specifically, the more I thought about cases like a diamond tiling or "sequential across columns / parallel within columns" iterator, the more they felt like they were neither strictly serial/ordered nor parallel/unordered, but some arbitrary and un-characterize-able mix of the two. In that sense, invoking such iterators felt much more like saying "I've abstracted the details of this loop nest into an iterator for convenience / code cleanliness / reuse, but it doesn't fall cleanly into the categories of 'serial' or 'parallel'. So please just invoke it and do whatever it says." From a compilation perspective, the implication to me is that the compiler should inline and reason about the iterator's loop nests directly rather than trying to characterize them via the keyword(s) used to invoke the iterator.

That's what suggested to me that maybe we needed an equivalent of the now-antiquated call statement ("invoke this subroutine"), yet for loop, rather than straight-line, contexts. Essentially a loop form that says "invoke this iterator and do whatever it says; it may have elements of serial and/or parallel execution within it, and I'm not really able to describe them in any meaningful way here so you'll have to refer to the iterator itself for details."

e-kayrakli commented 3 years ago

I am still seeing foriter as a very slight variation from forall, just like I see foreach as a slight variation from for.

A non-rhetorical question: In a world with foriter, can I run a foriter loop over a fully parallel and order-independent iterator, like a range iterator? Wouldn't it behave exactly as forall in that scenario? Or do we expect foriter to check something on the iterator that says "this iterator has some ordering requirements"?

My current thinking is that; if the former, then it'd look like we would be adding a new keyword that has a significant overlap with an existing one. If the latter, then we'd be adding a new keyword only for error checking.

bradcray commented 3 years ago

In a world with foriter, can I run a foriter loop over a fully parallel and order-independent iterator, like a range iterator?

That depends. If it's written as a single iterator, like:

iter rangeLikeIter() {
  ... do something here that's like a range iterator ...
}

then yes, you can. If instead, it's a family of iterators with serial, standalone, leader, and follower variants, then I'd say "no" because it's not clear which variant you're requesting be invoked.

(This also relates to why I was wrestling with whether it ever makes sense to run the diamond or allColumns iterators in a zippered manner, where I'm guessing it pretty much doesn't).

mppf commented 3 years ago

Just for anybody reading this who desires more context about allColumns or diamond tiling iterators:

Currently, a forall loop indicates the property that the loop itself is order independent. As a result we can do the forall-unordered optimization. But, we can think of examples where one would like to have parallel loops that do not carry this property. We have brought up two - diamond tiling and allColumns. Note - the issue with these iterators has more to do with the forall-unordered optimization than with vectorization or GPU support.

Information about the Diamond Tiling iterator can be found here:

https://www2.cs.arizona.edu/~ianbertolacci/publications/ICS2015-Paper-Parameterized_Diamond_Tiling_for_Stencil_Computations_with_Chapel_Parallel_Iterators.pdf

and e.g.

https://github.com/chapel-lang/chapel/blob/master/test/studies/colostate/Jacobi1D-DiamondByHand-Chapel_static.chpl

Here is the allColumns iterator which I constructed as an example.

iter allColumns(...standalone..., width: int, height: int) {
  forall x in 0..<width {
    for y in 0..<height {
      yield (x,y);
    }
  }
}

We suppose that it is used by a loop that assumes that the elements within each column are handled in order, for example:

forall (x,y) in allColumns() {
  ColumnTotals[x] += Image[x,y];
}

This loop is "wrong" if forall means that the loop body is order independent (as it is currently defined to do) but could be OK if we had something like foriter that did not indicate that property about the loop itself.

mppf commented 3 years ago

A non-rhetorical question: In a world with foriter, can I run a foriter loop over a fully parallel and order-independent iterator, like a range iterator? Wouldn't it behave exactly as forall in that scenario? Or do we expect foriter to check something on the iterator that says "this iterator has some ordering requirements"?

My current thinking is that; if the former, then it'd look like we would be adding a new keyword that has a significant overlap with an existing one. If the latter, then we'd be adding a new keyword only for error checking.

I think it'd be reasonable to add an alternative iterator tag that is similar to standalone and that foriter would invoke it (and it could fall back on a serial iterator). I think it's important that the iterator be able to communicate it should only be run with foriter.

This is arguably "adding a new keyword only for error checking" but I think we have lots of keywords only for error checking. Isn't programming language design in large part about trying to make reasonable things easy to express and unreasonable things easy for the compiler to detect and report errors on?

Anyway, I don't think it is wrong for us to add a keyword only for error checking. In fact we have already done that with the lifetime keyword which is strictly only used to impact compile-time error checking. I think it is something we have been willing to do and I think it's reasonable for us to do again if the situation requires it.

However. I don't really believe your argument that foriter would only be added for error checking. If we do not do something about the current situation, one cannot write the parallel loops over allColumns or the diamond tiling iterator, because they will be incorrect (because forall carries the property "the loop itself is order independent"). So, we would be adding foriter so that it is possible to write these patterns at all.

mppf commented 3 years ago

But all of that is arguably getting away from the question at hand. I think the question at hand is this:

Should we add foreach/foriter or should we add for unordered and forall ordered ?

(Note that while I think we are relatively settled on the specific keyword foreach, I think foriter is still a straw-person. Also note that while other ideas have been proposed, at this point I think these two are the ones we are actually considering.)

Here is what I see as the important tradeoffs (and I am happy to update this as discussion continues):

foreach/foriter:

for unordered/forall ordered:

mppf commented 3 years ago

Another way to think of forall ordered is that it indicates a partial order (rather than a total order) on the iterations. However it is my understanding that order by itself is usually interpreted as "total order" rather than "partial order".

So really we would want forall partialOrdered or forall notUnordered or something like that but these multi-word keywords don't seem great here in comparison to foriter.

mppf commented 3 years ago

I guess instead of forall ordered we could use forall graded. Which means approximately the right thing according to one English definition but probably most North Americans will think of school grades instead. We could consider forall partial but it doesn't communicate the concept very well on its own (since it doesn't indicate that it's the ordered-ness of iterations that is partial - one might think you have to write this to be able to break from a forall, or something).

e-kayrakli commented 3 years ago

So really we would want forall partialOrdered or forall notUnordered or something like that but these multi-word keywords don't seem great here in comparison to foriter.

Totally agree.

I don't have any counterarguments against any of this. It seems to me like what is more important to me is not as important for others and vice versa. And what seemed more important to me (or at the very least my main motivation for objecting) is that we would be adding 1-2 new for loop flavors to the language, where we already have 3. And those additions doesn't seem to be distinct enough from the existing ones.

Just wanted to make it clear that if what I am proposing isn't as appealing to all of us as they are to me (which seems to be the case), I have no major issues with foreach and maybe foriter as I can see that there are some real good arguments for them, too. I am less convinced with the name foriter, but I don't have a good alternative at the moment.

bradcray commented 3 years ago

Here are my current thoughts:

mppf commented 3 years ago

forwave 👋

cassella commented 3 years ago

"Some Order-Obeying Threads Here", written forsooth

mppf commented 3 years ago

At this point I think we have enough people who prefer foreach that I'm expecting to go ahead with that.

foriter or forwave or whatever is not settled. I actually like forwave because it's a bit more descriptive about what its purpose.

mppf commented 3 years ago

Should foreach create shadow variables like a forall and so require with (ref x) when modifying an outer variable in the loop body (potentially in parallel) ?

Since the same issue with forall exists in vector code I would lean towards "yes"

bradcray commented 3 years ago

Since the same issue with forall exists in vector code I would lean towards "yes"

I buy that argument.

bradcray commented 3 years ago

@mstrout : Following up on our conversation, I'm seeing on this issue that foriter has been proposed before, which I'd forgotten. Capturing another idea I had today, foracross came to mind because I think in some contexts, doacross loops have been ones that have implied a partial ordering? (would need to check my memory on that).

I'm also realizing that, today, we effectively use for to invoke (say) parallel leader iterators through forwarding; depending on you feel, that either suggests a practice that we should fix once we have this forwave/foriter/foracross concept figured out; or it suggests a precedent for using for to invoke parallel iterators that can be inlined (but I must say that it's never sat well with me... has seemed odd).

bradcray commented 3 years ago

One other side note. I've recently been realizing that there's a relationship between foriter and the yieldall concept we've discussed at times, in that:

yieldall someIter();

is a lot like:

foriter i in someIter() do
  yield i;

Briefly, I was thinking/hoping that yieldall would be sufficient for foriter cases, and less controversial of a concept, but since it's strictly simpler (invokes the iterator without naming the values being yielded; no opportunity for task intents), I think they're still distinct, unfortunately.

stonea commented 3 years ago

The question about "how to mark loops as order independent" has been settled with the answer being "use the foreach statement". So I'm closing this issue.

mppf commented 3 years ago

@stonea - unless I missed it, we still need to create an issue about forwave and what exactly the way to access that functionality is - right? Thanks

stonea commented 3 years ago

I just created: #18587

Feel free to edit\expand upon\flesh out the issue description if you feel I'm missing important details (I don't know the full context of this original discussion).