dmk42 opened this issue 7 years ago
A related question: should the serial statement prevent vectorization? #11798
I just edited the hypothesis to take into account task-local storage, which also prevents vectorization.
Here is a sketch of some code you'd think would be vectorizable but would fail unless task-local storage is detected:
config param n = 100000;
var A: [1..n] atomic int;        // relaxed memory order intended throughout
forall i in 1..n {
  A[i] = task_id;                // task_id: stand-in for a per-task identifier
  var nThisId = 0;
  if A[1] == task_id then nThisId += 1;
  if A[2] == task_id then nThisId += 1;
  if A[3] == task_id then nThisId += 1;
  ...
  if A[n] == task_id then nThisId += 1;
  A[i] = -1;
  assert(nThisId == 1);
}
Nonvectorized for loops in parallel iterators. The iterator itself might assume that vectorization is not occurring in this loop.
I am observing this case with code in the Random module:
In particular, the loop

for i in innerRange do
  yield randlc(resultType, cursor);

has a loop-carried dependency and is not order-independent at all (note that randlc has inout intent for the second argument). This makes me think that the loops inside of iterator implementations need to opt in to being vectorizable somehow (say, with vectorizeOnly) or to opt out of being vectorizable (with some other mechanism).
The other alternative is for the leader iterator to generate a different follower per vector lane.
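To sketch that alternative (hedged: vecWidth stands in for a vector-lane-count query, remainder handling is omitted, and real leaders typically yield 0-based ranges for followers to translate):

iter myParIter(r: range, param tag: iterKind) where tag == iterKind.leader {
  param vecWidth = 4;            // stand-in for querying the number of vector lanes
  const per = r.size / vecWidth;
  for lane in 0..#vecWidth do
    yield lane*per..#per;        // a separate follower chunk per vector lane
}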
I think that the compiler code here is doing this, and it came from PR #1490.
It seems to me that, to the extent that forall implies "able to be vectorized", it's really saying something about the user's code in the body of the loop, not the underlying implementation of the loop iterators themselves. So, for example, in:
forall i in myParIter(1..n) do
  A[p1[i]] = B[p2[i]];
the forall would suggest that there aren't any dependencies between the permutation vectors p1 and p2 that should break vectorization, but not anything about the loops used to implement myParIter(), which I think may or may not be amenable to vectorization through normal analysis, independent of the loop body (e.g., ignoring what happens after the yield point). So if myParIter() just had loops like for i in mylo..myhi do yield i; it would be vectorizable, but if it had a loop like the one in a chained random number generator, it wouldn't be.
Where I think my original characterization breaks down is in the presence of task-local variables caused by with-intents like in, reduce, var/const, etc., since those could similarly be used to generate per-task loop-carried dependences, like:
forall i in myParIter(1..n) with (var randCursor = seedRandCursor()) do
  A[p1[i]] = getNextRand(randCursor);
@bradcray - thanks for your response
which I think may or may not be amenable to vectorization through normal analysis
I'm not sure what normal analysis refers to specifically. Today, I think the Chapel compiler front-end relies primarily upon the user indicating whether or not something is parallel. We don't have lots of logic to check for order independent loops etc. There are many cases in which a C compiler (or LLVM opts) will vectorize, but that does not help with my current task. What I'm trying to do is to adjust the Chapel compiler to do a better job of giving vector hints to LLVM and RV. I want to give hints in the situations that the user has effectively requested vectorization.
We could certainly check for some easy cases, like the case in which the follower iterator has a simple loop along the lines of for i in mylo..myhi do yield i. But I don't think we want the Chapel compiler front-end to get into checking for loop-carried dependencies.
For this reason - and also the fact that implementing a parallel leader/follower iterator is a relatively advanced feature - I am leaning towards only having the Chapel compiler give vectorization hints for such loops in the event that the iterator author somehow opts in to it, for example by writing the loop with vectorizeOnly.
But even vectorizeOnly is not quite right, because whether or not that loop can be marked for vectorization depends on how the iterator is invoked. For example, in serial zippering we can have follower iterators run as part of a for loop. What the iterator author needs to indicate is that the loop in the iterator itself could be vectorized. If the use of the iterator is a vectorizeOnly or forall, then it would in fact be vectorized; but if it was a (zippered) for, it would not.
Does having iterator authors indicate this information make sense to you? (Whether or not we choose something along the lines of vectorizeOnly to do so?)
Where I think my original characterization breaks down is in the presence of task-local variables caused by with-intents like in, reduce, var/const, etc. since those could similarly be used to generate per-task loop-carried dependences, like:
forall i in myParIter(1..n) with (var randCursor = seedRandCursor()) do
Yes, we already have analysis to look for this sort of thing. We basically disable vectorization in such cases, but another interesting approach would be to create a different copy of randCursor for each vector lane.
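To sketch the per-lane-copy idea (purely illustrative: the width of 4, cursorType, and the seedRandCursor/getNextRand helpers from the example above are all assumptions), the compiler might conceptually rewrite one task's chunk as:

param vecWidth = 4;                          // assumed vector width
var laneCursors: [0..#vecWidth] cursorType;  // one independent cursor per lane
for lane in 0..#vecWidth do
  laneCursors[lane] = seedRandCursor();
for i in myLo..myHi do                       // now vectorizable: no shared cursor
  A[p1[i]] = getNextRand(laneCursors[i % vecWidth]);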
I'm not sure what normal analysis refers to specifically.
In writing this, I was thinking of auto-vectorization in the back-end compiler (though the Chapel compiler could potentially help with this kind of analysis as well, with effort). More broadly, I meant "normal auto-vectorization analysis and optimization" (whoever does it).
in serial zippering we can have follower iterators run as part of a for loop.
By serial zippering, do you mean for (a,b) in zip(A,B) ... (which I don't think should ever invoke a follower iterator, unless the iterator author chose to have their serial iterator forward to the follower iterator?) or serial true forall (a,b) in zip(A,B) ... (which probably would, but that might arguably be OK since it's still within a forall loop)?
I'm still OK with having a vectorizeOnly() style iterator in the language/library and making use of it in cases like this (assuming we can make it behave appropriately, obviously).
By serial zippering, do you mean for (a,b) in zip(A,B) ... (which I don't think should ever invoke a follower iterator, unless the iterator author chose to have their serial iterator forward to the follower iterator?) or serial true forall (a,b) in zip(A,B) ... (which probably would, but that might arguably be OK since it's still within a forall loop)?
Sounds like I was wrong about that. (I was talking about the first case).
I'm still OK with having a vectorizeOnly() style iterator in the language/library and making use of it in cases like this (assuming we can make it behave appropriately, obviously).
I'll have a look at doing this.
Sounds like I was wrong about that. (I was talking about the first case).
We have talked about introducing serial leader/followers to implement zippered for loops as a means of permitting users to write unbounded follower iterators that are happy to conform to the leader (like, if a user wanted to write their own unbounded range which will yield as many things as the leader needs), but haven't gone down that path yet. When we have discussed that, it's been theorized that the same follower iterator might be usable for both zippered for and forall loops, so your point could suggest that that was naive in a forall-implies-vectorization world; or it might suggest that the follower iterator should get passed some sort of cue as to whether it's being called in a for or forall context...?
When we have discussed that, it's been theorized that the same follower iterator might be usable for both zippered for and forall loops, so your point could suggest that that was naive in a forall-implies-vectorization world; or it might suggest that the follower iterator should get passed some sort of cue as to whether it's being called in a for or forall context...?
Well, the vectorization will only make sense (AFAIK) if iterator inlining occurs. In that event, the compiler can see both the loop in the follower iterator and the loop invoking it; and it could mark the generated loop as not-vectorizable if the invoking loop is a for - even if the follower iterator had a vectorizeOnly loop. But as you say, this is not something we have to worry about right away.
I can't literally use vectorizeOnly because it is defined to always run the serial iterator, but the common case in which I need to call it is in a standalone or follower iterator when forwarding to another standalone or follower (so I would need something to run the standalone or follower iterator).
I'm not understanding. I thought the idea was to apply vectorizeOnly() to the (currently) serial for loops within standalone or follower iterators (say). So where the range's follower iterator currently says something like:
for i in low..high by stride {
  yield i;
}
(where r is a range), I thought the plan was to make it say:
forall i in vectorizeOnly(low..high by stride) {
  yield i;
}
In your edit, you mention the case of forwarding to another standalone or follower as being the problematic case—can you make that more concrete?
For example, we have an iterator forward in ChapelArray here:
Or this simple example from my own tests:
iter testveciter(param tag: iterKind, followThis) where tag == iterKind.follower
{
  for i in (1..n).these(tag=tag, followThis=followThis) {
    yield i;
  }
}
This forwarding pattern is problematic:
- As written, it is a for loop, and that implies that it is not vectorizable.
- If I instead write for i in vectorizeOnly(tag=tag, followThis=followThis, _value), then I immediately run into problems because vectorizeOnly always uses the serial iterator.

The problem comes up with iterator forwarding patterns for standalone and follower iterators.
Perhaps this iterator forwarding pattern (which is identifiable by the tag argument being passed) could propagate the order-independent-ness from the forwarded-to iterator. I don't think it would be general enough to rely on the body of the loop consisting entirely of yield i. Additionally, I can imagine a case where one might want to do the iterator forward but also have a non-order-independent loop. For example, what if the follower iterator for something looked like this:
// imagine serial/standalone/leader iterators too
iter myiter(param tag: iterKind, followThis) ref
    where tag == iterKind.follower {
  var cursor: real;
  for i in _value.these(tag, followThis) {
    cursor = nextcursor(cursor);
    yield (i, cursor);
  }
}
So I think even in an iterator forward situation, the author of the iterator using another iterator needs to be able to turn off order-independent-ness of the loop. Perhaps they need to always opt in to vectorization (so not indicating order-independent-ness would turn it off); or perhaps the default should be based upon the forwarded-to iterator, with a way to explicitly turn it off.
Conceptually, I wish that we had something that could mark a for loop as order independent and not have any other significant impact (semantically or syntactically). For example, suppose we had for and vecfor keywords; then these for loops that are themselves order independent could simply be marked with vecfor, both in current vectorizeOnly use cases as well as in these iterator forwarding scenarios. (Adding a new keyword isn't necessarily an appropriate solution here; the point is just that it would make it trivially easy for these loops to indicate order-independent-ness.)
I suppose we could consider changing forall i in vectorizeOnly(...) to do nothing and be the same as forall i in .... I think this would allow vectorizeOnly to be used in these iterator forwarding scenarios. Or, we could add a new variant similar to vectorizeOnly, like vectorizeOK, that doesn't change whether something is parallel but does indicate order independence. But iterator forwarding isn't particularly obvious/intuitive, and adding vectorizeOnly to it will only make it harder. Additionally, I have some concern about the impact to compile time from adding one level of indirection to most of our iterators.
Looking at uses of vectorizeOnly in the tests, I can see one place where making forall vectorizeOnly the same as forall would make it harder to write a pattern:
Here it appears that forall is used with vectorizeOnly so that it can include a reduction. Perhaps that is evidence that for loops should allow intents.
Anyway I am not sure what a reasonable approach is here. In the implementation effort, I think I will take the near-term step of relying on a function-level pragma on key internal iterators.
(Posting after discussion with @ronawho). It sounds like @ronawho, @bradcray, and I are on the same page: the iterator authors are responsible for marking the loops in the iterator as order independent when appropriate.
I'm thinking that the compiler will include a simple analysis that finds loops like for i in _ {yield i;} and marks those as order independent (the key here being that the body of the loop is just a yield). This will cover a lot of the iterator forwarding cases we have.
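For example, a follower along the lines of the earlier testveciter would be caught by this pattern match:

iter wrapRange(param tag: iterKind, followThis) where tag == iterKind.follower {
  for i in (1..100).these(tag=tag, followThis=followThis) do
    yield i;   // the body is only a yield, so the loop can be marked order independent
}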
Then, we'll need some kind of syntax for marking a loop as order independent more generally. vectorizeOnly isn't quite the right thing because 1) the iterator author needing to use it isn't exactly opting in to vectorization and 2) it currently only runs serial iterators.
So what might the syntax be? I think it'd be nice if the syntax made it easy to mark iterator forwarding and zippered iteration cases. How would the following iterator forwarding and zippered cases be marked order independent?
for i in (1..n).these(tag=tag, followThis=followThis)
for (i,j) in zip(expr(), b.c.iterate(x))

One option is an orderIndependent() iterator wrapping the whole iterand, keeping zip:

for i in orderIndependent((1..n).these(tag=tag, followThis=followThis))
for (i,j) in orderIndependent(zip(expr(), b.c.iterate(x)))

Or, orderIndependent could itself accept the things to iterate over, standing in for zip:

for i in orderIndependent((1..n).these(tag=tag, followThis=followThis))
for (i,j) in orderIndependent(expr(), b.c.iterate(x))

Or, change the for keyword to forall(serial):

forall(serial) i in (1..n).these(tag=tag, followThis=followThis)
forall(serial) (i,j) in zip(expr(), b.c.iterate(x))

Or, change the for keyword to serial forall:

serial forall i in (1..n).these(tag=tag, followThis=followThis)
serial forall (i,j) in zip(expr(), b.c.iterate(x))

Or, use an annotation:

@order-independent
for i in (1..n).these(tag=tag, followThis=followThis)
@order-independent
forall (i,j) in zip(expr(), b.c.iterate(x))

(see also #14141)

Or, define for within an iterator to always mean order independent (using while when order matters), so these could stay as they are:

for i in (1..n).these(tag=tag, followThis=followThis)
for (i,j) in zip(expr(), b.c.iterate(x))

but... how to write these using while is not so clear.

(other ideas?)
My two cents:

I see a vectorized loop much closer to for than forall, so I am not a big fan of forall-based options.
I like the annotation-based approach most, but it may be too big a design discussion compared to the others. On a side note, I wonder whether it would be possible to extend our pragma support for such cases:
pragma "order independent"
for i in (1..n).these(tag=tag, followThis=followThis)
Looks better to me.
If we go with orderIndependent, I prefer that we keep zip; I see them as orthogonal, and as such zip should remain. If orderIndependent is too long, we can think about shorter but arguably less clear alternatives: orderless, unordered, noorder, reorder.
I see a vectorized loop much closer to for than forall, so I am not a big fan of forall-based options.
Engin, can you say more about why this is? What makes it seem more for-like than forall-like? Is it simply that vectorizing compilers have typically taken the approach of auto-vectorizing serial for loops, or is there more to it than that?
On a side note, I wonder whether it would be possible to extend our pragma support for such cases:
I think others know this about me, but whenever our (or another language's) solution to something is "we'll do this through pragmas", it makes me think "they stopped designing the language too soon."
Regarding Michael's comments on the forwarding cases a few weeks back:
- would having a yieldall concept in the language (rather than using a serial for loop to forward a potentially parallel iterator) help with this situation at all?
- could the "sometimes vectorizable / sometimes not" aspect of an iterator be indicated via an explicit param argument that the author of the iterator builds into their implementation? The main drawback I can think of is that I want to forward to an iterator that didn't think to build that capability into itself and uses the wrong vectorization choice for me. But then presumably I shouldn't/couldn't forward to that iterator to begin with. Is that inherently problematic?

would having a yieldall concept in the language (rather than using a serial for loop to forward a potentially parallel iterator) help with this situation at all?
Sure it would, because a very common situation in these iterators is to forward to another iterator. Supposing you could write something like yieldall otherIterator - you might not even need to write a follower iterator. If you did need to write a follower iterator, then maybe you could write yieldall (1..n, followThis=followThis) (or something) and the yieldall part would make it clear that the order-independent-ness should be taken from the other iterator. (Unlike for i in (1..n).these(tag=tag, followThis=followThis), it does not imply serial execution.)
But, to a large extent, the pattern matching on iterator forwarding covers this case OK.
I'm thinking that the compiler will include a simple analysis that finds loops like for i in _ {yield i;} and marks those as order independent (the key here being that the body of the loop is just a yield). This will cover a lot of the iterator forwarding cases we have.
Could the "sometimes vectorizable / sometimes not" aspect of an iterator be indicated via an explicit param argument that the author of the iterator builds into their implementation? The main drawback I can think of is that I want to forward to an iterator that didn't think to build that capability into itself and uses the wrong vectorization choice for me. But then presumably I shouldn't/couldn't forward to that iterator to begin with. Is that inherently problematic?
I'm not sure what you're referring to specifically. But generally speaking, I don't think this situation comes up that often. And anywhere we could pass a param argument, we could make a different overload depending on it, so that the loops are either order-independent or not. That said, I think I'd need to see an example to understand the idea better.
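For what it's worth, a minimal sketch of the overload-on-param idea (all names here are made up for illustration):

iter myNumbers(param orderIndep: bool) where orderIndep == true {
  for i in 1..10 do
    yield i;   // this overload's loop would be the one marked order independent
}
iter myNumbers(param orderIndep: bool) where orderIndep == false {
  for i in 1..10 do
    yield i;   // ordinary serial loop; no order-independence claim
}
// callers select the behavior at compile time:
for x in myNumbers(orderIndep=true) do
  writeln(x);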
Engin, can you say more about why this is? What makes it seem more for-like than forall-like? Is it simply that vectorizing compilers have typically taken the approach of auto-vectorizing serial for loops, or is there more to it than that?
I think that's somewhat related to where I come from.
At a fundamental level, forall being a data-oriented parallel loop implies to me that it is locality-aware. From more of a Chapel implementer perspective, it is a loop where a special iterator on a data type is invoked that may do many custom things as well as moving tasks around.
OTOH, the parallelism implied by vectorization is almost a sequential operation from a high-level programming language's perspective (and this is similar to what you are saying). It is only parallel by virtue of vector registers and not some task/thread-level parallelism introduced by Chapel's language stack.
OTOH, the parallelism implied by vectorization is almost a sequential operation from a high-level programming language's perspective (and this is similar to what you are saying). It is only parallel by virtue of vector registers and not some task/thread-level parallelism introduced by Chapel's language stack.
My 2 cents: You probably want two different ways to drive vectorization:

1. A hint telling the compiler that it may auto-vectorize a loop that otherwise has serial semantics.
2. A forall loop that treats the loop body as SIMT-parallel code. Vectorization in this setting basically means executing the instructions of the loop body in parallel with a lock-step schedule. Different from 1., this is not just a hint and not even a loop construct anymore - it's SIMT programming.

Option 1 helps the compiler to auto-vectorize loops that have serial semantics. I see the appeal, but it's also limiting because you are stuck with serial loop iteration semantics. Option 2 allows you to do ISPC/RV-like programming where you can have warp shuffles, reductions across threads realized with SIMD instructions, etc. This is a powerful programming model that is not possible with Option 1 alone.
@simoll - thanks very much for your comment! It is very helpful.
I was also thinking about how to respond to @e-kayrakli's post because I feel it is starting to take sides on an issue that we have so far not really engaged with. The question is - for a Chapel user - is a vector lane within a forall loop a task?
I don't think this is something we've really debated yet. It is also something that is typically answered one way in a CPU/SIMD world and a different way in a GPU/SIMT one.
The "Note about task-local storage" in the issue description gets to part of it. The language-facing features that impact it are forall intents - specifically in
intent (which one might view as creating task local storage) and reductions.
We will need in and reduce intents on a forall to function as we improve vectorization. In order for that to work, there are two strategies:

1. Disable vectorization for such loops.
2. Create a per-vector-lane copy (of the reduction variables and of in intent variables).

I think that in order to get to reasonably good GPU code generation, we have to choose 2 above.

This gets into a philosophical question of whether or not we would call each vector lane a task. I would personally lean towards "yes" - I would consider them a restricted sort of task. However, we could do 2 above without doing that, because the language itself is not tied to the term "task-local variable" for an in intent on a forall (say). (Of course, either way, there are documentation changes to do.)
Generally speaking, I'm open to thinking of vector lanes as tasks and having current "task-local" concepts extend to vector lanes. That said, I'm not an expert in vectorization, so this is just my gut reaction, not a particularly well-informed thought.
In any case, I don't think we should feel particularly constrained to thinking of a task as "something that's executed by a qthread" (when CHPL_TASKS=qthreads), but should think more generally about a task as "some computation that could/should be done in parallel with other tasks" by whatever mechanism.
Capturing a few stray thoughts after discussing this comment with @mppf:
Forall already suggests "these iterations are order-independent and could/should be executed in parallel" to me (and to the spec). It does not necessarily say "use qthread-style tasks to implement that parallelism" but rather "invoke the iterand's parallel iterator method." To that end, forall ... orderIndependent... seems a little redundant to me. Meanwhile, for ... orderIndependent seems less redundant, but also a little weird: "do this serially, but not really (wink wink, nudge nudge)." Particularly if we use the previous few comments to think of vector lanes as an alternative mechanism for implementing tasks, it seems like we're abusing for loops a little (and, practically speaking, for loops don't currently support with + in or with + reduce task intents).
Looking at some of the serial forall or forall(serial) variants made me wonder whether we'd want to introduce a foreach keyword that essentially was a midpoint between the for and forall iterators and essentially signaled "don't spawn off multiple qthreads-style tasks, but feel free to use other forms of HW parallelism (e.g., vector lanes, SIMD, GPUs) to implement this iterator." Such an approach might also go hand-in-hand with tackling this by introducing an optional new kind of iterator overload, in that "for invokes the serial iterator", "forall invokes a parallel iterator", and "foreach invokes a vector-/GPU-friendly iterator."
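To make that concrete, a loop under this proposal might read as follows (sketching the hypothetical keyword: no qthreads-style tasks are spawned, but the iterations are declared safe for vector lanes / SIMD hardware):

config const n = 100;
var A, B: [1..n] real;
foreach i in 1..n do
  A[i] = 2.0 * B[i];   // order independent, but no new tasks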
OTOH, if all we're doing is supplying a recipe for how a parallel loop with no ordering constraint on its iterations should be implemented, why would that not simply be a specific parallel iterator which happens not to spawn any tasks? (Since any parallel iterator can choose not to do so.) I.e., maybe rather than forall i in orderIndependent(...) I should be saying something like forall i in simdStyleTasksOnlyPlease(...) to avoid the redundancy (with the understanding that that's not actually a good name for it).
Generally speaking, I fear that whatever approach I back most fervently here, my opinion will change once we have more experience with it and I see how it impacts existing iterators (specifically ones on range, DR domains and arrays, and block domains and arrays). To that end, I feel most comfortable taking the least impactful (on the language) approaches for now (e.g., a user-level iterator or iterator-looking thing like orderIndependent by whatever name; or the current pragma Michael has) in order to get that experience and be able to learn from it before locking anything too impactful in (like a new keyword or language feature).
Every time I see for i in ... do yield i, I regret that we don't have a yieldall concept yet, despite having kicked it around for years (and years before the issue was ever filed). I realize that's reasonably orthogonal to the discussion here, but since this issue happens to contain such examples...
Capturing a note from Michael: when I was wondering whether yieldall would help or automatic propagation of order-independence would be sufficient, he pointed out that an array iterator written like:
iter foo() {
  for i in orderIndIter() do
    yield myData[i];
}
would not be a candidate for yieldall or automatic propagation, because even though the iterator being invoked may be known to be order independent, the fact that we're not doing just yield i but yield ...some function of i... means that the iterator author would need to step in and assert that this is still order-independent. As a counter-example, if I wrote:
iter foo() {
  for i in orderIndIter() do
    yield infile.read(int);
}
I'm still invoking the same order-independent iterator, but presumably can't have the iterations execute in arbitrary order due to the use of a file. This was a helpful example to think about.
The previous thought came from thinking that "95% of our follower iterators probably are order-independent" which made me wonder "to what extent could we just make that the easy/common case?"
At the same time, I don't want to go so far as to say for means "order-independent" within the context of an iterator (Michael's "for always is order independent in an iterator; use while otherwise"), because I think it's too subtle and results in weird behavioral changes when taking top-level code and refactoring it into an iterator. So that one is off the table for me.
The recent part of this discussion (indicating order independence without creating qthreads-type tasks) is happening in #16404.
This started as a conversation between @dmk42 and @mppf. It is being placed in an issue so we can expand on it and refer back to it, and so others can comment if desired.
Hypothesis: Writing a forall loop that contains no synchronization or task-local storage constructs amounts to an assertion that there are no loop-carried dependencies. Therefore, it should be safe to vectorize these.

The qualification that the loop contains no synchronization constructs is important. A forall loop might contain a critical section guarded by sync variables. In the general case, this would preclude vectorization because there would be portions of the loop during which other iterations are not expected to be running.

Tricky cases to think through:
- Reductions. If sum += a[i] is not recognized by the code generator as a reduction, attempting to vectorize the code would cause a data race. If it is recognized as a reduction, sum can be promoted to the vector width and a final combining step can be performed after the end of the loop, assuming sum is not used elsewhere within the loop. Note that the compiler currently creates loop-carried dependencies for reductions in forall loops. (A sketch of the promotion idea appears at the end of this description.)
- Nonvectorized for loops in parallel iterators. The iterator itself might assume that vectorization is not occurring in this loop.
- Task-local variables within a forall loop. This may benefit from some ideas from the GPU world. If we had a way to query the number of vector lanes available, and a way to query which lane we are in, we could store such variables in an array or tuple containing vector-width elements.

Note about task-local storage: when using task-local storage, a forall loop might create dependencies from one iteration to another when the iterations run on the same task. A reasonable alternative strategy here would be to make common idioms for task-local storage in forall loops create per-vector-lane storage. Forall intents using in, declaring a variable, or doing a reduction should probably work this way.
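For concreteness, here is the promised sketch of the reduction-promotion idea, assuming (purely for illustration) a fixed vector width of 4:

config const n = 1000;
var a: [0..#n] real = 1.0;
param vecWidth = 4;                 // assumed vector width
var partials: [0..#vecWidth] real;  // sum promoted to one accumulator per lane
for i in 0..#n do
  partials[i % vecWidth] += a[i];   // lanes accumulate independently
const sum = + reduce partials;      // final combining step after the loop
writeln(sum);                       // prints 1000.0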