chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Vectorization considerations #7761

Open dmk42 opened 6 years ago

dmk42 commented 6 years ago

This started as a conversation between @dmk42 and @mppf . It is being placed in an issue so we can expand on it and refer back to it, and so others can comment if desired.

Hypothesis: Writing a forall loop that contains no synchronization or task-local storage constructs amounts to an assertion that there are no loop-carried dependencies. Therefore, it should be safe to vectorize these.

The qualification that the loop contains no synchronization constructs is important. A forall loop might contain a critical section guarded by sync variables. In the general case, this would preclude vectorization because there would be portions of the loop during which other iterations are not expected to be running.
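As a concrete sketch of such a loop (all names here are invented for illustration, not taken from the discussion):

```chapel
// Hypothetical sketch: a forall whose body contains a critical section
// guarded by a sync variable. One iteration may block until another
// iteration releases the lock, so lock-step vector execution of the
// iterations would be unsafe.
var lock: sync bool = true;   // full means unlocked
var total = 0;

forall i in 1..100 with (ref total) {
  const x = i * i;     // this computation alone would be fine to vectorize
  lock.readFE();       // acquire: leaves the sync variable empty
  total += x;          // critical section serializes the iterations
  lock.writeEF(true);  // release: refill the sync variable
}
```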

Tricky cases to think through:

Note about task-local storage: when using task local storage, a forall loop might create dependencies from one iteration to another when the iterations run on the same task. A reasonable alternative strategy here would be to make common idioms for task-local-storage in forall loops create per-vector-lane storage. Forall intents using in, declaring a variable, or doing a reduction should probably work this way.
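The idioms mentioned above might look like the following (a hedged sketch; variable names are invented). Each with-clause creates per-task state today; under the alternative strategy, each vector lane would instead get its own copy:

```chapel
var nFound = 0;
var scratch: [1..8] int;

forall i in 1..1000 with (in scratch,         // per-task copy of scratch
                          var localMax = 0,   // task-private variable
                          + reduce nFound) {  // per-task reduction state
  if i % 7 == 0 then nFound += 1;
  localMax = max(localMax, i);
}
```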

mppf commented 5 years ago

A related question: should the serial statement prevent vectorization? #11798

dmk42 commented 5 years ago

I just edited the hypothesis to take into account task-local storage, which also prevents vectorization.

mppf commented 5 years ago

Here is a sketch of some code you might think would be vectorizable but that would fail unless task-local storage is detected:

config param n = 100000;

var A: [1..n] atomic int;  // sketch: accesses intended as relaxed (per-operation memoryOrder.relaxed)
forall i in 1..n {
  A[i] = task_id;
  var nThisId = 0;

  if A[1] == task_id then nThisId += 1;
  if A[2] == task_id then nThisId += 1;
  if A[3] == task_id then nThisId += 1;
  ...
  if A[n] == task_id then nThisId += 1;

  A[i] = -1;

  assert(nThisId == 1);
}
mppf commented 4 years ago

Nonvectorized for loops in parallel iterators. The iterator itself might assume that vectorization is not occurring in this loop.

I am observing this case with code in the Random module:

https://github.com/chapel-lang/chapel/blob/3c724d043d9c3551761bd6752e12401a1b9427bb/modules/standard/Random.chpl#L2838-L2863

In particular the loop

          for i in innerRange do
            yield randlc(resultType, cursor);

has a loop-carried dependency and is not order-independent at all. (Note that randlc has inout intent for the second argument.) This makes me think that the loops inside of iterator implementations need to opt in to being vectorizable somehow (say, with vectorizeOnly) or to opt out of it (with some other mechanism).

The other alternative is for the leader iterator to generate a different follower per vector lane.

I think that the compiler code here:

https://github.com/chapel-lang/chapel/blob/3c724d043d9c3551761bd6752e12401a1b9427bb/compiler/resolution/lowerIterators.cpp#L1483-L1508

is doing this and it came from PR #1490.

bradcray commented 4 years ago

It seems to me that, to the extent that forall implies "able to be vectorized", it's really saying something about the user's code in the body of the loop, not the underlying implementation of the loop iterators themselves. So, for example, in:

forall i in myParIter(1..n) do
  A[p1[i]] = B[p2[i]];

the forall would suggest that there aren't any dependencies between the permutation vectors p1 and p2 that should break vectorization, but not anything about the loops used to implement myParIter() which I think may or may not be amenable to vectorization through normal analysis, independent of the loop body (e.g., ignoring what happens after the yield point). So if myParIter() just had loops like for i in mylo..myhi do yield i; it would be vectorizable, but if it had a loop like the one in a chained random number generator, it wouldn't be.
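For concreteness, here is a hedged sketch of the two kinds of iterator bodies being contrasted (myParIter, chainedRandIter, and the constants are invented names, not code from the repository):

```chapel
// A plain counted loop: no yielded value depends on a previous
// iteration, so the loop implementing the iterator is itself amenable
// to vectorization.
iter myParIter(r: range) {
  for i in r do
    yield i;
}

// A chained random-number-style loop: each yielded value is derived
// from the previous one, a loop-carried dependency that blocks
// vectorization of the iterator's own loop.
iter chainedRandIter(n: int, seed: uint) {
  var cursor = seed;
  for 1..n {
    cursor = cursor * 1103515245 + 12345;  // depends on the prior iteration
    yield cursor;
  }
}
```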

Where I think my original characterization breaks down is in the presence of task-local variables caused by with-intents like in, reduce, var/const, etc. since those could similarly be used to generate per-task loop-carried dependences, like:

forall i in myParIter(1..n) with (var randCursor = seedRandCursor()) do
  A[p1[i]] = getNextRand(randCursor);
mppf commented 4 years ago

@bradcray - thanks for your response

which I think may or may not be amenable to vectorization through normal analysis

I'm not sure what normal analysis refers to specifically. Today, I think the Chapel compiler front-end relies primarily upon the user indicating whether or not something is parallel. We don't have lots of logic to check for order independent loops etc. There are many cases in which a C compiler (or LLVM opts) will vectorize, but that does not help with my current task. What I'm trying to do is to adjust the Chapel compiler to do a better job of giving vector hints to LLVM and RV. I want to give hints in the situations that the user has effectively requested vectorization.

We could certainly check for some easy cases - like the case in which the follower iterator has a simple loop along the lines of for i in mylo..myhi do yield i. But I don't think we want the Chapel compiler front-end to get into checking for loop-carried dependencies.

For this reason - and also the fact that implementing a parallel leader/follower iterator is a relatively advanced feature - I am leaning towards only having the Chapel compiler give vectorization hints for such loops in the event that the iterator author somehow opts in to it. For example by writing the loop with vectorizeOnly.

But even vectorizeOnly is not quite right, because it depends on how the iterator is invoked whether or not that loop can be marked for vectorization. For example, in serial zippering we can have follower iterators run as part of a for loop. What the iterator author needs to indicate is that the loop in the iterator itself could be vectorized. If the use of the iterator is a vectorizeOnly or forall then it would in fact be vectorized; but if it was a (zippered) for it would not.

Does having iterator authors indicate this information make sense to you? (Whether or not we choose something along the lines of vectorizeOnly to do so?)

mppf commented 4 years ago

Where I think my original characterization breaks down is in the presence of task-local variables caused by with-intents like in, reduce, var/const, etc. since those could similarly be used to generate per-task loop-carried dependences, like:

forall i in myParIter(1..n) with (var randCursor = seedRandCursor()) do

Yes, we already have analysis to look for this sort of thing. We basically disable vectorization in such cases but another interesting approach would be to create a different copy of randCursor for each vector lane.

bradcray commented 4 years ago

I'm not sure what normal analysis refers to specifically.

In writing this, I was thinking of auto-vectorization in the back-end compiler (though the Chapel compiler could potentially help with this kind of analysis as well, with effort). More broadly, I meant "normal auto-vectorization analysis and optimization" (whoever does it).

in serial zippering we can have follower iterators run as part of a for loop.

By serial zippering, do you mean for (a,b) in zip(A,B) ... (which I don't think should ever invoke a follower iterator, unless the iterator author chose to have their serial iterator forward to the follower iterator?) or serial true forall (a,b) in zip(A,B) ... (which probably would, but that might arguably be OK since it's still within a forall loop).

I'm still OK with having a vectorizeOnly() style iterator in the language/library and making use of it in cases like this (assuming we can make it behave appropriately, obviously).

mppf commented 4 years ago

By serial zippering, do you mean for (a,b) in zip(A,B) ... (which I don't think should ever invoke a follower iterator, unless the iterator author chose to have their serial iterator forward to the follower iterator?) or serial true forall (a,b) in zip(A,B) ... (which probably would, but that might arguably be OK since it's still within a forall loop).

Sounds like I was wrong about that. (I was talking about the first case).

I'm still OK with having a vectorizeOnly() style iterator in the language/library and making use of it in cases like this (assuming we can make it behave appropriately, obviously).

I'll have a look at doing this.

bradcray commented 4 years ago

Sounds like I was wrong about that. (I was talking about the first case).

We have talked about introducing serial leader/followers to implement zippered for loops as a means permitting users to write unbounded follower iterators that are happy to conform to the leader (like, if a user wanted to write their own unbounded range which will yield as many things as the leader needs), but haven't gone down that path yet. When we have discussed that, it's been theorized that the same follower iterator might be usable for both zippered for and forall loops, so your point could suggest that that was naive in a forall-implies vectorization world; or it might suggest that the follower iterator should get passed some sort of cue as to whether it's being called in a for or forall context...?

mppf commented 4 years ago

When we have discussed that, it's been theorized that the same follower iterator might be usable for both zippered for and forall loops, so your point could suggest that that was naive in a forall-implies vectorization world; or it might suggest that the follower iterator should get passed some sort of cue as to whether it's being called in a for or forall context...?

Well, the vectorization will only make sense (AFAIK) if iterator inlining occurs. In that event, the compiler can see both the loop in the follower iterator and the loop invoking it; and it could mark the generated loop as not-vectorizeable if the invoking loop is a for - even if the follower iterator had a vectorizeOnly loop. But as you say, this is not something we have to worry about right away.

mppf commented 4 years ago

I can't literally use vectorizeOnly because it is defined to always run the serial iterator. The common case in which I need to call it is inside a standalone or follower iterator that forwards to another standalone or follower, so I would need something that runs the standalone or follower iterator instead.

bradcray commented 4 years ago

I'm not understanding. I thought the idea was to apply vectorizeOnly() to the (currently) serial for loops within standalone or follower iterators (say). So where the range's follower iterator currently says something like:

          for i in low..high by stride {
            yield i;
          }

(where low, high, and stride come from a range r), I thought the plan was to make it say:

          forall i in vectorizeOnly(low..high) by stride {
            yield i;
          }

In your edit, you mention the case of forwarding to another standalone or follower as being the problematic case—can you make that more concrete?

mppf commented 4 years ago

For example we have an iterator forward in ChapelArray here:

https://github.com/chapel-lang/chapel/blob/f019df37d7b877b778e9bace663131d16fba5670/modules/internal/ChapelArray.chpl#L2875-L2888

Or this simple example from my own tests:

iter testveciter(param tag: iterKind, followThis) where tag == iterKind.follower
{
  for i in (1..n).these(tag=tag, followThis=followThis) {
    yield i;
  }
}

This forwarding pattern, which comes up with standalone and follower iterators, is the problematic case.

Perhaps this iterator forwarding pattern (which is identifiable by the tag argument being passed) could propagate the order-independent-ness from the forwarded-to iterator. I don't think it would be general enough to rely on the body of the loop consisting entirely of yield i. Additionally I can imagine a case where one might want to do the iterator forward but also have a non order independent loop. For example, what if the follower iterator for something looked like this:

// imagine serial/standalone/leader iterators too
iter myiter(param tag: iterKind, followThis) ref
where tag == iterKind.follower {
  var cursor: real;
  for i in _value.these(tag, followThis) {
    cursor = nextcursor(cursor);
    yield (i, cursor);
  }
}

So I think even in an iterator forwarding situation, the author of the iterator using another iterator needs to be able to turn off order-independent-ness of the loop. Perhaps they need to always opt in to vectorization (so not indicating order-independent-ness would turn it off); or perhaps the default should be based upon the forwarded-to iterator, with a way to explicitly turn it off.

Conceptually, I wish that we had something that could mark a for loop as order independent and not have any other significant impact (semantically or syntactically). For example, if we had for and vecfor keywords; then these for loops that are themselves order independent could simply be marked with vecfor, both in current vectorizeOnly use cases as well as in these iterator forwarding scenarios. (Adding a new keyword isn't necessarily an appropriate solution here; the point is just that it would make it trivially easy for these loops to indicate order-independent-ness).

I suppose we could consider changing forall i in vectorizeOnly(...) to do nothing and be the same as forall i in .... I think this would allow vectorizeOnly to be used in these iterator forwarding scenarios. Or, we could add a new variant similar to vectorizeOnly, like vectorizeOK, that doesn't change whether something is parallel but does indicate order independence. But iterator forwarding isn't particularly obvious/intuitive and adding vectorizeOnly to it will only make it harder. Additionally I have some concern about the impact to compile time from adding one level of indirection to most of our iterators.

Looking at uses of vectorizeOnly in the tests, I can see one place where making forall vectorizeOnly the same as forall would make it harder to write a pattern:

https://github.com/chapel-lang/chapel/blob/f019df37d7b877b778e9bace663131d16fba5670/test/release/examples/benchmarks/lcals/RunVectorizeOnlyRawLoops.chpl#L472-L475

Here it appears that forall is used with vectorizeOnly so that it can include a reduction. Perhaps that is evidence that for loops should allow intents.
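A hedged reconstruction of that pattern (variable names invented; assumes vectorizeOnly as it existed at the time of this discussion, where no tasks are created and the serial iterator runs, but the forall form permits the intent):

```chapel
// forall + vectorizeOnly: the loop runs serially but is marked as a
// vectorization candidate, and the forall form allows the reduce
// intent, which a plain for loop does not.
var x: [0..#1000] real = 1.0;
var sum = 0.0;

forall i in vectorizeOnly(0..#1000) with (+ reduce sum) do
  sum += x[i];
```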

Anyway I am not sure what a reasonable approach is here. In the implementation effort, I think I will take the near-term step of relying on a function-level pragma on key internal iterators.

mppf commented 4 years ago

(Posting after discussion with @ronawho). It sounds like @ronawho, @bradcray, and I are on the same page: the iterator authors are responsible for marking the loops in the iterator as order independent when appropriate.

I'm thinking that the compiler will include a simple analysis that finds loops like for i in _ {yield i;} and marks those as order independent (the key here being that the body of the loop is just a yield). This will cover a lot of the iterator forwarding cases we have.

Then, we'll need some kind of syntax for marking a loop as order independent more generally. vectorizeOnly isn't quite the right thing because 1) the iterator author needing to use it isn't exactly opting in to vectorization and 2) it currently only runs serial iterators.

So what might the syntax be? I think it'd be nice if the syntax made it easy to mark iterator forwarding and zippered iteration cases. How would the following iterator forwarding and zippered cases be marked order independent?

for i in (1..n).these(tag=tag, followThis=followThis)
for (i,j) in zip(expr(), b.c.iterate(x))

orderIndependent call wraps the iterable part

for i in orderIndependent((1..n).these(tag=tag, followThis=followThis))
for (i,j) in orderIndependent(zip(expr(), b.c.iterate(x)))

orderIndependent call wraps the iterable part but replaces zip

for i in orderIndependent((1..n).these(tag=tag, followThis=followThis))
for (i,j) in orderIndependent(expr(), b.c.iterate(x))

change for keyword to forall(serial)

forall(serial) i in (1..n).these(tag=tag, followThis=followThis)
forall(serial) (i,j) in zip(expr(), b.c.iterate(x))

change for keyword to serial forall

serial forall i in (1..n).these(tag=tag, followThis=followThis)
serial forall (i,j) in zip(expr(), b.c.iterate(x))

loop annotation

@order-independent
for i in (1..n).these(tag=tag, followThis=followThis)
@order-independent
forall (i,j) in zip(expr(), b.c.iterate(x))

(see also #14141)

for is always order independent in an iterator; use while otherwise

for i in (1..n).these(tag=tag, followThis=followThis)
for (i,j) in zip(expr(), b.c.iterate(x))

but... how to write these using while is not so clear.

(other ideas?)

e-kayrakli commented 4 years ago

My two cents:

pragma "order independent"
for i in (1..n).these(tag=tag, followThis=followThis)

Looks better to me.

bradcray commented 4 years ago

I see a vectorized loop much closer to for than forall, so I am not a big fan of forall-based options.

Engin, can you say more about why this is? What makes it seem more for-like than forall-like? Is it simply that vectorizing compilers have typically taken the approach of auto-vectorizing serial for loops, or is there more to it than that?

On a side note, I wonder whether it would be possible to extend our pragma support for such cases:

I think others know this about me, but whenever our (or another language's) solution to something is "we'll do this through pragmas", it makes me think "they stopped designing the language too soon."

bradcray commented 4 years ago

Regarding Michael's comments on the forwarding cases a few weeks back:

mppf commented 4 years ago

would having a yieldall concept in the language (rather than using a serial for loop to forward a potentially parallel iterator) help with this situation at all?

Sure it would, because a very common situation in these iterators is to forward to another iterator. Supposing you would write something like yieldall otherIterator - you might not even need to write a follower iterator. If you did need to write a follower iterator then maybe you could write yieldall (1..n, followThis=followThis) (or something) and the yieldall part would make it clear that the order-independent-ness should be taken from the other iterator. (Unlike for i in (1..n).these(tag=tag, followThis=followThis) it does not imply serial execution).

But, to a large extent, the pattern matching on iterator forwarding covers this case OK:

I'm thinking that the compiler will include a simple analysis that finds loops like for i in _ {yield i;} and marks those as order independent (the key here being that the body of the loop is just a yield). This will cover a lot of the iterator forwarding cases we have.

mppf commented 4 years ago

Could the "sometimes vectorizable / sometimes not" aspect of an iterator be indicated via an explicit param argument that the author of the iterator builds into their implementation? The main drawback I can think of is that I want to forward to an iterator that didn't think to build that capability into itself and uses the wrong vectorization choice for me. But then presumably I shouldn't/couldn't forward to that iterator to begin with. Is that inherently problematic?

I'm not sure what you're referring to specifically. But generally speaking I don't think this situation comes up that often. And anywhere we could pass a param argument we could make a different overload depending on it, so that the loops are either order-independent or not. That said - I think I'd need to see an example to understand the idea better.

e-kayrakli commented 4 years ago

Engin, can you say more about why this is? What makes it seem more for-like than forall-like? Is it simply that vectorizing compilers have typically taken the approach of auto-vectorizing serial for loops, or is there more to it than that?

I think that's somewhat related to where I come from.

At a fundamental level forall being a data-oriented parallel loop implies to me that it is locality-aware. From more of a Chapel implementer perspective, it is a loop where a special iterator on a data type is invoked that may do many custom things as well as moving tasks around.

OTOH, the parallelism implied by vectorization is almost a sequential operation from a high-level programming language's perspective (and this is similar to what you are saying). It is only parallel by virtue of vector registers and not some task/thread-level parallelism introduced by Chapel's language stack.

simoll commented 4 years ago

OTOH, the parallelism implied by vectorization is almost a sequential operation from a high-level programming language's perspective (and this is similar to what you are saying). It is only parallel by virtue of vector registers and not some task/thread-level parallelism introduced by Chapel's language stack.

My 2 cents: You probably want two different ways to drive vectorization:

  1. Treating vectorization as a transformation for serial loops (maybe aided by some additional info on non-aliasing pointers, independent loop iterations). This is basically what @e-kayrakli is suggesting.
  2. A forall loop that treats the loop body as SIMT-parallel code. Vectorization in this setting basically means executing the instructions of the loop body in parallel with a lock-step schedule. Unlike option 1, this is not just a hint and not even a loop construct anymore - it's SIMT programming.

Option 1. helps the compiler to auto-vectorize loops that have serial semantics. I see the appeal but it's also limiting because you are stuck with serial loop iteration semantics. Option 2. allows you to do ISPC/RV - like programming where you can have warp shuffles, reductions across threads realized with SIMD instructions, etc. This is a powerful programming model that is not possible with Option 1 alone.

mppf commented 4 years ago

@simoll - thanks very much for your comment! It is very helpful.

I was also thinking about how to respond to @e-kayrakli's post because I feel it is starting to take sides on an issue that we have so far not really engaged with. The question is - for a Chapel user - is a vector lane within a forall loop a task?

I don't think this is something we've really debated yet. It is also something that is typically answered one way in a CPU/SIMD world and a different way in a GPU/SIMT one.

The "Note about task-local storage" in the issue description gets to part of it. The language-facing features that impact it are forall intents - specifically in intent (which one might view as creating task local storage) and reductions.

We will need in and reduce intents on a forall to function as we improve vectorization. In order for that to work, there are two strategies:

  1. Disable vectorization (or part of vectorization) when these features are used.
  2. Change whatever we do now per-task to instead be also per-vector-lane (e.g. each vector lane gets its own reduction state; each vector lane gets its own copy of in intent variables).

I think that in order to get to reasonably good GPU code generation, we have to choose 2 above.
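As a scalar model of strategy 2 (values invented): a reduce intent today keeps one private accumulator per task and combines them when the loop ends; per-vector-lane treatment would do the same at lane granularity.

```chapel
config const n = 1000;
var A: [1..n] int = 1;
var total = 0;

// Each task keeps a private accumulator that is combined into total at
// the end of the loop; strategy 2 would give each vector lane such a
// private accumulator as well.
forall i in 1..n with (+ reduce total) do
  total += A[i];

writeln(total);  // n
```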

This gets into a philosophical question of whether or not we would call each vector lane a task. I would personally lean towards "yes" - I would consider them a restricted sort of task. However we could do 2 above without doing that, because the language itself is not tied to the term "task-local variable" for an in intent on a forall (say). (Of course, either way, there are documentation changes to do.)

bradcray commented 4 years ago

Generally speaking, I'm open to thinking of vector lanes as tasks and having current "task-local" concepts extend to vector lanes. That said, I'm not an expert in vectorization, so this is just my gut reaction, not a particularly well-informed thought.

In any case, I don't think we should feel particularly constrained to thinking of a task as "something that's executed by a qthread" (when CHPL_TASKS=qthreads), but should think more generally about a task as "some computation that could/should be done in parallel with other tasks" by whatever mechanism.

bradcray commented 4 years ago

Capturing a few stray thoughts after discussing this comment with @mppf:

iter foo() {
  for i in orderIndIter() do
    yield myData[i];
}

would not be a candidate for yieldall or automatic propagation because even though the iterator being invoked may be known to be order independent, the fact that we're not doing just yield i but yield ...some function of i... means that the iterator author would need to step in and assert that this is still order-independent. As a counter-example, if I wrote:

iter foo() {
  for i in orderIndIter() do
    yield infile.read(int);
}

I'm still invoking the same order-independent iterator, but presumably can't have the iterations execute in arbitrary order due to the use of a file. This was a helpful example to think about.

mppf commented 3 years ago

The recent part of this discussion (indicating order independence without creating qthreads-type tasks) is happening in #16404.

#16403 discusses an overall strategy direction for vectorization support.