chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Multidimensional zippered iteration (or promotion) kills performance #13147

Open bradcray opened 5 years ago

bradcray commented 5 years ago

[This is a longstanding issue that I'm capturing on GitHub since I couldn't find it here]

When iterating over multiple multidimensional arrays using zippered iteration, performance currently falls off a steep cliff. For example:

var A, B: [1..n, 1..n] real;
forall (a, b) in zip(A, B) do
  a = 2*b;

This can also happen when calling a scalar function in a promoted manner with multidimensional arrays, since it is equivalent to a zippered iteration:

foo(A, B);
proc foo(x: real, y: real) { ... }

The reason for this is that the compiler currently doesn't know how to combine the parallel iterators for multidimensional arrays into a single compact loop nest, so it falls back to the catch-all implementation of creating an iterator object for each array and iterating over those objects in a zippered manner. While this results in a correct and very general-purpose implementation for arbitrary array iterators, it is also very expensive.
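
Conceptually, the fallback looks something like the following sketch (pseudocode only, not the actual generated code; the names are hypothetical stand-ins for the compiler-generated iterator-class machinery). Each iterand gets its own iterator object, and the multidimensional control flow is hidden inside per-object advance calls driven by a single 1D loop:

// Hedged pseudocode of the catch-all zippering; ic1/ic2, more(),
// advance(), and value() are hypothetical names:
// var ic1 = A.these(), ic2 = B.these();  // one iterator object per array
// while more(ic1) {
//   advance(ic1); advance(ic2);          // 2D control flow buried in here
//   ref a = value(ic1); ref b = value(ic2);
//   a = 2*b;                             // loop body
// }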

Until this situation is improved, users can often work around the issue for cases that share the same domain (like above) by looping over the domain and indexing into the arrays to avoid the zippered iteration:

forall ind in A.domain do
  A[ind] = 2*B[ind];

or:

forall ind in A.domain do
  foo(A[ind], B[ind]);

LouisJenkinsCS commented 5 years ago

So, I was messing around with a prototype that checks at runtime whether the domains match and, if so, indexes via the domain, and I'm getting some weird results.

use Time;

// Check at runtime whether the domains match; if so, iterate over one
// domain and index into both arrays, otherwise fall back to zippering.
iter zipper(A1 : [?D1], A2 : [?D2]) ref {
    if D1.low == D2.low && D1.high == D2.high && D1.stride == D2.stride {
        for idx in D1 do yield (A1[idx], A2[idx]);
    } else {
        for (a1, a2) in zip(A1, A2) do yield (a1, a2);
    }
}

iter zipper(A1 : [?D1], A2 : [?D2], param tag : iterKind) ref where tag == iterKind.standalone {
    if D1.low == D2.low && D1.high == D2.high && D1.stride == D2.stride {
        forall idx in D1 do yield (A1[idx], A2[idx]);
    } else {
        forall (a1, a2) in zip(A1, A2) do yield (a1, a2);
    }
}

config const N = 128 * 1024 * 1024;
var D = {1..N};
var A1 : [D] int = 1..#D.size;
var A2 : [D] int = 1..#D.size;
var timer = new Timer();

timer.start();
for (a1, a2) in zip(A1, A2) {
    a1 += a2;
}
timer.stop();
writeln("Serial 'zip': ", timer.elapsed(), "s");
timer.clear();

timer.start();
forall (a1, a2) in zip(A1, A2) {
    a1 += a2;
}
timer.stop();
writeln("Parallel 'zip': ", timer.elapsed(), "s");
timer.clear();

timer.start();
for (a1, a2) in zipper(A1, A2) {
    a1 += a2;
}
timer.stop();
writeln("Serial 'zipper': ", timer.elapsed(), "s");
timer.clear();

timer.start();
forall (a1, a2) in zipper(A1, A2) {
    a1 += a2;
}
timer.stop();
writeln("Parallel 'zipper': ", timer.elapsed(), "s");
timer.clear();

timer.start();
for idx in D {
    A1[idx] += A2[idx];
}
timer.stop();
writeln("Serial Domain Indexing: ", timer.elapsed(), "s");
timer.clear();

timer.start();
forall idx in D {
    A1[idx] += A2[idx];
}
timer.stop();
writeln("Parallel Domain Indexing: ", timer.elapsed(), "s");
timer.clear();

for (x1, x2) in zipper(A1[1..10], A2[1..10]) do writeln((x1,x2));
for (x1, x2) in zip(A1[1..10], A2[1..10]) do writeln((x1, x2));


In the output you can see that both zip and zipper (the custom iterator I made) produce the same results. What is weird is that the serial and parallel iterations over the domain with explicit indexing appear to be a bit slower than the zipper iteration, and what is even weirder is that zipper achieves those lower execution times despite doing exactly the same thing once the domain check on its arguments succeeds.

Obtained on my dual-core MacBook:

Serial 'zip': 18.7566s
Parallel 'zip': 11.0116s
Serial 'zipper': 0.70649s
Parallel 'zipper': 0.373361s
Serial Domain Indexing: 7.75447s
Parallel Domain Indexing: 1.29738s

What is happening here? I think this shows that checking the domain at runtime is valid and viable for performance, but also that there are other subtle performance issues lurking. @bradcray

LouisJenkinsCS commented 5 years ago

It seems that 1D array stuff is fine:

use Time;

iter domainQueryIteration(A : [?D] int) ref {
    for i in D do yield A[i];
}

iter domainQuery(A, D) ref {
    for i in D do yield A[i];
}

config const N = 128 * 1024 * 1024;
var D = {1..N};
var A1 : [D] int = 1..#D.size;
var A2 : [D] int = 1..#D.size;
var timer = new Timer();

timer.start();
for a in domainQueryIteration(A1) {
    a += 1;
}
timer.stop();
writeln("Serial 'domainQueryIteration': ", timer.elapsed(), "s");
timer.clear();

timer.start();
for a in domainQuery(A1, D) {
    a += 1;
}
timer.stop();
writeln("Serial 'domainQuery': ", timer.elapsed(), "s");
timer.clear();

timer.start();
for idx in D {
    A1[idx] += 1;
}
timer.stop();
writeln("Serial domain iteration: ", timer.elapsed(), "s");
timer.clear();

Iterating over the domain directly is very fast, but it gets weird when we query the domain from an array: doing so seems to add an order of magnitude of overhead from somewhere. Any thoughts, @bradcray?

Obtained on my dual-core MacBook:

Serial 'domainQueryIteration': 14.8582s
Serial 'domainQuery': 1.48121s
Serial domain iteration: 1.57878s

LouisJenkinsCS commented 5 years ago

Would there be any issues with a runtime-assisted approach? That is, add runtime checks to determine which path to take: either the old zip behavior, or iterating directly over the domain when the iterands share the same domain, as in zipper. The overhead would be minimal, and very likely dwarfed by the execution time of the actual zippered iteration. This approach could significantly reduce the reliance on the compiler, and could be 'swapped out' later once a compile-time optimization is in place.

bradcray commented 5 years ago

I'm not sure what the cause of your performance mysteries is. You might try moving all of the code into a local scope, since the compiler treats things at module scope specially, and I've definitely had performance experiments give surprising results because of that special behavior (and I think code at local scope is arguably better behaved).
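
For instance, a minimal sketch of that experiment (reusing the declarations from the benchmark above): wrapping everything in main() moves it from module scope to local scope:

use Time;

config const N = 128 * 1024 * 1024;

proc main() {
  // Arrays and timer now live at local (function) scope rather than
  // module scope:
  var A1, A2: [1..N] int;
  var timer = new Timer();

  timer.start();
  forall (a1, a2) in zip(A1, A2) do a1 += a2;
  timer.stop();
  writeln("Parallel 'zip': ", timer.elapsed(), "s");
}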

I think this approach is intriguing, though it would be more challenging to implement in the compiler than by hand as you have. The part of me that wants to stop sweeping things like this under the carpet is at war with the part of me that hates that our performance is so bad on codes like this.

Minor, but note that comparing low, high, and stride without comparing aligned isn't a guarantee that the domains are the same (e.g., 1..10 by 2 and 1..10 by 2 align 2). I think a simple == on the two domains would be the way to go.
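
A minimal sketch of that change to the zipper above, with the bounds/stride comparison replaced by domain equality (which also accounts for alignment):

iter zipper(A1 : [?D1], A2 : [?D2]) ref {
    if D1 == D2 {  // also distinguishes 1..10 by 2 from 1..10 by 2 align 2
        for idx in D1 do yield (A1[idx], A2[idx]);
    } else {
        for (a1, a2) in zip(A1, A2) do yield (a1, a2);
    }
}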

LouisJenkinsCS commented 5 years ago

Cool! I'm glad you responded; I was just about to submit a PR to include this as a performance test (so that the "sweeping things under the rug" part of you won't be able to ignore it for much longer :) ). I'll use the == comparison in the test and @-mention you.

bradcray commented 5 years ago

so that "sweeping things under the rug part" of you won't be able to ignore it for much longer :)

When you've got infinite things to do, you can always ignore several of them longer!

bradcray commented 5 years ago

If module vs. local scope does result in a performance difference, I'd have the performance test include both idioms, as (ultimately) I think we want to reduce the performance differences between these scopes whenever possible.

LouisJenkinsCS commented 5 years ago

I didn't even know you could mark a comment as 'off-topic'

mppf commented 4 years ago

I was just discussing this issue with @ronawho - I was asking why the compiler doesn't know how to combine the parallel iterators. It sounds like it has to do with zippering arrays of different shape, for example a 2x3 array zippered with a 3x2 array. In that case, we could check that the shape matches and proceed with the fast implementation.

It's my current view that we haven't really defined what happens when you zipper between different types - that has its own issue (#11504). Furthermore, I think that we'll ban zippering a 2x3 array with a 3x2 array - because the zippering will be defined on the indices in their domains, and the domains must match. Along with that, I think that there will be some way to opt in to flattening the index space (that would allow them to be zipped) - e.g. linearize in https://github.com/chapel-lang/chapel/issues/11504#issuecomment-434313220. So we would still have the performance problem with the flattened call, but that would almost certainly come up less frequently.
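
For illustration, under those proposed rules (linearize is the proposal from the #11504 discussion, not an existing routine):

var A: [1..2, 1..3] int;
var B: [1..3, 1..2] int;

// zip(A, B)                        // would be an error: the domains differ
// zip(linearize(A), linearize(B))  // proposed opt-in: zip the flattened,
//                                  // element-order index spaces instead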

bradcray commented 4 years ago

I don't view zippering of arrays with different shapes as being the issue (and agree that it should be illegal to zip two multidimensional arrays with different shapes without taking some sort of action like linearize). I view the main issue as being how to take two distinct iterators with potentially completely distinct control structures and fuse them into a single coherent loop nest in the generated code. E.g., just because two iterators are looping over an m x n array, it isn't guaranteed that they'll both do it with a doubly-nested loop nest where the outer loop is m trips and the inner is n. One could use a single loop; one could use a 4-deep loop if it was tiling; one could use a 2-deep loop in which the number of trips per loop weren't m and n respectively.
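
For illustration, here are hedged sketches of three serial iterators that all traverse an m x n index space with incompatible control structures:

iter rowMajor(m: int, n: int) {
  // the "expected" doubly-nested shape: m trips outside, n inside
  for i in 1..m do
    for j in 1..n do
      yield (i, j);
}

iter flattened(m: int, n: int) {
  // a single loop over all m*n positions
  for k in 0..#(m*n) do
    yield (k/n + 1, k%n + 1);
}

iter tiled(m: int, n: int, param tile = 4) {
  // a 4-deep nest from tiling (assumes tile evenly divides m and n)
  for i0 in 1..m by tile do
    for j0 in 1..n by tile do
      for i in i0..#tile do
        for j in j0..#tile do
          yield (i, j);
}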

We might be able to handle many common cases by focusing on the loops in our main data structures (DR domains and arrays), which is arguably what we've done for 1D zippering as well, but this is the challenge in the general case as I see it.

mppf commented 4 years ago

The leader/follower paper has some thoughts in its Future Work section:

One ... limitation is related to zippered iteration over multidimensional data structures. Because zippered serial iteration is implemented via methods on an object, the multidimensional control flow is pushed into the object's single advance method, which is then zippered with all of the other objects' methods in a 1D loop. This results in a suboptimal control structure compared to the loop nest a programmer would manually write for a multidimensional zippered iteration. To that end, we anticipate extending the leader-follower interface to support distinct iterators for each dimension. Multidimensional loop nests would then be generated by zippering each dimension's iterators separately.

Having per-dimension followers seems like an interesting way to solve this problem by construction.

Edit: It might make specifying iterators more complex, but then there are quite a few iterator cases that are 1D, and the multidimensional ones tend to follow particular patterns.
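
A speculative sketch of how that might look (followDim is hypothetical; nothing like it exists today): rather than one follower owning the whole 2D loop nest, the compiler would emit the nest itself and zipper each dimension's followers separately:

// for (i1, i2) in zip(A.followDim(0, rows), B.followDim(0, rows)) do
//   for (j1, j2) in zip(A.followDim(1, cols), B.followDim(1, cols)) do
//     // body sees one element pair per (i, j) position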

mppf commented 3 years ago

It's my current understanding that Chapel iterators assume (by convention) that followThis will be iterated over in row-major order. As a result, I think we could have a solution where the compiler can fuse follower iterators with a certain structure. What is missing IMO is an easy way for the compiler to detect whether or not an iterator is "normal" enough to allow fusing.
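
For reference, a hedged sketch of the "normal" shape in question, for a hypothetical 2D wrapper type (wrap2D and myData are made up; the serial and leader overloads are omitted). The follower receives followThis as a tuple of ranges and traverses them in row-major order with a plain, analyzable loop nest:

record wrap2D {
  var m, n: int;
  var myData: [0..#m, 0..#n] real;  // 0-based so the densified
                                    // followThis ranges map directly

  iter these(param tag: iterKind, followThis) ref
      where tag == iterKind.follower {
    const (rows, cols) = followThis;
    for i in rows do        // row-major order, by convention
      for j in cols do
        yield myData[i, j];
  }
}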

e-kayrakli commented 8 months ago

I wanted to understand the current limitations here, and I want to record a 15-minute effort to help get started when this gets picked up.

I used

use Math;
use Time;

config param use2D = false;
config const numElems = 2**28;

var Dom;
var firstIdx;
if use2D {
  const size = sqrt(numElems):int;
  Dom = {0..#size, 0..#size};
  firstIdx = (0,0);
}
else {
  Dom = {0..#numElems};
  firstIdx = 0;
}

var Arr1: [Dom] int, Arr2: [Dom] int;
var t: stopwatch;

t.start();
forall (a1, a2) in zip(Arr1, Arr2) {
  a1 = 2*a2;
}
t.stop();

writeln(Arr2[firstIdx]);
writeln("Time (ms): ", 1000*t.elapsed());

that shows a noticeable drop going from 1D to 2D for the same amount of data.

I dropped this check, which I believe is what prevents inlining here.

I bumped into a segfault here, where the aggregate type (an iterator class) doesn't have the expected super and more fields. I couldn't whip up a quick solution to make progress, and I don't have a good understanding of what that iterator class is supposed to be.

So, while I think it should be achievable to inline zippered iterators of the same type, there are some assumptions in lowerIterators that doing so breaks, and that part also needs fixing.

e-kayrakli commented 8 months ago

I took another more serious look this time. I want to record where I am before the weekend. The first part has nothing new, but it always takes me some time to remember the details of the current implementation and challenges.

If all iterands look like this, inlining is relatively easy -- you just call the functions in order in the correct places.

iter bar() {
  for[each] ... {
    for[each] ... {
      yield ...;
    }
  }
}

Structures more complicated than that cannot be handled with zip1, zip2, etc. (see below). So, the compiler doesn't inline zippered loops like this:

iter baz() {
  for[each] ... {
    if flag {
      yield ...;
    }
    else {
      yield ...;
    }
  } 
}

I attempted iterator fusion, as Michael suggested above and as we've discussed occasionally in performance meetings. I am focusing on zippering the same iterator for the time being. The implementation is a bit more complicated than I thought.

Consider:

iter foo() {
  for i in 1..10 { yield i; }
}

for (x,y) in zip(foo(), foo()) { }

Here, the compiler generates a CForLoop ad hoc at the invocation site, where the loop looks like:

for(i1=1, i2=1 ; i1<=10 ; i1+=1, i2+=1) { }

In other words, it is the concatenation of the Chapel loops. But at this point, the inner parts of the iterator foo have already been chunked up into zip1 et al. In fact, lowerIterators adds calls to those functions, which are then inlined normally during function inlining.

So, if we have:

iter bar() {
  for i in 1..10 { 
    for j in 1..10 {
      yield i+j;
    }
  }
}

for (x,y) in zip(bar(), bar()) { }

The inner loops end up in different iterator classes' methods by the end of lowerIterators and it is too late to do much about it really.

My attempt is to turn the code above into:

iter fusedBarBar() {
  for (i1,i2) in zip(1..10, 1..10) {
    for (j1, j2) in zip(1..10, 1..10) {
      yield (i1+j1, i2+j2);
    }
  }
}

for (x,y) in fusedBarBar() { }

The first roadblock I can anticipate is that for loops still use tuples to represent zippered iterands. I started weaning us off of that in https://github.com/chapel-lang/chapel/pull/17212, but wasn't able to finalize the work to extend it to fors: https://github.com/Cray/chapel-private/issues/1784#issuecomment-870777586. What that implies is that writing for... zip... during lowerIterators is not easy. It would likely require generic tuple support, and by that point in compilation, generics have already gone away. I can try picking that zip overhaul back up.

Another way of addressing the issue could be fusing iterators earlier. While that might work for for loops, foralls are lowered later on. So, by the end of resolution, some of the potentially problematic for... zip...s may not have been created.

Yet another potential solution is to circumvent for... zip... while creating fusedBarBar above and merge the two ForLoops directly into a CForLoop.
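
In the C-style notation of the CForLoop shown earlier, a hedged sketch of what directly merging bar()'s two ForLoops might produce:

for (i1 = 1, i2 = 1; i1 <= 10; i1 += 1, i2 += 1) {
  for (j1 = 1, j2 = 1; j1 <= 10; j1 += 1, j2 += 1) {
    // loop body runs here with (i1+j1, i2+j2) substituted for the yields
  }
}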


On a quick attempt, I observed that we don't inline the advance method we generate for non-inlined iterators. The method is huge, and I understand the hesitation to inline it. However, inlining it makes zippered 2D fors more than 2x faster on my workstation, though still about 3x slower than a 1D version with the same trip count.