chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Array Slicing Performance #8203

Closed thomasrolinger closed 6 years ago

thomasrolinger commented 6 years ago

Summary of Problem

The performance of array slicing seems to be much slower than I would have expected. The problem was presented in this Stack Overflow post: https://stackoverflow.com/questions/48243717/array-slicing-performance-in-chapel

The particular scenario involves using array slicing to get a reference to a matrix row and then iterating over the elements in that row to perform some operation (see slices.chpl below). This procedure is repeated for every row in the matrix. Doing this appears to be 20-30x slower than omitting the array slicing step and directly indexing the original matrix by row and column (see direct.chpl below).

This issue is not so much about how to perform this type of procedure efficiently, but more about why array slicing slows the code down so much.

I've reproduced this issue on a server node with 2x 10-core E5-2650v3 Xeon Haswell processors and 512GB of memory, as well as on a MacBook Pro.

NOTE: As indicated in the output of printchplenv below, I am using POSIX threads (fifo). The Chapel team and I are looking into a separate issue, via email, regarding the performance of qthreads vs. fifo for my particular application. However, I have tested this array slicing issue with qthreads and the performance is the same as described above.

Steps to Reproduce

Source Code: slices.chpl

use Time;
var numRows = 100000;
var numCols = 35;
var D : domain(2) = {0..numRows-1, 0..numCols-1};
var mat : [D] real;
mat = 1.0;
var totalTimer : Timer;
var accum : [0..numCols-1] real;
accum = 0.0;

totalTimer.start();
for i in 0..numRows-1 {
    ref myRow = mat(i,..);         // rank-change slice: a 1D view of row i
    for j in 0..numCols-1 {
        accum[j] += i * myRow[j];
    }
}

totalTimer.stop();
writeln("Accum:");
writeln(accum);
writeln("\nTotal Elapsed time: ", totalTimer.elapsed(), " seconds");

Source Code: direct.chpl

use Time;
var numRows = 100000;
var numCols = 35;
var D : domain(2) = {0..numRows-1, 0..numCols-1};
var mat : [D] real;
mat = 1.0;
var totalTimer : Timer;
var accum : [0..numCols-1] real;
accum = 0.0;

totalTimer.start();
for i in 0..numRows-1 {
    for j in 0..numCols-1 {
        accum[j] += i * mat[i,j];  // direct 2D indexing, no slice
    }
}

totalTimer.stop();
writeln("Accum:");
writeln(accum);
writeln("\nTotal Elapsed time: ", totalTimer.elapsed(), " seconds");

Compile commands:

chpl --fast slices.chpl
chpl --fast direct.chpl

Execution commands:

./slices
./direct

Expected Output (slices):

Accum:
4.99995e+09 4.99995e+09 4.99995e+09 4.99995e+09 ...
Total Elapsed time: 0.127866 seconds

Expected Output (direct):

Accum:
4.99995e+09 4.99995e+09 4.99995e+09 4.99995e+09 ...
Total Elapsed time: 0.00515 seconds

Configuration Information

bradcray commented 6 years ago

Since the SO question was edited to include the full programs, I followed up on my answer there with a comment, but to summarize here for completeness:

Using these complete codes, I also see a larger gap between using the slice and not than I had in my own experiments, and I believe that the difference can be accounted for by the amount of work done between slices. Slicing incurs a constant, but nontrivial, amount of overhead in order to (i) compute and create the domain of the resulting array view, and (ii) create and set up the array descriptor for the view; the amount of work done between slices determines the degree to which that overhead is amortized.
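To make that amortization concrete, here's a minimal sketch (not from the original thread; repsPerRow is a hypothetical knob for the amount of work done through each slice):

config const numRows = 100000, numCols = 35, repsPerRow = 1;
var mat: [0..#numRows, 0..#numCols] real = 1.0;
var accum: [0..#numCols] real;

for i in 0..#numRows {
    ref myRow = mat(i, ..);       // slice setup cost paid once per row...
    for 1..repsPerRow do          // ...amortized over repsPerRow passes
        for j in 0..#numCols do
            accum[j] += i * myRow[j];
}

As repsPerRow grows, the fixed per-slice setup cost shrinks relative to the work done through the view.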

Using the provided codes, I don't see the precise performance gap shown here, but can easily believe that differences could be tied to differences in system, Chapel version (I'm using master), back-end compiler version, etc.

We haven't put any significant effort into tuning this code path, and there may be opportunities for tightening it up. But I don't think that trying to reproduce C-level optimizations is a good motivation for that work, particularly given my understanding that the more straightforward Chapel code results in performance comparable to the C (?).

buddha314 commented 6 years ago

In the SO question, you mentioned

Generally speaking, we tend not to encourage using array views (array slices and reindexing) in performance-critical code, and particularly not as a means of trying to reproduce scalar C-style optimizations. Our preference is to have users write the code in the cleaner / more direct style and rely on the Chapel compiler to optimize it (and/or let us know when you find cases where we're leaving performance on the floor). Chapel's array views are designed primarily as a productivity feature, for example to support the creation of a view of an array or subarray in order to meet the formal argument requirements of an existing routine.

Can you give a bit more detail on that? I've been using slices quite a bit. If there is a better way, I'd love to know it.

thomasrolinger commented 6 years ago

Great! I assumed the performance of the code was related more to how my algorithm is structured than to an actual bug in Chapel. I just wanted to have that verified.

And yes, the more straightforward Chapel code has comparable performance to C, so it wouldn't make much sense to try to optimize Chapel for this particular case.

bradcray commented 6 years ago

Can you give a bit more detail on that? I've been using slices quite a bit. If there is a better way, I'd love to know it.

As this discussion is showing, whether or not slices are worthwhile depends on how much mileage you get out of them. If you created a slice, accessed it once, and threw it away, the overhead would probably kill you pretty quickly if you cared primarily about performance. On the other end of the spectrum, if you created a slice and it lived through your whole program, you probably wouldn't notice its overhead. For the programs in this issue / SO question, the slices are not used many times before being discarded and that's impacting performance significantly. For other cases, the impacts would be more minor.

For rank-change slicing similar to what's being used in this example:

ref Arow = A[i, ..];
...Arow[j]...

The alternative would be to not create Arow and always use 2D indexing on A instead. Thus, any expression Arow[j] would be rewritten as A[i, j].

A similar case can be made for non rank-change slicing:

ref Ainterior = A[2..n-1, 2..n-1];
...Ainterior[i,j]...

Here, any Ainterior[i,j] expressions could be replaced with A[i,j] expressions, since the slice doesn't modify the indices in any way; it just restricts the virtual size / index set of the array view.

In the third case of reindexing, you could just translate between the old and new indices yourself manually...
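For instance (a minimal sketch, assuming a 1-based array viewed with 0-based indices; the names here are illustrative):

config const n = 10;
var A: [1..n] real;

// With a reindex view, A's indices 1..n appear as 0..n-1:
ref A0 = A.reindex(0..#n);
// ... A0[i] ...

// Without the view, translate the index yourself:
const offset = 1;
// ... A[i + offset] ...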

As mentioned on the issue, if the reason for creating the view is to conform to some requirement of a routine's formal argument (e.g., a library routine requires a 1D array and I want to pass in a row of my 2D array, so I'll use a rank-change slice to do so), then you don't have much choice short of rewriting the library routine.

Other than that, I think most uses of slices / array views are productivity oriented -- to have the language do more bookkeeping for you in order to support more natural expressions of computation. And that's completely valid as long as the cost of that bookkeeping doesn't outweigh the advantages.

bradcray commented 6 years ago

I'm going to close this issue based on @thomasrolinger's apparent satisfaction, though we can continue kicking the "when to use slices" discussion around here as desired.

ty1027 commented 6 years ago

I have also experienced similar slowdowns while experimenting to get better performance. Is it worth reporting such cases somewhere (here on the issue page or through the user mailing list)? The code may become slightly longer than the "minimal working examples" typical of StackOverflow, so possibly first via the mailing list and then copied to the GitHub issue page for important cases (if any)?

I've been using array slicing and whole-array notation heavily in Fortran and Python/NumPy, so writing explicit loops for dimensions with short extents is something I would like to avoid if possible. (I think the key point is that an N-dimensional array (or "tensor") can have a dimension with a short extent, even when the total array size is very large.) One typical example is the "N-body" code on the computer language benchmarks site, where tuples are used for the x/y/z degrees of freedom, while using an ordinary 2D array with coding similar to Python or Fortran makes it very slow (as in the original post of this thread).
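For reference, here's a minimal sketch of the tuple-per-particle style described above (the names are illustrative, not from the benchmark code):

config const nBodies = 1000;

// Store the short x/y/z extent as a fixed-size tuple per particle,
// rather than as the short dimension of an nBodies x 3 array.
var pos: [1..nBodies] 3*real;
var vel: [1..nBodies] 3*real;

forall i in 1..nBodies do
    pos[i] += vel[i];   // tuple arithmetic: component-wise add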

bradcray commented 6 years ago

@ty1027: It's definitely fair game to report performance mysteries and ask for help understanding them. I'd suggest starting with a GitHub issue rather than the mailing list. We get notifications when new issues are posted, so this typically saves time, puts the conversation in a public space, and avoids filling the entire mailing list's inboxes with issues that may be of narrow interest. If you think the issue might be of broader interest, you could send a mail to chapel-users or chapel-developers letting people know that the issue is there (but again, we'll automatically get a notice about it once it's filed).

With respect to your specific topic, Chapel tries to be smart about which dimension to parallelize when forall loops are applied to multidimensional domains/arrays, but is far from perfect in this regard, and some user hooks may help you adjust its defaults and get better performance. But I'll save the details for your issue.

buddha314 commented 6 years ago

That raises the question of where we should post when we want advice on how to design a particular algorithm. I'm happy to open issues if you think that works; maybe a "design advice" tag or something?

For instance, I want to implement a graph method the graphBLAS people call "minimum cost path" between two matrices A and B.

C(i,j) = A +.min B  :=  min_k [ A(i, k) + B(k, j)]

Which appears to me to be a whole lotta slicin' goin' on.

bradcray commented 6 years ago

I think such questions could go on issues as well, but when it comes to optimal algorithm design, the Chapel implementation team may not be the best resource: while we have knowledge about Chapel implementation and performance, we're not algorithm experts across broad domains. So a more public forum like chapel-users or SO may result in additional responses that an issue wouldn't; my guess is that most people outside the development team are unlikely to see a question on an issue unless they have the same question themselves.

Assuming i, j, and k are scalar indices in your code above, I don't think there's any slicing in your code above, just indexing. Slicing is when you select a sub-array using a range in one or more dimensions. But maybe I'm missing something.
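A quick sketch of that distinction (illustrative, not from the thread):

config const n = 8;
var A: [1..n, 1..n] real;

const x  = A[2, 3];       // indexing: selects a single element
ref  row = A[2, ..];      // rank-change slice: a 1D view of row 2
ref  sub = A[1..2, ..];   // slice: a 2D view of the first two rows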

buddha314 commented 6 years ago

I'm not sure how to take the max along the row without slicing.

ty1027 commented 6 years ago

@bradcray Thanks very much for your suggestion. Then I will start directly with a GitHub issue after re-trying some more with the latest Chapel version (because my experience mixes current and older versions). I will also try to make some performance comparisons with other languages (particularly Fortran + OpenMP etc.). Generally, my experience is that if I do not use array slices, Chapel is quite fast and almost matches the performance of Fortran (or is even faster in parallel cases). On the other hand, there are sometimes surprising slowdowns depending on the way a code is written, which I currently assume mostly arise from array slices and "reduce" operations. So, to write code as efficient as Fortran, I need to be pretty careful about where to use slices/reduce, or use tuples as "fixed-size" arrays (but the latter also has various weak points, so it is maybe not a general solution).

These things are sometimes a real "issue" for the compiler and sometimes just a matter of code design (or "tips"-level things), so I often wonder where to post a question or a topic about coding. The graphBLAS question that Brian mentions above is (I think) a typical case where I wonder where to post. StackOverflow does not encourage long discussions (it actually encourages short Q&As), and it has rather strict rules for content (e.g., no questions asking for external resources/manuals/libraries). On the other hand, my understanding is that the GitHub issue page is more directly for real issues with the compiler and related things. So I guess this is a gap that some online "Chapel" forum could fill (if created).

Anyway, I will file an issue after re-trying my old test codes. Thanks much!

ty1027 commented 6 years ago

I've just tried something similar to what Brian mentioned above:

use Time;
var timer = new Timer();

config const L = 100, S = 10;  // long and short extents

var a: [1..L, 1..S] real;
var b: [1..S, 1..L] real;
var c: [1..L, 1..L] real;

// dummy values
for (i1, i2) in a.domain do a[ i1, i2 ] = ( i1 + i2 );
for (i1, i2) in b.domain do b[ i1, i2 ] = 1.0 / ( i1 + i2 );

var tmp: real;

// use explicit loop
timer.clear();
timer.start();
for i1 in 1..L {
    for i2 in 1..L {
        tmp = -1.0e300;
        for k in 1..S do
            tmp = max( a[ i1, k ] * b[ k, i2 ], tmp );
        c[ i1, i2 ] = tmp;
    }
}
timer.stop();

writeln( "explicit loop:" );
writeln( "  max(c) = ", (max reduce c) );
writeln( "  time   = ", timer.elapsed() );

// use array slice + reduction
timer.clear();
timer.start();
for i1 in 1..L {
    for i2 in 1..L {
        c[ i1, i2 ] = ( max reduce ( a[ i1, .. ] * b[ .., i2 ] ) );
    }
}
timer.stop();

writeln( "array slice:" );
writeln( "  max(c) = ", (max reduce c) );
writeln( "  time   = ", timer.elapsed() );

Then, on my environment (see below), I get

$ ./a.out --L=500
explicit loop:
  max(c) = 250.5
  time   = 0.002458
array slice:
  max(c) = 250.5
  time   = 0.672836

$ ./a.out --L=1000
explicit loop:
  max(c) = 500.5
  time   = 0.009515
array slice:
  max(c) = 500.5
  time   = 2.67285

$ ./a.out --L=2000
explicit loop:
  max(c) = 1000.5
  time   = 0.036997
array slice:
  max(c) = 1000.5
  time   = 10.5961

This is probably due to the small value of S (= 10) and also the overhead of reduce. (My "top" command shows that the explicit-loop part runs with 1 thread (= 100% CPU load), while the slice + reduce part runs with 4 threads (= 400% CPU load).) For this short extent, a serial loop over 'k' seems more efficient, but the use of array slices and reduction automatically makes it parallel. So, although the latter code is more readable (to me), it should be avoided for performance in this case. I guess if we had a "serial reduction" function (separate from reduce), the compact expression and efficiency could be achieved simultaneously.
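As a hedged sketch of that idea, Chapel's serial statement suppresses parallelism dynamically enclosed within it, so the reduction body runs with one task:

// Inside the i1/i2 loops above, force the reduction to run serially:
serial do
    c[ i1, i2 ] = ( max reduce ( a[ i1, .. ] * b[ .., i2 ] ) );

Note this avoids the oversubscribed tasks but not the per-iteration slice-creation overhead, so it may not close the whole gap.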

(This example mixes both array slices and reduction, so I guess it is not well suited to studying the performance of slices alone. I will try a different code later without reduction.)

-- Here is the output of printchplenv (on OS X 10.11, Chapel 1.16 with --fast, compiled from source with GNU).

CHPL_TARGET_PLATFORM: darwin
CHPL_TARGET_COMPILER: gnu +
CHPL_TARGET_ARCH: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_MAKE: make
CHPL_ATOMICS: intrinsics
CHPL_GMP: gmp
CHPL_HWLOC: hwloc
CHPL_REGEXP: re2
CHPL_WIDE_POINTERS: struct
CHPL_AUX_FILESYS: none

bradcray commented 6 years ago

I'm not sure how to take the max along the row without slicing.

A way to do this without slicing would be:

var maxInRow = max reduce [col in 1..numCols] A[k, col];

Ultimately, if you wanted to compute the max value in every row, you'd want to use a partial reduction (one that reduces a 2D array to a 1D array of results). But unfortunately those are not implemented yet. :(
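In the meantime, a hedged workaround sketch that loops over rows explicitly (numRows/numCols and A are assumed for illustration):

config const numRows = 100, numCols = 35;
var A: [1..numRows, 1..numCols] real;

// Emulate a row-wise partial reduction with an explicit outer loop.
var rowMax: [1..numRows] real;
forall r in 1..numRows do
    rowMax[r] = max reduce [col in 1..numCols] A[r, col];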

bradcray commented 6 years ago

@ty1027:

It'd be interesting to compare:

for i1 in 1..L {
    for i2 in 1..L {
        c[ i1, i2 ] = ( max reduce ( a[ i1, .. ] * b[ .., i2 ] ) );
    }
}

with:

for[all] (i1, i2) in {1..L, 1..L} do
  c[i1, i2] = max reduce (for k in 1..S do (a[i1, k] * b[k, i2]));

(parallel or serial outer loop, no slicing, and serialized reduction) and/or:

for[all] (i1, i2) in {1..L, 1..L} do
  c[i1, i2] = max reduce [k in 1..S] (a[i1, k] * b[k, i2]);

(parallel or serial outer loop, no slicing, and parallel reduction).

If parallelism is applied at the outer loop, additional parallelism should be squashed in inner loops if all the cores are busy (as they should be) with only minor (hopefully negligible) execution-time overhead.

Another way to throttle parallelism, alluded to in my earlier response, is to use the hooks for controlling the degree of data parallelism.
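Concretely, those hooks are built-in config constants that can be set at execution time; a sketch of typical usage (the values here are illustrative):

./a.out --dataParTasksPerLocale=1      # one task for data-parallel loops
./a.out --dataParMinGranularity=1000   # require >= 1000 iterations per task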

buddha314 commented 6 years ago

Cool, both solutions are elegant, though I don't think either would have occurred to me. I see @bradcray is using for[all] instead of forall. What's the difference? Will this work on sparse matrices as well?

buddha314 commented 6 years ago

Pinging @ben-albrecht about the sparse stuff. Could you use the explicit loop above on a sparse matrix?

bradcray commented 6 years ago

using for[all] instead of forall.

This isn't actual syntax; it was just meant to suggest you may want to use either for or forall there, depending on whether you want that loop to be parallel or not.
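In other words, the two concrete spellings (reusing the declarations from ty1027's program above):

for (i1, i2) in {1..L, 1..L} do       // serial outer loop
    c[i1, i2] = max reduce (for k in 1..S do (a[i1, k] * b[k, i2]));

forall (i1, i2) in {1..L, 1..L} do    // parallel outer loop
    c[i1, i2] = max reduce (for k in 1..S do (a[i1, k] * b[k, i2]));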