[Feature Request]: Expand auto-local-access optimization for Stencil Distribution

jeremiah-corrado commented 1 month ago

Background

Chapel has an optimization that inserts localAccess calls when indexing into a distributed array in a data-parallel loop with indices that are know to be local. Using localAccess avoids an extra runtime check which can have a significant impact on performance. Today, this optimization only fires in cases like the following, where the iterated index is used directly in the array access:

forall idx in arr.domain do arr[idx] += 1;

The optimization does not fire for indices that do not come directly from the iteration expression. For example, in the following code, if arr is block-distributed, arr[i+1,j] will require communication when i is on the boundary of the executing locale's block, therefore arr[i+1,j] cannot be replaced by arr.localAccess[i+1,j]:

forall (i, j) in arr.domain.expand(-1) do arr[i, j] = 2 * arr[i+1,j];

Proposal

When using the stencil distribution, the optimization could be applied in the above case because stencil distributed arrays can have a "halo" region with local copies of elements from neighboring locales. This issue proposes that the auto-local-access optimization be expanded to fire when indexing into a stencil-distributed array using param-valued offsets that fall within a param-valued halo region. I.e., for the above example, if arr were stencil distributed with a fluff value of (1, 0) or larger, localAccess could be used for both access operations.

Further work could also be done to expand the optimization to fire using some runtime checks (perhaps at the start of the loop execution), when the offsets and/or fluff size are not known at compile time.

Motivating Example

Consider the following code that uses stencil distributed arrays to solve the 2D heat equation. There are two versions of the kernel: one that uses typical array accesses, and another that uses localAccess for indices that are known to fall within the array's halo region.

Note that the fluff argument for specifying the "halo" size is a param tuple, and the indexing offsets +/- 1 are param values that fall within the "halo" region. Thus the compiler could conceivably recognize that all the array accesses within the kernel are local.

use StencilDist;
use Time;

config param UseLocalAccess = false;

config const nx, ny = 2048,
             nt = 100,
             alpha = 0.25;

const space = stencilDist.createDomain({0..<ny, 0..<nx}, fluff=(1,1)),
      spaceInner = space.expand(-1);

proc main() {
  var u1, u2: [space] real = 1.0;
  u1[ny/2..3*ny/4, nx/2..3*nx/4] = 2.0;

  var sw = new stopwatch();
  sw.start();

  for 1..nt {
    u1 <=> u2;
    u2.updateFluff();
    heatStencil(u1, u2);
  }

  writeln("time: ", sw.elapsed(), " sec");
  writeln("u mean: ", + reduce u1 / u1.size);
}

proc heatStencil(ref u: [] real, const ref un: [] real)
  where UseLocalAccess == false
{
  forall (i, j) in spaceInner {
    u[i, j] = un[i, j] + alpha * (
                -4 * un[i, j]
                + un[i+1, j] + un[i-1, j]
                + un[i, j+1] + un[i, j-1]
              );
  }
}

proc heatStencil(ref u: [] real, const ref un: [] real)
  where UseLocalAccess == true
{
  forall (i, j) in spaceInner {
    u[i, j] = un[i, j] + alpha * (
                -4 * un[i, j]
                + un.localAccess[i+1, j] + un.localAccess[i-1, j]
                + un.localAccess[i, j+1] + un.localAccess[i, j-1]
              );
  }
}

There is a significant performance advantage to using the localAccess version of this code, particularly for larger problem sizes. The following strong-scaling results were collected by running the above code on 9 locales of a Cray XC, with and without the explicit use of localAccess:

`UseLocalAccess` \ `nx`&`ny`	2048	4096	8192	16384
`false` (sec)	0.085544	0.183723	0.501507	1.59811
`true` (sec)	0.061522	0.087483	0.125183	0.439103
% improvement	28.1	52.4	75.0	72.5

Although manually inserting the localAccess calls is a viable solution to achieve higher performance, developer productivity for developing stencil codes would be improved by introducing a compiler optimization that inserts them automatically. This becomes more of a factor when working with larger and higher dimensional stencils where the code is easier to read/write without localAccess.

e-kayrakli commented 1 month ago

@jeremiah-corrado -- when you get the chance, could you repeat the experiment with #25712 ? I think the branch is in a good shape and it'd be good to confirm that it helps your use case (either in the OP or the full one) before proceeding with it.

jeremiah-corrado commented 1 month ago

Things are looking good on that branch, thanks @e-kayrakli! Here are the results of the same experiment:

`UseLocalAccess` \ `nx`&`ny`	2048	4096	8192	16384
`false`	0.059343	0.093687	0.173559	0.547224
`true`	0.061038	0.092108	0.173454	0.546798
`true` (opt disabled)	0.059772	0.092562	0.174613	0.546817

The last row shows runtimes with --no-offset-auto-local-access to make sure there wasn't noticeable overhead from the dynamic check, and it looks like there isn't 👍

chapel-lang / chapel