Closed jeremiah-corrado closed 3 weeks ago
@jeremiah-corrado -- when you get the chance, could you repeat the experiment with #25712 ? I think the branch is in a good shape and it'd be good to confirm that it helps your use case (either in the OP or the full one) before proceeding with it.
Things are looking good on that branch, thanks @e-kayrakli! Here are the results of the same experiment:
| UseLocalAccess \ nx & ny | 2048 | 4096 | 8192 | 16384 |
|---|---|---|---|---|
| false | 0.059343 | 0.093687 | 0.173559 | 0.547224 |
| true | 0.061038 | 0.092108 | 0.173454 | 0.546798 |
| true (opt disabled) | 0.059772 | 0.092562 | 0.174613 | 0.546817 |
The last row shows runtimes with `--no-offset-auto-local-access` to make sure there wasn't noticeable overhead from the dynamic check, and it looks like there isn't 👍
Background
Chapel has an optimization that inserts `localAccess` calls when indexing into a distributed array in a data-parallel loop with indices that are known to be local. Using `localAccess` avoids an extra runtime check, which can have a significant impact on performance. Today, this optimization only fires when the iterated index is used directly in the array access, as in `arr[i, j]` inside a loop over `(i, j)`.

The optimization does not fire for indices that do not come directly from the iteration expression. For example, if `arr` is block-distributed, `arr[i+1, j]` will require communication when `i` is on the boundary of the executing locale's block; therefore `arr[i+1, j]` cannot be replaced by `arr.localAccess[i+1, j]`.

Proposal
When using the stencil distribution, the optimization could be applied in the above case because stencil-distributed arrays can have a "halo" region with local copies of elements from neighboring locales. This issue proposes that the auto-local-access optimization be expanded to fire when indexing into a stencil-distributed array using param-valued offsets that fall within a param-valued halo region. I.e., for the above example, if `arr` were stencil distributed with a `fluff` value of `(1, 0)` or larger, `localAccess` could be used for both access operations.

Further work could also be done to expand the optimization to fire using some runtime checks (perhaps at the start of the loop execution) when the offsets and/or fluff size are not known at compile time.
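The situation described above can be sketched as follows. This is a minimal illustration rather than the code from the issue: the names `D`, `DInner`, `res`, and `n` are hypothetical, and it assumes a recent Chapel release in which `stencilDist.createDomain` accepts a `fluff` argument. The `localAccess` calls are written by hand here; the proposal is for the compiler to insert them.

```chapel
use StencilDist;

config const n = 8;  // hypothetical problem size

// Stencil-distributed domain with a one-element halo ("fluff") in the
// first dimension only.
const D = stencilDist.createDomain({0..<n, 0..<n}, fluff=(1, 0));

// Indices for which i+1 stays inside the global domain.
const DInner = D[0..<n-1, 0..<n];

var arr, res: [D] real;
arr = 1.0;

// Refresh each locale's halo copies of its neighbors' boundary elements.
arr.updateFluff();

forall (i, j) in DInner do
  // The offset (1, 0) fits within the fluff (1, 0), so both reads can be
  // served from local memory (owned elements or halo copies).
  res[i, j] = arr.localAccess[i, j] + arr.localAccess[i+1, j];

writeln(+ reduce res[DInner]);
```

With a block distribution instead of a stencil distribution, the `arr.localAccess[i+1, j]` read would not be valid on a locale's block boundary, which is why the existing optimization leaves such accesses alone.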
Motivating Example
Consider code that uses stencil-distributed arrays to solve the 2D heat equation. There are two versions of the kernel: one that uses typical array accesses, and another that uses `localAccess` for indices that are known to fall within the array's halo region.

Note that the `fluff` argument for specifying the "halo" size is a param tuple, and the indexing offsets `+/- 1` are param values that fall within the "halo" region. Thus the compiler could conceivably recognize that all the array accesses within the kernel are local.

There is a significant performance advantage to using the `localAccess` version of this code, particularly for larger problem sizes. The following strong-scaling results were collected by running the code on 9 locales of a Cray XC, with and without the explicit use of `localAccess`:

| UseLocalAccess \ nx & ny | false (sec) | true (sec) |
|---|---|---|

Although manually inserting the `localAccess` calls is a viable way to achieve higher performance, developer productivity for stencil codes would be improved by a compiler optimization that inserts them automatically. This becomes more of a factor when working with larger and higher-dimensional stencils, where the code is easier to read and write without `localAccess`.
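For readers who want something concrete, here is a hedged sketch of the two kernel variants described in this section. It is not the code that produced the timings: the initial condition, the default sizes, and the names `space`, `spaceInner`, `u`, `un`, `nt`, and `alpha` are assumptions made for illustration, and it assumes a Chapel release that provides `stencilDist.createDomain` with a `fluff` argument.

```chapel
use StencilDist;

config const nx = 64, ny = 64,   // hypothetical defaults
             nt = 10,            // number of time steps
             alpha = 0.25;       // diffusion coefficient
config param UseLocalAccess = false;

// One halo element per dimension, matching the +/- 1 stencil offsets.
const space = stencilDist.createDomain({0..<nx, 0..<ny}, fluff=(1, 1)),
      spaceInner = space.expand(-1);

var u, un: [space] real;
u[nx/4..nx/2, ny/4..ny/2] = 1.0;  // hypothetical initial condition

for 1..nt {
  u <=> un;          // the previous result becomes this step's input
  un.updateFluff();  // refresh halo copies before reading neighbors

  forall (i, j) in spaceInner {
    if UseLocalAccess then
      // Explicit version: every index lies in the local block or its halo.
      u.localAccess[i, j] = un.localAccess[i, j] + alpha * (
        un.localAccess[i-1, j] + un.localAccess[i+1, j] +
        un.localAccess[i, j-1] + un.localAccess[i, j+1] -
        4 * un.localAccess[i, j]);
    else
      // Plain version: what one would like to write and have optimized.
      u[i, j] = un[i, j] + alpha * (
        un[i-1, j] + un[i+1, j] + un[i, j-1] + un[i, j+1] - 4 * un[i, j]);
  }
}

writeln("mean: ", (+ reduce u) / u.size);
```

Because `UseLocalAccess` is a `config param`, the branch is resolved at compile time, so each build contains only one of the two kernels. The plain version is noticeably shorter, and the gap grows with wider or higher-dimensional stencils.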