Improve PRKs

ben-albrecht commented 7 years ago

Here is a meta-issue to track progress on the implementations of Intel's Parallel Research Kernels in Chapel.

Resources

General

[ ] Clean up and improve maintainability of prk README
[x] Update directory names to reflect names used in PRKs, e.g. casing (#6405)
[x] Add contributed by header comments, giving credit to authors and contributors. (#6405)
[x] Use --correctness flag rather than --validate for clarity (#6405)

Implementations

Stencil

[ ] Consider dynamic unrolling approach used in OpenMP version as described in #6153
[ ] Hoist chpl__getPrivatizedCopy #6184

Transpose

[ ] Rewrite distributed implementation to reflect reference version
- Current implementation is naive blockDist
[ ] Enable multilocale performance testing

Synch_p2p

Current implementation does not reflect reference version

DGEMM

DGEMM is distributed in its current state, but it is not SUMMA. Note that the PRK specs do not specify an algorithm, but the MPI1 implementation is based on SUMMA.
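For readers unfamiliar with SUMMA, here is a minimal serial sketch in Python (not Chapel, and not the MPI1 reference code). It simulates a `p x p` process grid where, at step `k`, block `A[i][k]` is broadcast along grid row `i` and block `B[k][j]` along grid column `j`, and every process accumulates a local product. The function names (`split`, `matmul_add`, `summa`) are hypothetical, and the sketch assumes square matrices whose size is divisible by `p`.

```python
def matmul_add(c, a, b):
    """Local update c += a @ b for small dense blocks (lists of lists)."""
    for i in range(len(a)):
        for k in range(len(b)):
            aik = a[i][k]
            for j in range(len(b[0])):
                c[i][j] += aik * b[k][j]


def split(m, p):
    """Split an n x n matrix into a p x p grid of (n/p) x (n/p) blocks."""
    bs = len(m) // p
    return [[[row[j * bs:(j + 1) * bs] for row in m[i * bs:(i + 1) * bs]]
             for j in range(p)] for i in range(p)]


def summa(a, b, p):
    """Simulate SUMMA on a p x p grid; returns C as a grid of blocks.

    Assumption: len(a) == len(b) and the size is divisible by p.
    """
    bs = len(a) // p
    A, B = split(a, p), split(b, p)
    C = [[[[0.0] * bs for _ in range(bs)] for _ in range(p)] for _ in range(p)]
    for k in range(p):              # one outer step per block column/row
        for i in range(p):
            for j in range(p):
                # Process (i, j) "receives" A[i][k] from its grid row and
                # B[k][j] from its grid column, then updates its own block.
                matmul_add(C[i][j], A[i][k], B[k][j])
    return C


a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
ident = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
c_blocks = summa(a, ident, p=2)
print(c_blocks == split(a, 2))  # multiplying by the identity returns A's blocks
```

In a real distributed version the two inner loops run concurrently on separate processes (or locales), and the broadcasts are the only communication, which is what distinguishes this from fine-grained remote element access.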

Maintaining multiple implementations would be useful (see @e-kayrakli's comment below)

[ ] Performance testing
- Blocking #6388

PIC

[ ] Distributed implementation
[ ] Performance testing

Sparse

[ ] Performance testing

NStream

[ ] Performance testing

AMR

A variation of Stencil that spawns subgrids to emulate adaptive mesh refinement

[ ] Implement

Branch

Very simple one that tests branch performance

[ ] Implement

Random

[ ] Implement

Reduce

Note: "Reduce" may be a misnomer, as it seemingly does an element-wise vector addition where the vectors are at specific parts of memory.
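A small Python sketch of the behavior as described in the note above, not checked against the PRK spec: several vectors live at fixed offsets in one flat buffer and are added element-wise into the first one. The layout (vector `v` starting at offset `v * n`) and the name `reduce_vectors` are assumptions made for illustration.

```python
def reduce_vectors(buf, nvec, n):
    """Add vectors 1..nvec-1 element-wise into vector 0, in place.

    Assumed layout: vector v occupies buf[v*n : (v+1)*n].
    """
    for v in range(1, nvec):
        offset = v * n               # where vector v starts in the buffer
        for i in range(n):
            buf[i] += buf[offset + i]
    return buf[:n]


buf = [1, 2, 3,   10, 20, 30,   100, 200, 300]
result = reduce_vectors(buf, nvec=3, n=3)
print(result)  # [111, 222, 333]
```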

[ ] Implement

e-kayrakli commented 7 years ago

I have been working on Transpose recently and wanted to capture what is missing in the current implementation:

The PRK specifications and the reference MPI1 implementation use column-major arrays for both matrices and a column-wise data decomposition. The output array is then accessed in column-major order, whereas the input is accessed in row-major order. The current Transpose implementation in Chapel does things rather haphazardly in this respect. Given that there is no native column-major layout in Chapel (yet?), I think arrays can be distributed with row-major decomposition and the access orders can be reversed (row-major on the output array) to emulate something close to the reference implementation and the specs.
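To make the reversed access order concrete, here is a minimal serial sketch in Python (not Chapel, and not the reference code). Both matrices are flat row-major buffers. The output is walked in row-major order, so writes are contiguous and reads from the input are strided, which mirrors what the reference does with column-major storage. The function name and the flat-list layout are assumptions for illustration only.

```python
def transpose_row_major(a, rows, cols):
    """Transpose a rows x cols matrix stored as a flat row-major list."""
    b = [0.0] * (rows * cols)
    # Walk the output (cols x rows) in row-major order: contiguous writes.
    for i in range(cols):            # row index of the output
        for j in range(rows):        # column index of the output
            # Input element (j, i) lives at j*cols + i: a strided read.
            b[i * rows + j] = a[j * cols + i]
    return b


a = [1, 2, 3,
     4, 5, 6]                        # 2 x 3, row-major
b = transpose_row_major(a, rows=2, cols=3)
print(b)  # [1, 4, 2, 5, 3, 6], i.e. the 3 x 2 transpose
```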

e-kayrakli commented 6 years ago

@ben-albrecht, looking at the issue again, I think there are a few things that can be added:

Missing PRKs for completeness (some may be more important than others, like AMR):
- AMR: A variation of Stencil that spawns subgrids to emulate adaptive mesh refinement
- Branch: Very simple one that tests branch performance
- Random: Another simple one
- Reduce: At least a straightforward implementation should be simple. ("Reduce" may be a misnomer, as it seemingly does an element-wise vector addition where the vectors are at specific parts of memory.)
More clarification for DGEMM: DGEMM is distributed in its current state, but it is not SUMMA. Note that the PRK specs do not specify an algorithm, but the MPI1 implementation is based on SUMMA. FWIW, in a more proof-of-concept implementation I observed significant speedups and not-so-good scalability with a more naive approach where remote data is localized in bulk. I think in general it is good to have multiple versions (including the current one, to see fine-grained communication performance), especially for something as important as matrix multiplication.

I don't think I can modify the original post, so you can interpret these however you wish and update it.

ben-albrecht commented 6 years ago

@e-kayrakli - Updated. Let me know if you see anything that could be updated further.

caizixian commented 6 years ago

Sorry, I wasn't aware of the existence of this issue. FWIW, the performance trend of Transpose as of 1.17.1 can be found in #11031.

chapel-lang / chapel

Improve PRKs #6162