chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Improve PRKs #6162

Open ben-albrecht opened 7 years ago

ben-albrecht commented 7 years ago

Here is a meta-issue to track progress on the implementations of Intel's Parallel Research Kernels in Chapel.

Resources

General

Implementations

Stencil

Transpose

Synch_p2p

DGEMM

DGEMM is distributed in its current state, but it does not use SUMMA. Note that the PRK spec does not prescribe an algorithm, but the MPI1 implementation is based on SUMMA.

Maintaining multiple implementations would be useful (see @e-kayrakli's comment below)
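
For reference, the core SUMMA loop can be sketched roughly as follows. This is a serial, language-neutral simulation in Python, not Chapel and not the PRK reference code. The names `summa` and `matmul_add`, the square `p x p` grid, and the nested-list block layout are all assumptions made for illustration. The "broadcasts" are plain reads here; in a distributed version they would be communication along a grid row or column.

```python
def matmul_add(c, a, b):
    """Accumulate c += a * b for small dense blocks stored as lists of lists."""
    for i in range(len(a)):
        for k in range(len(b)):
            aik = a[i][k]
            for j in range(len(b[0])):
                c[i][j] += aik * b[k][j]


def summa(a_blocks, b_blocks, p, bs):
    """Serial simulation of SUMMA on a hypothetical p x p grid of bs x bs blocks.

    a_blocks[i][k] and b_blocks[k][j] are the blocks owned by grid
    positions (i, k) and (k, j); the result has the same block layout.
    """
    c_blocks = [[[[0.0] * bs for _ in range(bs)] for _ in range(p)]
                for _ in range(p)]
    for k in range(p):                    # one outer step per block panel
        for i in range(p):
            a_panel = a_blocks[i][k]      # "broadcast" along grid row i
            for j in range(p):
                b_panel = b_blocks[k][j]  # "broadcast" along grid column j
                matmul_add(c_blocks[i][j], a_panel, b_panel)
    return c_blocks
```

Each grid position only ever touches its own block of C, which is what makes the panel broadcasts the sole communication in a real distributed version.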

PIC

Sparse

NStream

AMR

A variation of Stencil that spawns subgrids to emulate adaptive mesh refinement

Branch

Very simple one that tests branch performance

Random

Reduce

Note: "Reduce" may be a misnomer, as it seemingly does an element-wise vector addition where the vectors live at specific locations in memory.
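
To illustrate the note on Reduce, here is a minimal sketch in Python (not Chapel) of that kind of operation, assuming the vectors are stored back to back in one flat buffer and summed into the first one. The function name `reduce_in_place` and the buffer layout are assumptions for illustration, not taken from the PRK sources.

```python
def reduce_in_place(buf, n, num_vectors):
    """Add vectors 1..num_vectors-1 element-wise into vector 0.

    All vectors have length n and are assumed to sit back to back
    in the flat list buf, so vector v starts at offset v * n.
    """
    for v in range(1, num_vectors):
        base = v * n
        for i in range(n):
            buf[i] += buf[base + i]
    return buf[:n]


# Three vectors of length 3 packed into one buffer.
packed = [1, 2, 3, 10, 20, 30, 100, 200, 300]
total = reduce_in_place(packed, 3, 3)
```

The result is a vector rather than a scalar, which is why the name can be misleading.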

e-kayrakli commented 7 years ago

I have been working on Transpose recently and wanted to capture what is missing in the current implementation:

The PRK specification and the reference MPI1 implementation use column-major arrays for both matrices and a column-wise data decomposition. The output array is then accessed in column-major order, while the input is accessed in row-major order. The current Transpose implementation in Chapel does things rather haphazardly in this context. Given that there is no native column-major layout in Chapel (yet?), I think the arrays can be distributed with a row-wise decomposition and the access orders reversed (row-major on the output array) to emulate something close to the reference implementation and the spec.
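
A rough sketch of the reversed access order, written in Python rather than Chapel just to show the index arithmetic. It assumes flat row-major storage and a simple block decomposition by rows; `transpose_row_major` and `block_rows` are hypothetical names, not part of the existing implementation.

```python
def transpose_row_major(a, order):
    """Return B = A^T with both matrices stored as flat row-major lists.

    The output is written contiguously (row-major on B) while the input
    is read with stride `order`, mirroring the reference's column-major
    writes and row-major reads with the roles of rows and columns swapped.
    """
    b = [0.0] * (order * order)
    for i in range(order):        # row i of the output
        for j in range(order):    # walk that row contiguously
            b[i * order + j] = a[j * order + i]  # strided read from the input
    return b


def block_rows(order, num_locales, locale_id):
    """Rows owned by one locale under an assumed block row decomposition."""
    lo = locale_id * order // num_locales
    hi = (locale_id + 1) * order // num_locales
    return range(lo, hi)
```

In a distributed version each locale would run the outer loop only over its own `block_rows`, so the writes stay local and only the strided reads go remote.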

e-kayrakli commented 6 years ago

@ben-albrecht, looking at the issue again, I think there are a few things that can be added:

I don't think I can modify the original post, so you can interpret these however you wish and update it.

ben-albrecht commented 6 years ago

@e-kayrakli - Updated. Let me know if you see anything that could be updated further.

caizixian commented 6 years ago

Sorry, I wasn't aware of the existence of this issue. FWIW, the performance trend of Transpose as of 1.17.1 can be found in #11031.