ParRes / Kernels

This is a set of simple programs that can be used to explore the features of a parallel platform.
https://groups.google.com/forum/#!forum/parallel-research-kernels

Add multi-threaded rust kernels for nstream/transpose/dgemm #614

Closed by s-sajid-ali 1 year ago

s-sajid-ali commented 1 year ago

New PRK implementation checklist

Which kernels are implemented?

Documentation and build examples

Added the required dependencies to the Cargo.toml files, so cargo fetches and builds them automatically.

Do you certify that your contribution is made in good faith and does not attempt to introduce any negative behavior into this project?

Additional Changes

Fixed a minor issue with nstream-kokkos.cc to account for changes introduced as part of the 3.7.00 release.

Performance overview of the new kernels on an M1 Max MacBook Pro:

nstream:

All results were obtained using 10 iterations over 64 million elements.

| Language/Kernel | Rate (MB/s) | Comments |
|---|---|---|
| RUST/nstream-iter | 77813.238 | serial |
| RUST/nstream-unsafe | 78711.622 | serial |
| RUST/nstream | 78955.383 | serial |
| C1z/nstream | 80771.444 | serial |
| C1z/nstream-mpi | 136195.568 | 16 MPI ranks |
| C1z/nstream-petsc | 159297.680 | 16 MPI ranks |
| RUST/nstream-rayon | 162295.541 | thread-parallel |
| C1z/nstream-openmp | 163650.220 | thread-parallel |
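
For readers unfamiliar with the kernel, here is a minimal serial sketch of what an nstream pass and its rate calculation look like. It is not the code from this PR: the function names `nstream_pass` and `rate_mb_per_s` are hypothetical, the problem size is much smaller than the 64 million elements used above, and the byte count (four f64 accesses per element) is my understanding of the usual PRK convention rather than something stated in this thread.

```rust
use std::time::Instant;

/// One nstream (triad) pass: a[i] += b[i] + scalar * c[i].
/// Hypothetical helper, written with iterators in the spirit of nstream-iter.
fn nstream_pass(a: &mut [f64], b: &[f64], c: &[f64], scalar: f64) {
    for ((ai, bi), ci) in a.iter_mut().zip(b).zip(c) {
        *ai += *bi + scalar * *ci;
    }
}

/// Rate in MB/s, assuming 4 f64 accesses per element
/// (read a, b, c and write a). This convention is an assumption.
fn rate_mb_per_s(length: usize, avg_seconds: f64) -> f64 {
    let bytes = 4.0 * std::mem::size_of::<f64>() as f64 * length as f64;
    1.0e-6 * bytes / avg_seconds
}

fn main() {
    let length = 1 << 20; // illustrative size only
    let iterations = 10;
    let scalar = 3.0;

    let mut a = vec![0.0_f64; length];
    let b = vec![2.0_f64; length];
    let c = vec![2.0_f64; length];

    let start = Instant::now();
    for _ in 0..iterations {
        nstream_pass(&mut a, &b, &c, scalar);
    }
    let avg = start.elapsed().as_secs_f64() / iterations as f64;

    // Each pass adds b + scalar * c = 8.0 to every element of a.
    let expected = 8.0 * iterations as f64;
    assert!(a.iter().all(|&x| (x - expected).abs() < 1e-8));

    println!("Rate (MB/s): {:.3}", rate_mb_per_s(length, avg));
}
```

The thread-parallel variants in the table split the same loop across threads; that part is not shown here so the sketch stays free of external crates.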

transpose:

All results were obtained using 10 iterations over a matrix of order 16384 with a tile size of 32.

| Language/Kernel | Rate (MB/s) | Comments |
|---|---|---|
| C1z/transpose | 2538.038 | serial |
| RUST/transpose | 2552.979 | serial |
| RUST/transpose-iter | 10345.623 | serial |
| C1z/transpose-openmp | 17211.489 | thread-parallel |
| RUST/transpose-rayon | 44669.385 | thread-parallel |
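
To make the "tile size" parameter concrete, here is a rough serial sketch of a tiled transpose pass. It is an illustration under my own assumptions, not the implementation in this PR: `transpose_tiled` is a hypothetical name, the matrices are stored row-major in flat vectors, and the update (B += A^T followed by A += 1) is my understanding of the PRK transpose kernel.

```rust
/// One tiled transpose pass over row-major `order x order` matrices:
/// b += transpose(a), then a += 1.0 for every element read.
/// Hypothetical helper; the blocking keeps both access patterns cache-friendly.
fn transpose_tiled(a: &mut [f64], b: &mut [f64], order: usize, tile: usize) {
    assert!(tile > 0, "tile size must be positive");
    assert_eq!(a.len(), order * order);
    assert_eq!(b.len(), order * order);

    for it in (0..order).step_by(tile) {
        for jt in (0..order).step_by(tile) {
            // The min() handles orders that are not a multiple of the tile size.
            for i in it..(it + tile).min(order) {
                for j in jt..(jt + tile).min(order) {
                    b[i * order + j] += a[j * order + i];
                    a[j * order + i] += 1.0;
                }
            }
        }
    }
}

fn main() {
    let order = 64; // illustrative; the results above used order 16384
    let tile = 32;
    let iterations = 3;

    let mut a: Vec<f64> = (0..order * order).map(|k| k as f64).collect();
    let mut b = vec![0.0_f64; order * order];

    for _ in 0..iterations {
        transpose_tiled(&mut a, &mut b, order, tile);
    }

    // After k passes: b[i][j] = k * a0[j][i] + k * (k - 1) / 2.
    let k = iterations as f64;
    let offset = k * (k - 1.0) / 2.0;
    for i in 0..order {
        for j in 0..order {
            let expected = k * (j * order + i) as f64 + offset;
            assert!((b[i * order + j] - expected).abs() < 1e-8);
        }
    }
    println!("Solution validates");
}
```

A thread-parallel version would distribute the outer tile loop across threads, since each (it, jt) tile touches a disjoint block of `b` and of `a`.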

dgemm:

All results were obtained using 10 iterations over a matrix of order 1024.

| Language/Kernel | Rate (MB/s) | Comments |
|---|---|---|
| CXX11/dgemm-vector | 6407.38 | serial |
| CXX11/dgemm | 6840.13 | serial |
| RUST/dgemm | 12054.231 | serial |
| RUST/dgemm-iter | 12257.906 | serial |
| RUST/dgemm-blis | 49340.242 | unknown |
| RUST/dgemm-rayon | 92549.489 | thread-parallel |

jeffhammond commented 1 year ago

This is really cool. Thank you. Give me a few days to test it.

jeffhammond commented 1 year ago

Results look fantastic. I have no issues building on Apple M1, Tiger Lake (Ubuntu 22) or Orin (Ubuntu 20).

Thanks for the contribution. If you want to do stencil, it's probably about the same work as transpose or dgemm. I don't know enough about Rayon, but p2p should be feasible using either the task or hyperplane design shown in C1z (or Cxx11).
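
As a rough illustration of the hyperplane idea for p2p, the sketch below sweeps a grid by anti-diagonals and checks the result against a plain row-major sweep. It is serial and uses only the standard library; the function names, the flat row-major layout, and the boundary values are assumptions made for this example, and the recurrence is my understanding of the p2p kernel rather than code taken from the repository.

```rust
/// Boundary setup assumed for this sketch: first row holds j, first column holds i.
fn init_grid(n: usize, m: usize) -> Vec<f64> {
    let mut grid = vec![0.0_f64; n * m];
    for j in 0..m {
        grid[j] = j as f64;
    }
    for i in 0..n {
        grid[i * m] = i as f64;
    }
    grid
}

/// Reference sweep in row-major order.
fn sweep_row_major(grid: &mut [f64], n: usize, m: usize) {
    for i in 1..n {
        for j in 1..m {
            grid[i * m + j] =
                grid[(i - 1) * m + j] + grid[i * m + j - 1] - grid[(i - 1) * m + j - 1];
        }
    }
}

/// Hyperplane sweep: visit cells by anti-diagonal d = i + j.
/// Every cell on diagonal d depends only on diagonals d - 1 and d - 2,
/// so the inner loop has no dependencies between its iterations.
fn sweep_hyperplane(grid: &mut [f64], n: usize, m: usize) {
    assert!(n >= 2 && m >= 2, "grid must be at least 2 x 2");
    for d in 2..=(n - 1) + (m - 1) {
        let i_lo = if d > m - 1 { d - (m - 1) } else { 1 };
        let i_hi = (d - 1).min(n - 1);
        for i in i_lo..=i_hi {
            let j = d - i;
            grid[i * m + j] =
                grid[(i - 1) * m + j] + grid[i * m + j - 1] - grid[(i - 1) * m + j - 1];
        }
    }
}

fn main() {
    let (n, m) = (6, 9);

    let mut reference = init_grid(n, m);
    sweep_row_major(&mut reference, n, m);

    let mut grid = init_grid(n, m);
    sweep_hyperplane(&mut grid, n, m);

    assert_eq!(grid, reference);
    // With these boundary values every cell ends up equal to i + j.
    assert_eq!(grid[n * m - 1], (n + m - 2) as f64);
    println!("hyperplane sweep matches row-major sweep");
}
```

Because the cells on one diagonal are independent, the inner loop is the natural place to introduce parallelism; how best to express that with Rayon is left open here.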