EPCCed / gpu-directives

Contains material for a course using GPU directives

Review OpenMP techniques in real codes #8

Closed lucaparisi91 closed 1 month ago

lucaparisi91 commented 2 months ago

Useful material from the ExCALIBUR programme:

I'm surprised how little there was, because I think efficiently utilising the CPU+accelerator architecture is one of the primary challenges of modern exascale machines, especially doing so in a way that is at all portable.

lucaparisi91 commented 1 month ago

GAMESS

Chemistry calculations with EFMO, a method that calls QM solvers on subsystems; the state is described by a set of molecular orbitals. A GPU node is 59x faster than a CPU node on a single node, but only 6.7x as the number of nodes increases, because of communication costs.

1) HF: two-electron integrals, contracted with the density matrix to form the Fock matrix; this is the dominant cost. They wrote a mini-app called miniERIs and ported it to GPU.
2) RI-MP2: interfaced the Fortran code to C and used C++ libraries for offloading, then incorporated it into EFMO.
3) RI-CCSD(T): tensor contraction is the bottleneck; solved using cuBLAS/hipBLAS.
4) Offloading DFT exchange-correlation: recursive partitioning of the XC code. Treated as a sparse matrix, compressed into a dense matrix and offloaded on the GPU.
5) NWChemEx (https://www.exascaleproject.org/research-project/nwchemex/) uses TAMM to offload tensor calculations to the GPU (https://github.com/NWChemEx/TAMM). It relies on Global Arrays for communications (https://github.com/GlobalArrays/ga), with ongoing work to support GPUs through CUDA-aware MPI.

Much of the strategy relies on using vendor libraries for the matrix manipulation.
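As a concrete illustration of the vendor-library route, here is a minimal sketch of driving cuBLAS through OpenMP-managed device memory (assuming an NVIDIA system and OpenMP 5.0, which allows a `target data` construct with only a `use_device_ptr` clause; the matrix shapes and the GEMM mapping of the contraction are illustrative, not taken from GAMESS):

```cpp
#include <omp.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Contract C(i,k) = sum_j A(i,j) * B(j,k) as a GEMM via cuBLAS, with
// OpenMP owning the device copies. Column-major layout, as cuBLAS expects.
void contract(int m, int n, int k, const double* A, const double* B, double* C) {
  cublasHandle_t handle;
  cublasCreate(&handle);
  const double alpha = 1.0, beta = 0.0;

  #pragma omp target data map(to: A[0:m*k], B[0:k*n]) map(tofrom: C[0:m*n])
  {
    // Translate the mapped host pointers into device pointers for cuBLAS.
    #pragma omp target data use_device_ptr(A, B, C)
    {
      cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  m, n, k, &alpha, A, m, B, k, &beta, C, m);
    }
    cudaDeviceSynchronize();  // cuBLAS is asynchronous; wait before copy-back
  }
  cublasDestroy(handle);
}
```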

lucaparisi91 commented 1 month ago

ExaHyPE

GPU implementation described in https://link.springer.com/chapter/10.1007/978-3-031-32041-5_4. They use omp_get_mapped_ptr to build a list of device pointers instead of relying on mappers. They found lower performance when launching multiple kernels from different threads, rather than the speedup one might expect. Adding the staging option means that the packet is locked (pinned?). Memory allocations on the device are expensive.
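A sketch of the device-pointer-table idea (the struct and field names are hypothetical; omp_get_mapped_ptr is an OpenMP 5.1 routine):

```cpp
#include <omp.h>
#include <vector>

struct Patch { double* u; int n; };  // hypothetical per-patch data

// Map each patch buffer once, then record its device address so that later
// kernel launches can pass raw device pointers instead of relying on a
// declare mapper to translate the struct on every launch.
std::vector<double*> build_device_table(std::vector<Patch>& patches) {
  const int dev = omp_get_default_device();
  std::vector<double*> table;
  for (auto& p : patches) {
    #pragma omp target enter data map(to: p.u[0:p.n])
    table.push_back(static_cast<double*>(omp_get_mapped_ptr(p.u, dev)));
  }
  return table;
}
```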

The number of concurrent kernels in flight needs to be small in order to obtain better performance on the GPU.

lucaparisi91 commented 1 month ago

Turbulence

A 3D distributed FFT spread across several GPUs. Needs strided memory copies, which can be done directly in CUDA; in OpenMP it is less straightforward (see the sketch below).
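For reference, a sketch of the two routes for a strided host-to-device copy (hypothetical helper functions; the strided array section in `target update` is an OpenMP 5.0 feature with uneven compiler support):

```cpp
#include <omp.h>
#include <cuda_runtime.h>

// CUDA route: cudaMemcpy2D views every 'stride'-th element of 'a' as n rows
// of one element each, with a source pitch of stride*sizeof(double), and
// packs them contiguously into the device buffer d_a.
void strided_copy_cuda(const double* a, int n, int stride, double* d_a) {
  cudaMemcpy2D(d_a, sizeof(double),
               a, stride * sizeof(double),
               sizeof(double), n, cudaMemcpyHostToDevice);
}

// OpenMP route: strided array sections are permitted in target update
// motion clauses since OpenMP 5.0. Assumes 'a' was already mapped,
// e.g. with a prior target enter data.
void strided_copy_omp(double* a, int n, int stride) {
  #pragma omp target update to(a[0:n:stride])
}
```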

lucaparisi91 commented 1 month ago

GenASIS

A class manages data and metadata, handling allocation and transfers to the device through the storageForm class. Some routines are only available in C, not in Fortran. Uses pointer remapping (how does pointer indirection work?). Currently working on metadirectives in OpenMP 5.1 to avoid code duplication between the CPU and GPU paths. OMPX_PINNED_MEM_ALLOC is an extension in some compilers to allocate pinned memory.
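A minimal sketch of the metadirective idea (the target_device selector set is OpenMP 5.1, and the `default(...)` clause was renamed `otherwise(...)` in 5.2; compiler support is still uneven, and the loop is illustrative rather than GenASIS code):

```cpp
#include <omp.h>

// One loop body specialized for host or GPU: the OpenMP 5.1 target_device
// selector picks the offloaded variant when the default device is a GPU,
// and falls back to plain host threading otherwise, so the kernel is not
// duplicated for the two paths.
void axpy(int n, double a, const double* x, double* y) {
  #pragma omp metadirective \
      when(target_device={kind(gpu)}: \
          target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])) \
      default(parallel for)
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}
```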

lucaparisi91 commented 1 month ago

Portable Numerical libraries

From ECP (Ginkgo, MAGMA, PLASMA, SLATE)

Ginkgo

Has an OpenMP backend that relies on using atomics (read, write, update). Uses normal reductions and declared reductions for sparse matrices.
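A sketch of the atomic-update pattern for the sparse case (a COO scatter; array names are illustrative, not Ginkgo's API):

```cpp
#include <omp.h>

// COO sparse matrix-vector product: different iterations may write to the
// same output row, so each accumulation is an atomic update.
void spmv_coo(int nnz, const int* row, const int* col,
              const double* val, const double* x, double* y) {
  #pragma omp parallel for
  for (int k = 0; k < nnz; ++k) {
    #pragma omp atomic update
    y[row[k]] += val[k] * x[col[k]];
  }
}
```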

MAGMA

Several hybrid algorithms; the host processor needs to be fast enough to keep up with the GPU. Dense and sparse linear algebra.

PLASMA

Based on OpenMP tasking, e.g. Cholesky inversion expressed as a chain of smaller operations with data dependencies (see the sketch below).
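A minimal sketch of the tasking pattern, assuming stand-in tile kernels (in PLASMA these would be BLAS/LAPACK tile operations such as dpotrf, dtrsm, dsyrk):

```cpp
#include <omp.h>

// Trivial stand-ins for tile kernels, so the sketch is self-contained.
static void factor(double* A) { A[0] += 1.0; }
static void solve(const double* A, double* B) { B[0] += A[0]; }
static void update(const double* B, double* C) { C[0] += B[0]; }

// A chain of smaller operations linked through depend clauses: each task
// may start as soon as the tiles it reads are ready. The implicit barrier
// at the end of the parallel region waits for all tasks to complete.
void chain(double* A, double* B, double* C) {
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task depend(inout: A[0])
    factor(A);
    #pragma omp task depend(in: A[0]) depend(inout: B[0])
    solve(A, B);
    #pragma omp task depend(in: B[0]) depend(inout: C[0])
    update(B, C);
  }
}
```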

SLATE

Distributed memory with multithreaded MPI
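For context, a minimal sketch of the initialization such a design requires (not SLATE's actual code): MPI must grant MPI_THREAD_MULTIPLE before several threads or tasks can call it concurrently.

```cpp
#include <mpi.h>
#include <cstdio>

// Request full thread support and verify the library actually grants it;
// falling back silently would make concurrent MPI calls unsafe.
int main(int argc, char** argv) {
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE) {
    std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  // ... communication from multiple threads would go here ...
  MPI_Finalize();
  return 0;
}
```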

lucaparisi91 commented 1 month ago

QMCPACK