cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library
https://cp2k.github.io/dbcsr/
GNU General Public License v2.0

Evaluate USE_ACCEL=opencl #683

Open hfp opened 1 year ago

hfp commented 1 year ago

Please evaluate USE_ACCEL=opencl and, ideally, share some feedback. There are tuned parameters for the following GPUs: P100, V100, A100-40GB, A100-80GB, H100, and PVC. For practically all GPU vendors, OpenCL is simply part of the "native" or preferred GPU runtime installation, e.g., installing CUDA installs Nvidia's OpenCL runtime as well. The OpenCL backend in DBCSR does not bail out for kernels without tuned parameters, and it carries tuned defaults for common GPUs, i.e., tuned parameters are not strictly necessary.

Standalone DBCSR has equal support for CUDA and OpenCL, except that the OpenCL backend does not fall back to larger GPU-supported GEMMs. In CP2K, OpenCL can be used as well, but only as far as DBCSR is concerned. CP2K can, however, use DBCSR with OpenCL and rely on CUDA for everything else (tested on Nvidia platforms). "Everything else" means GRID, DBM, DBT, FFT, and CUDA-enabled dependencies like ELPA or COSMA. For these dependencies, SYCL or OpenMP support for GPUs may be available as well.

On Nvidia-based platforms (not HIP), some HPC deployments set the GPUs to "exclusive mode" (see nvidia-smi), which means that OpenCL-enabled applications cannot be used with multiple ranks per GPU. This restriction can be lifted easily, but it requires the site to either change the compute mode or give users an option to toggle it.
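
To check whether a node is affected before running with several ranks per GPU, the compute mode can be read from nvidia-smi. The following is a minimal sketch, not part of DBCSR: the helper names (`parse_compute_modes`, `exclusive_gpus`, `query_compute_modes`) are made up for illustration, and it assumes that nvidia-smi is on the PATH and prints one `index, mode` pair per line for the query shown.

```python
import shutil
import subprocess

# Assumption: nvidia-smi accepts this query and prints lines such as
# "0, Default" or "1, Exclusive_Process".
QUERY = ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"]


def parse_compute_modes(text):
    """Map GPU index to compute mode, given 'index, mode' CSV lines."""
    modes = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def exclusive_gpus(modes):
    """Return the sorted indices of GPUs whose mode starts with 'Exclusive'."""
    return sorted(i for i, m in modes.items() if m.lower().startswith("exclusive"))


def query_compute_modes():
    """Run nvidia-smi; return None if it is missing or the call fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        return None
    return parse_compute_modes(result.stdout)


if __name__ == "__main__":
    modes = query_compute_modes()
    if modes is None:
        print("nvidia-smi not available; cannot determine the compute mode")
    else:
        blocked = exclusive_gpus(modes)
        if blocked:
            print(f"GPUs in exclusive mode (one rank per GPU with OpenCL): {blocked}")
        else:
            print("No GPU is in exclusive mode")
```

If a GPU is reported as exclusive, the mode can usually be reset with `nvidia-smi -c DEFAULT`, which typically needs administrative privileges; this is why the site has to be involved.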

Ideally, the outcome of an evaluation can be used to guide future development or contributions.