WIP: Add GPU aware MPI support in cannon algorithm

gsitaram commented 1 year ago

If CUDA or HIP backend is enabled, --WITH-DBCSR-G2G CMake option enables the following:

Use device arrays in MPI transfers in the cannon algorithm implementation.
Move norms calculation to GPU
Transfer the matrix images once to GPU and transpose the B matrix once. All MPI Isend and Irecv calls will transfer the data in device buffers instead of host buffers.
MPI transfers and multiplications remain overlapped

Requirements:

An MPI implementation that supports GPU aware communication.

Need help with the following:

The OpenCL build fails with "undefined reference to c_calculate_norms" error, @hfp, could you help me fix it for the OpenCL backend?
The ROCm build fails with undefined references in OpenMPI. I am guessing we have to fix the CI build to include ROCm aware support when building UCX that OpenMPI depends on.
I need to include AMD copyright lines in each file modified, but this does not seem acceptable for the pre-commit script. Please let me know how to get around this.

jenkins-cscs commented 1 year ago

Can one of the admins verify this patch?

codecov[bot] commented 1 year ago

Codecov Report

Attention: 64 lines in your changes are missing coverage. Please review.

Comparison is base (5d92807) 67.0% compared to head (b76c8bd) 66.8%. Report is 129 commits behind head on develop.

:exclamation: Current head b76c8bd differs from pull request most recent head 0c37025. Consider uploading reports for the commit 0c37025 to get more accurate results

Files	Patch %	Lines
src/mm/dbcsr_mm_common.F	0.0%	55 Missing :warning:
src/mm/dbcsr_mm.F	25.0%	9 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## develop #647 +/- ## ========================================= - Coverage 67.0% 66.8% -0.2% ========================================= Files 105 105 Lines 29121 29188 +67 ========================================= + Hits 19521 19523 +2 - Misses 9600 9665 +65 ``` | [Flag](https://app.codecov.io/gh/cp2k/dbcsr/pull/647/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=cp2k) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/cp2k/dbcsr/pull/647/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=cp2k) | `66.8% <4.4%> (-0.2%)` | :arrow_down: | | [with-blas](https://app.codecov.io/gh/cp2k/dbcsr/pull/647/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=cp2k) | `66.8% <4.4%> (-0.2%)` | :arrow_down: | | [with-libxsmm](https://app.codecov.io/gh/cp2k/dbcsr/pull/647/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=cp2k) | `66.2% <4.7%> (-0.2%)` | :arrow_down: | | [with-mpi](https://app.codecov.io/gh/cp2k/dbcsr/pull/647/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=cp2k) | `66.9% <4.4%> (-0.2%)` | :arrow_down: | | [with-openmp](https://app.codecov.io/gh/cp2k/dbcsr/pull/647/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=cp2k) | `65.7% <4.4%> (-0.2%)` | :arrow_down: | | [without-mpi](https://app.codecov.io/gh/cp2k/dbcsr/pull/647/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=cp2k) | `66.0% <4.4%> (-0.2%)` | :arrow_down: | | [without-openmp](https://app.codecov.io/gh/cp2k/dbcsr/pull/647/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=cp2k) | `65.8% <4.4%> (-0.2%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=cp2k#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

hfp commented 1 year ago

The OpenCL build fails with "undefined reference to c_calculate_norms" error, @hfp, could you help me fix it for the OpenCL backend?

Yes, I will help with this. I think it should not be a blocker for merge.

Btw, is there a way to perform atomic FP-add on (certain) AMD GPUs using OpenCL? It seems the CUDA based code path does that but everything I tried in OpenCL did not work for me like C11 atomics, legacy builtins, or even cheating by guessing a prototype function and calling it. I found some doc for inline assembly in OpenCL but I am a bit hesitant to adopt it (like reading all the arch specific doc for AMD GPUs). For NVidia btw I am using PTX inline assembly in OpenCL.

hfp commented 1 year ago

Regarding the failing check of file headers, I propose to put copyright lines in front of the LICENSE file like this, and to stick with generic

! Copyright (C) by the DBCSR developers group - All rights reserved                                !

Otherwise we are back to the business of duplicating author notes as part of the source code. I think authorship is recorded as part of repository's metadata and no one is "left behind". As an extension, I also propose to drop AUTHORS file.

A general overhaul is to have a file extension .md for LICENSE (and perhaps similar files).

( sorry the proposal is not exactly related and can be perhaps a separate PR )

gsitaram commented 1 year ago

The OpenCL build fails with "undefined reference to c_calculate_norms" error, @hfp, could you help me fix it for the OpenCL backend?

Yes, I will help with this. I think it should not be a blocker for merge.

Thank you so much.

Btw, is there a way to perform atomic FP-add on (certain) AMD GPUs using OpenCL? It seems the CUDA based code path does that but everything I tried in OpenCL did not work for me like C11 atomics, legacy builtins, or even cheating by guessing a prototype function and calling it. I found some doc for inline assembly in OpenCL but I am a bit hesitant to adopt it (like reading all the arch specific doc for AMD GPUs). For NVidia btw I am using PTX inline assembly in OpenCL

I will find out more details and inform you.

gsitaram commented 1 year ago

Regarding the failing check of file headers, I propose to put copyright lines in front of the LICENSE file like this, and to stick with generic
! Copyright (C) by the DBCSR developers group - All rights reserved                                !
Otherwise we are back to the business of duplicating author notes as part of the source code. I think authorship is recorded as part of repository's metadata and no one is "left behind". As an extension, I also propose to drop AUTHORS file.

A general overhaul is to have a file extension .md for LICENSE (and perhaps similar files).

( sorry the proposal is not exactly related and can be perhaps a separate PR )

I like this proposal of having a separate LICENSE file. If all DBCSR maintainers agree, I can make the required changes (keep only the DBCSR Developers Group copyright line in each file, and add a new license file with the entire text of the license and copyright lines above that.)

gsitaram commented 1 year ago

@hfp, it does not seem like there are atomic add operations in OpenCL for FP type data. You may have to use compare and swap instead to accomplish the same. I found this blog showing an example, but this is not recent. I see that in the newer OpenCL (3.0) spec, the atomic_* operations are deprecated, and we have to use a combination of memory order and fence operations.. but I am not very familiar with this set of functions. If I find any recent examples somewhere, I'll be sure to point those to you.

hfp commented 1 year ago

Please allow me to share my suggestions for this PR:

Let's decide on the license banner etc solely based on input from @alazzaro. I would avoid surveying all/past authors. Also, the suggested change is not about dropping anyone's contribution; it's fully recorded in the GitHub repository and we only avoid duplicating this information along with exposing potentially outdated contact data.
I adjusted upstream master to provide a stub implementation of c_calculate_norms for the OpenCL backend. I hope I got the return code correct that allows to bail-out at runtime and to hopefully fall-back to host as long as c_calculate_norms is not implemented with an own OpenCL kernel. So, please consider to rebase or merge the upstream master.
I suggest rethinking the file extension of calculate_norms.cpp. I believe .cu might require less changes in the build system(s). However, this is not too important.

In my integration test (#649), I changed including some header files in calculate_norms.cpp from:

#if defined(__CUDA)
#  include "acc/cuda/acc_cuda.h"
#elif defined(__HIP)
#  include "acc/hip/acc_hip.h"
#endif
#include "libsmm_acc_init.h"

... to the following:

#if defined(__CUDA)
#  include "../cuda/acc_cuda.h"
#elif defined(__HIP)
#  include "../hip/acc_hip.h"
#endif
#include "libsmm_acc_init.h"

gsitaram commented 1 year ago

@hfp, I merged the master into this branch and changed the path to those header files as you suggested. As for the name of the file, hipcc takes .cpp extension, and nvcc can compile the .cpp file with the option -x cuda. So, I didn't change it. If the changes introduced to the build system are not acceptable, please feel free to change. After I got it to build, I didn't touch it. I'll discuss with Alfio and see how to proceed with the license blob. Thanks for your help in getting the opencl backend to accept these changes.

alazzaro commented 1 year ago

@gsitaram Thanks for this PR and sorry for the late reply (I would like to thank @hfp for his contribution too).

I'm putting here some of the topics/comments:

[x] Definitely I agree with @hfp proposal, so we can change our LICENSE file. @dev-zero comments?
[x] Understand how we can test it. Currently we cannot on github, so we can use the CI on Daint
[x] Move file calculate_norms.cpp to cuda_hip directory

I will add some specific comments in the files.

gsitaram commented 1 year ago

@alazzaro , please check if the previous commit reflects our discussion about avoiding the g2g path for all data types other than real_8.

hfp commented 1 year ago

I think Gina was right with the previous location of c_calculate_norms (under src/acc/libsmm_acc). That's the "library" also hosting transpose and multiply (in fact it's also CUDA/HIP-only). Another sign of calculate_norms.cpp belonging to LIBSMM_ACC is the latter only contains infrastructure-code like mem, stream, error, etc., and never contained calculation routines. Further, calculate_norms.cpp depends on libsmm_acc_init.h (acc_get_gpu_warp_size), which is now awkward after moving it to cuda_hip.

alazzaro commented 7 months ago

thanks @gsitaram !

cp2k / dbcsr

WIP: Add GPU aware MPI support in cannon algorithm #647

Codecov Report