cms-patatrack / pixeltrack-standalone

Standalone Patatrack pixel tracking
Apache License 2.0

[alpaka] Support CUDA or ROCm/HIP #342

Closed fwyzard closed 2 years ago

fwyzard commented 2 years ago

Implement the changes for building the alpakatest and alpaka applications with support for either CUDA or HIP/ROCm.

Host code changes

  1. Add Cuda and Hip types

Since I hope to be able to enable both CUDA and HIP/ROCm at some point in the future, I've decided to split the relevant Alpaka types now. Alpaka itself makes a mixed effort on this:

I've added these last ones in src/alpaka/alpaka/alpakaExtra.hpp, with the intention of moving them into Alpaka itself sooner or later.

  2. Replace the use of the UniformCudaHipRt types with the explicit CudaRt types

  3. Add similar code paths and definitions for the HipRt types

I've duplicated all CUDA-specific code that was using either the alpaka_cuda_async namespace or the ALPAKA_ACC_GPU_CUDA_ENABLED macro, adding equivalent HIP/ROCm code that uses the alpaka_rocm_async namespace and the ALPAKA_ACC_GPU_HIP_ENABLED macro.
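
As an illustration only, the duplication pattern might look like the sketch below. It is not code from this PR: `describe()` and `enabledBackends()` are hypothetical names, and the two macros are defined by hand (both at once, which this PR does not yet support) purely so that the sketch compiles without Alpaka or a GPU toolchain.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Normally set by the build system. Defined here by hand (assumption: both
// backends enabled) only so that the sketch compiles on its own.
#define ALPAKA_ACC_GPU_CUDA_ENABLED
#define ALPAKA_ACC_GPU_HIP_ENABLED

#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
namespace alpaka_cuda_async {
  // Hypothetical stand-in for a piece of CUDA-specific code.
  inline std::string describe() { return "CUDA"; }
}  // namespace alpaka_cuda_async
#endif

#ifdef ALPAKA_ACC_GPU_HIP_ENABLED
namespace alpaka_rocm_async {
  // HIP/ROCm duplicate of the block above, in its own namespace.
  inline std::string describe() { return "HIP/ROCm"; }
}  // namespace alpaka_rocm_async
#endif

// Collect the names of the backends compiled into this binary.
inline std::vector<std::string> enabledBackends() {
  std::vector<std::string> backends;
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
  backends.push_back(alpaka_cuda_async::describe());
#endif
#ifdef ALPAKA_ACC_GPU_HIP_ENABLED
  backends.push_back(alpaka_rocm_async::describe());
#endif
  return backends;
}
```

Keeping each backend in its own namespace, guarded by its own macro, means that neither block depends on the other being compiled.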

  4. Update the command line options in main.cc and the list of plugins

  5. Update the code under the .../alpaka/... folders

Mostly, I've changed

#ifdef ALPAKA_ACC_GPU_CUDA_ASYNC_BACKEND

to

#if defined(ALPAKA_ACC_GPU_CUDA_ASYNC_BACKEND) || defined(ALPAKA_ACC_GPU_HIP_ASYNC_BACKEND)
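
A small sketch of why the guard needs both macros: with only the CUDA check, a HIP build silently falls through to the non-GPU branch. The two functions are hypothetical, and the HIP backend macro is defined by hand here only so that the sketch compiles standalone.

```cpp
#include <cassert>

// Assumption: the build selects one backend macro per translation unit.
// Defined by hand to mimic a HIP build without needing Alpaka.
#define ALPAKA_ACC_GPU_HIP_ASYNC_BACKEND

// Old guard: only the CUDA backend takes the GPU branch.
constexpr bool usesGpuOldGuard() {
#ifdef ALPAKA_ACC_GPU_CUDA_ASYNC_BACKEND
  return true;
#else
  return false;
#endif
}

// New guard: either GPU backend takes the GPU branch.
constexpr bool usesGpuNewGuard() {
#if defined(ALPAKA_ACC_GPU_CUDA_ASYNC_BACKEND) || defined(ALPAKA_ACC_GPU_HIP_ASYNC_BACKEND)
  return true;
#else
  return false;
#endif
}

// In this simulated HIP build the old guard misses the GPU code path.
static_assert(!usesGpuOldGuard(), "old guard ignores the HIP backend");
static_assert(usesGpuNewGuard(), "new guard covers the HIP backend");
```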

  6. Unrelated changes

There are also some unrelated changes due to clang complaining about implicitly-deleted default constructors, the inappropriate use of std::move, and some missing casts.
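
For readers unfamiliar with these diagnostics, here is a hedged illustration of the kind of code involved. None of it is taken from this PR: `Config`, `makeIndices()` and `toUnsigned()` are hypothetical, and the warning names in the comments are the ones recent clang versions use, as far as I know.

```cpp
#include <cassert>
#include <vector>

// A const member without an initializer makes a defaulted default
// constructor implicitly deleted; clang flags `Config() = default;`
// (-Wdefaulted-function-deleted). Here the constructor is spelled out.
struct Config {
  // Config() = default;  // would be implicitly deleted
  explicit Config(int n) : nStreams(n) {}
  const int nStreams;
};

// Returning a local by value already allows copy elision or an implicit
// move; wrapping it in std::move only gets in the way
// (-Wpessimizing-move).
inline std::vector<int> makeIndices(int n) {
  std::vector<int> indices;
  for (int i = 0; i < n; ++i) {
    indices.push_back(i);
  }
  // return std::move(indices);  // inappropriate use of std::move
  return indices;
}

// A narrowing conversion written as an explicit cast instead of being
// left implicit.
inline unsigned int toUnsigned(long value) { return static_cast<unsigned int>(value); }
```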

Device code changes

The two places with a lot of changes are src/alpaka/AlpakaCore/prefixScan.h and src/alpaka/AlpakaCore/radixSort.h:

HIP does not support the masked warp instructions like __shfl_up_sync; it still has the pre-CUDA 9 versions like __shfl_up, so I've #ifdefed them. Eventually the whole code should be rewritten using the primitives provided by Alpaka, and benchmarked to make sure that the rewrite does not introduce any regressions.
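
A rough sketch of that kind of #ifdef follows, with a host-side emulation of the shuffle so that its semantics can be checked without a GPU. The `SHFL_UP` macro, the choice of `__CUDACC__` / `__HIPCC__` as compiler checks, and the fixed warp size of 32 are assumptions made for illustration; the actual guards in prefixScan.h and radixSort.h may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

#if defined(__CUDACC__)
// CUDA 9 and later: masked variant, here with all 32 lanes participating.
#define SHFL_UP(value, delta) __shfl_up_sync(0xffffffff, (value), (delta))
#elif defined(__HIPCC__)
// HIP: unmasked, pre-CUDA 9 style variant.
#define SHFL_UP(value, delta) __shfl_up((value), (delta))
#endif

// Host-side emulation, for illustration only.
constexpr std::size_t kWarpSize = 32;  // assumption; AMD wavefronts can be 64 wide
using Warp = std::array<int, kWarpSize>;

// Shuffle up: lane i reads the value held by lane i - delta;
// lanes below delta keep their own value.
inline Warp shflUp(const Warp& lanes, std::size_t delta) {
  Warp out = lanes;
  for (std::size_t lane = delta; lane < kWarpSize; ++lane) {
    out[lane] = lanes[lane - delta];
  }
  return out;
}

// Warp-level inclusive prefix sum built from shuffle-up steps
// with offsets 1, 2, 4, 8, 16.
inline Warp warpInclusiveScan(Warp lanes) {
  for (std::size_t offset = 1; offset < kWarpSize; offset <<= 1) {
    const Warp shifted = shflUp(lanes, offset);
    for (std::size_t lane = offset; lane < kWarpSize; ++lane) {
      lanes[lane] += shifted[lane];
    }
  }
  return lanes;
}
```

On a GPU each lane would call `SHFL_UP(x, offset)` instead of indexing into an array; the emulation only mirrors the data movement.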

I've also added (unconditionally) the memory fence from cms-patatrack/pixeltrack-standalone#210; this too should be benchmarked to check the impact on the CUDA implementation.

fwyzard commented 2 years ago

$ ./hip --numberOfThreads 20 --numberOfStreams 20 --maxEvents 2000 --validation
Found 1 devices
Processing 2000 events, of which 20 concurrently, with 20 threads.
CountValidator: all 2000 events passed validation
 Average relative track difference 0.000922907 (all within tolerance)
 Average absolute vertex difference 0.0005 (all within tolerance)
Processed 2000 events in 6.039834e+00 seconds, throughput 331.135 events/s, CPU usage per thread: 31.2%

fwyzard commented 2 years ago

Now supports building either CUDA or ROCm/HIP:

$ make -j`nproc` alpaka ROCM_BASE= CUDA_BASE=/usr/local/cuda-11.5
...

$ source env.sh
$ ./alpaka --cuda --numberOfThreads 20 --numberOfStreams 20 --maxEvents 2000
Found 1 device:
  - NVIDIA GeForce GTX 1080 Ti
Processing 2000 events, of which 20 concurrently, with 20 threads.
Processed 2000 events in 2.096939e+00 seconds, throughput 953.771 events/s, CPU usage per thread: 64.1%

$ make clean
rm -fR /data/user/fwyzard/pixeltrack-standalone/lib /data/user/fwyzard/pixeltrack-standalone/obj /data/user/fwyzard/pixeltrack-standalone/test alpaka alpakatest cuda cudacompat cudadev cudatest cudauvm fwtest hip hiptest kokkos kokkostest serial sycltest
$ rm env.sh 
$ make -j`nproc` alpaka ROCM_BASE=/opt/rocm-5.0.2 CUDA_BASE=
...

$ source env.sh
$ ./alpaka --hip --numberOfThreads 20 --numberOfStreams 20 --maxEvents 2000
Found 1 device:
  - Radeon Pro WX 9100
Processing 2000 events, of which 20 concurrently, with 20 threads.
Processed 2000 events in 4.311149e+00 seconds, throughput 463.913 events/s, CPU usage per thread: 73.1%

fwyzard commented 2 years ago

@makortel this PR has grown to be quite large... let me know if you would rather have it split into smaller ones.

fwyzard commented 2 years ago

By the way, I've tested that kokkostest builds and runs, but I could not get kokkos to build, as it would get stuck while compiling or linking some of the tests.

makortel commented 2 years ago

By the way, I've tested that kokkostest builds and runs, but I could not get kokkos to build, as it would get stuck while compiling or linking some of the tests.

The kokkos program taking outrageously long to compile for HIP is a known problem, see https://github.com/cms-patatrack/pixeltrack-standalone/issues/178#issuecomment-974859175 and the following discussion (it has been reported to Kokkos, link at the bottom of the issue).

fwyzard commented 2 years ago

OK, then I won't worry about it.

makortel commented 2 years ago

this PR has grown to be quite large... let me know if you would rather have it split into smaller ones.

Looking at the commits I think splitting this PR into three could be worth it

fwyzard commented 2 years ago

OK, I've split it into

This PR needs to be merged after #347.

fwyzard commented 2 years ago

Rebased, and squashed the clang-format changes.

makortel commented 2 years ago

Here is a comparison on V100 (1 set of 1-minute jobs): [plot: alpaka_cuda_throughput]

I'm running another test with longer jobs and repetitions.

makortel commented 2 years ago

Here is a comparison on V100 with 4 2-minute jobs: [plot: alpaka_cuda_throughput]

The results agree within the statistical uncertainty (a few %).

fwyzard commented 2 years ago

No differences on a T4, either: [two images]