Closed fwyzard closed 2 years ago
$ ./hip --numberOfThreads 20 --numberOfStreams 20 --maxEvents 2000 --validation
Found 1 devices
Processing 2000 events, of which 20 concurrently, with 20 threads.
CountValidator: all 2000 events passed validation
Average relative track difference 0.000922907 (all within tolerance)
Average absolute vertex difference 0.0005 (all within tolerance)
Processed 2000 events in 6.039834e+00 seconds, throughput 331.135 events/s, CPU usage per thread: 31.2%
Now supports building either CUDA or ROCm/HIP:
$ make -j`nproc` alpaka ROCM_BASE= CUDA_BASE=/usr/local/cuda-11.5
...
$ source env.sh
$ ./alpaka --cuda --numberOfThreads 20 --numberOfStreams 20 --maxEvents 2000
Found 1 device:
- NVIDIA GeForce GTX 1080 Ti
Processing 2000 events, of which 20 concurrently, with 20 threads.
Processed 2000 events in 2.096939e+00 seconds, throughput 953.771 events/s, CPU usage per thread: 64.1%
$ make clean
rm -fR /data/user/fwyzard/pixeltrack-standalone/lib /data/user/fwyzard/pixeltrack-standalone/obj /data/user/fwyzard/pixeltrack-standalone/test alpaka alpakatest cuda cudacompat cudadev cudatest cudauvm fwtest hip hiptest kokkos kokkostest serial sycltest
$ rm env.sh
$ make -j`nproc` alpaka ROCM_BASE=/opt/rocm-5.0.2 CUDA_BASE=
...
$ source env.sh
$ ./alpaka --hip --numberOfThreads 20 --numberOfStreams 20 --maxEvents 2000
Found 1 device:
- Radeon Pro WX 9100
Processing 2000 events, of which 20 concurrently, with 20 threads.
Processed 2000 events in 4.311149e+00 seconds, throughput 463.913 events/s, CPU usage per thread: 73.1%
@makortel this PR has grown to be quite large... let me know if you would rather have it split into smaller ones.
By the way, I've tested that kokkostest builds and runs, but I could not get kokkos to build, as it would get stuck while compiling or linking some of the tests.
The compilation of the kokkos program taking outrageously long for HIP is a known problem; see https://github.com/cms-patatrack/pixeltrack-standalone/issues/178#issuecomment-974859175 and the following discussion (it has been reported to Kokkos; the link is at the bottom of the issue).
OK, then I won't worry about it.
this PR has grown to be quite large... let me know if you would rather have it split into smaller ones.
Looking at the commits I think splitting this PR into three could be worth it:
- Makefile, hip, and hiptest (first three commits)
- alpaka and alpakatest (last two commits)
OK, I've split it into
This PR needs to be merged after #347.
Rebased, and squashed the clang-format changes.
Here is a comparison on V100 (one set of 1-minute jobs)
I'm running another test with longer jobs and repetitions.
Here is a comparison on V100 with four 2-minute jobs
Within the statistical uncertainty (few %)
No differences on a T4, either:
Implement the changes for building the alpakatest and alpaka applications with support for either CUDA or HIP/ROCm.
Host code changes
Since I hope to be able to enable both CUDA and HIP/ROCm at some point in the future, I've decided to split the relevant Alpaka types now. Alpaka itself makes a mixed effort on this:
using aliases for the common type
I've added these last ones in src/alpaka/alpaka/alpakaExtra.hpp, with the intention of moving them into Alpaka itself sooner or later.
Replace the use of the UniformCudaHipRt types with the explicit CudaRt types
Add similar code paths and definitions for the HipRt types
I've duplicated all CUDA-specific code that was using either the alpaka_cuda_async namespace or the ALPAKA_ACC_GPU_CUDA_ENABLED macro with HIP/ROCm equivalent code, using the alpaka_rocm_async namespace and the ALPAKA_ACC_GPU_HIP_ENABLED macro.
Update the command line options in main.cc and the list of plugins
Update the code under the .../alpaka/... folders
Mostly, I've changed
to
There are also some unrelated changes due to clang complaining about implicitly-deleted default constructors, the inappropriate use of std::move, and some missing casts.
Device code changes
The two places with a lot of changes are src/alpaka/AlpakaCore/prefixScan.h and src/alpaka/AlpakaCore/radixSort.h: HIP does not support the masked warp instructions like __shfl_up_sync; it still has the pre-CUDA 9 versions like __shfl_up, so I've #ifdef'ed them. Eventually the whole code should be rewritten using the primitives provided by Alpaka, and benchmarked to make sure that does not introduce any regressions.
I've also added (unconditionally) the memory fence from cms-patatrack/pixeltrack-standalone#210; this too should be benchmarked to check the impact on the CUDA implementation.
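As a rough illustration of the #ifdef approach for the warp shuffles (a sketch only, not the actual code in prefixScan.h or radixSort.h), a thin wrapper could select the intrinsic based on the enabled backend. The names warpShflUp, shflUpEmulated and warpPrefixScanEmulated are hypothetical and exist only for this example; the host-side emulation is an assumption added so that the shuffle semantics and the scan pattern can be checked without a GPU.

```cpp
#include <array>
#include <cstddef>

#if defined(ALPAKA_ACC_GPU_CUDA_ENABLED)
// CUDA 9 and later: masked variant of the shuffle.
template <typename T>
__device__ inline T warpShflUp(unsigned mask, T value, unsigned delta) {
  return __shfl_up_sync(mask, value, delta);
}
#elif defined(ALPAKA_ACC_GPU_HIP_ENABLED)
// HIP/ROCm: only the unmasked, pre-CUDA 9 style intrinsic; the mask is ignored.
template <typename T>
__device__ inline T warpShflUp(unsigned /* mask */, T value, unsigned delta) {
  return __shfl_up(value, delta);
}
#endif

// Hypothetical host-side emulation of a shuffle-up across N lanes:
// lane i receives the value held by lane i - delta, and lanes with
// i < delta keep their own value, as the device intrinsics do.
template <typename T, std::size_t N>
std::array<T, N> shflUpEmulated(const std::array<T, N>& lanes, std::size_t delta) {
  std::array<T, N> out = lanes;
  for (std::size_t i = delta; i < N; ++i) {
    out[i] = lanes[i - delta];
  }
  return out;
}

// Inclusive prefix scan built on the emulated shuffle, following the usual
// warp-scan pattern: at each step, a lane adds the value shuffled up from
// `offset` lanes below it, and the offset doubles.
template <typename T, std::size_t N>
std::array<T, N> warpPrefixScanEmulated(std::array<T, N> lanes) {
  for (std::size_t offset = 1; offset < N; offset <<= 1) {
    const std::array<T, N> shifted = shflUpEmulated(lanes, offset);
    for (std::size_t i = offset; i < N; ++i) {
      lanes[i] += shifted[i];
    }
  }
  return lanes;
}
```

With neither macro defined, only the emulation is compiled. On a device build, each thread would call warpShflUp with its own value instead of indexing into an array of lanes.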