[alpaka] Support all alpaka backends at the same time

fwyzard commented 2 years ago

Let alpaka and alpakatest support serial, TBB, CUDA and ROCm at the same time.

Each backend can take an optional weight, and if more than one backend is specified the number of streams will be split among the backends roughly according to their weights:

./alpaka --maxEvents 10000 --numberOfStreams 16 --numberOfThreads 8 --serial 0.2 --cuda 0.5 --hip 0.3 --validation
Found 1 device:
  - AMD Ryzen 9 5900X 12-Core Processor            
Found 1 device:
  - NVIDIA GeForce GTX 1080 Ti
Found 1 device:
  - Radeon Pro WX 9100
Processing 10000 events, with 16 concurrent events (5 on rocm_async, 3 on serial_sync, 8 on cuda_async) and 8 threads.
CountValidator: all 2997 events passed validation
 Average relative track difference 0.000920349 (all within tolerance)
Processed 10000 events in 9.212266e+00 seconds, throughput 1085.51 events/s, CPU usage per thread: 93.8%

Note that the CountValidator prints only a partial count because endJob() is only called for "stream 0", and so only one of the different instantiations of ALPAKA_ACCELERATOR_NAMESPACE::CountValidator gets called.

Static splitting of event streams across multiple backends: each event stream is associated with a single backend, and the number of streams per backend follows the optional weights specified on the command line.
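As an illustration of the weighted split described above, here is a minimal sketch that divides a number of streams among backends with the largest-remainder method. The function name `splitStreams` and the rounding strategy are assumptions made for this example, not the code from this PR; it merely reproduces the 3 / 8 / 5 split shown in the log for weights 0.2 / 0.5 / 0.3 and 16 streams.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not taken from the PR): split `streams` among the
// backends in proportion to `weights`, using the largest-remainder method.
std::vector<int> splitStreams(int streams, std::vector<double> const& weights) {
  assert(streams >= 0);
  double const total = std::accumulate(weights.begin(), weights.end(), 0.0);
  assert(total > 0.);

  std::vector<int> counts(weights.size());
  std::vector<double> remainders(weights.size());
  int assigned = 0;
  for (std::size_t i = 0; i < weights.size(); ++i) {
    // ideal (fractional) share of this backend
    double const share = streams * weights[i] / total;
    counts[i] = static_cast<int>(std::floor(share));
    remainders[i] = share - counts[i];
    assigned += counts[i];
  }

  // hand the leftover streams to the backends with the largest fractional parts
  std::vector<std::size_t> order(weights.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return remainders[a] > remainders[b]; });
  for (std::size_t k = 0; assigned < streams; ++k, ++assigned) {
    ++counts[order[k % order.size()]];
  }
  return counts;
}
```

Calling `splitStreams(16, {0.2, 0.5, 0.3})` with the weights in the order serial, cuda, hip returns `{3, 8, 5}`. The weights do not need to add up to 1, since they are normalised by their sum.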

Split the compilation by backend:

- the ALPAKA_..._ENABLED macros are only defined one at a time;
- introduce new ALPAKA_..._PRESENT macros to identify all backends for which support is being compiled;
- link the backend-specific libraries to the backend-specific tests;
- link all libraries to the main executable, including the "portable", backend-specific ones.
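The distinction between the two macro families can be sketched as follows. The `EXAMPLE_..._ENABLED` and `EXAMPLE_..._PRESENT` names are hypothetical stand-ins chosen for this example (the actual macro names are not spelled out above), and the `#define` lines simulate what the build system would presumably pass with `-D` flags.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-ins for compiler -D flags (hypothetical names, for illustration only):
// support for two backends is being compiled ...
#define EXAMPLE_SERIAL_PRESENT
#define EXAMPLE_CUDA_PRESENT
// ... but this translation unit is built for exactly one of them.
#define EXAMPLE_CUDA_ENABLED

#if defined(EXAMPLE_SERIAL_ENABLED) && defined(EXAMPLE_CUDA_ENABLED)
#error "only one EXAMPLE_..._ENABLED macro may be defined per translation unit"
#endif

namespace example {
  // Backend-specific code: selected by the single ..._ENABLED macro.
  inline std::string enabledBackend() {
#if defined(EXAMPLE_SERIAL_ENABLED)
    return "serial";
#elif defined(EXAMPLE_CUDA_ENABLED)
    return "cuda";
#else
    return "none";
#endif
  }

  // "Portable" code: may enumerate every backend flagged as ..._PRESENT.
  inline std::vector<std::string> presentBackends() {
    std::vector<std::string> backends;
#ifdef EXAMPLE_SERIAL_PRESENT
    backends.push_back("serial");
#endif
#ifdef EXAMPLE_CUDA_PRESENT
    backends.push_back("cuda");
#endif
    return backends;
  }
}  // namespace example
```

With the flags above, `example::enabledBackend()` returns `"cuda"` while `example::presentBackends()` lists both `"serial"` and `"cuda"`.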

Add forward declarations for alpaka templates and types (thanks to Matti for the idea). Add explicit instantiation definitions and declarations to the initialisation code, and move it to the AlpakaCore "portable" library. Use the new pinned host memory functionality, introduced in the latest alpaka update.
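For readers less familiar with the C++ mechanisms mentioned above, here is a minimal sketch of a forward declaration combined with explicit instantiation declarations and definitions. All names (`Initialiser`, `SerialBackend`, `GpuBackend`) are hypothetical and do not correspond to alpaka or AlpakaCore types; in a real project the three parts would be split across headers and a single source file, as noted in the comments.

```cpp
#include <cassert>
#include <string>

namespace example {
  // Forward declaration: enough for headers that only refer to the type
  // by pointer or reference, without pulling in the full definition.
  template <typename TBackend>
  class Initialiser;

  // Hypothetical backend tags.
  struct SerialBackend {
    static char const* name() { return "serial_sync"; }
  };
  struct GpuBackend {
    static char const* name() { return "gpu_async"; }
  };

  // Full template definition (would normally live in its own header).
  template <typename TBackend>
  class Initialiser {
  public:
    std::string describe() const { return std::string("initialised ") + TBackend::name(); }
  };

  // Explicit instantiation declarations (header): tell every other
  // translation unit not to instantiate these specialisations implicitly.
  extern template class Initialiser<SerialBackend>;
  extern template class Initialiser<GpuBackend>;

  // Explicit instantiation definitions (exactly one source file): the code
  // for these specialisations is generated here and found at link time.
  template class Initialiser<SerialBackend>;
  template class Initialiser<GpuBackend>;
}  // namespace example
```

The benefit of this layout is that the template code is compiled once, in the library that holds the explicit instantiation definitions, instead of in every translation unit that uses it.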

Update alpaka to the fwyzard/develop private branch, pending integration upstream. Relevant changes include:

- implement separate types for the CUDA and HIP/ROCm backends;
- add a new API to allocate pinned host memory (pending upstream);
- add ALPAKA_DEFAULT_HOST_MEMORY_ALIGNMENT macro (pending).

Autogenerate plugins.txt from the content of the plugins' shared libraries.

fwyzard commented 2 years ago

~~The cleanup could be split to a separate PR.~~ Done.

~~The src/alpaka vs src/alpakatest changes should be split to a separate PR.~~ Both src/alpaka and src/alpakatest are fully supported in this PR.

~~The alpaka library changes should be merged upstream.~~ All changes have been merged upstream (post 0.9.0).

fwyzard commented 2 years ago

@waredjeb @tonydp03 FYI

fwyzard commented 2 years ago

Now supports both alpaka and alpakatest.

The src/.../Makefile changes should now be more robust, and all tests do build.

The only pending issue AFAICT is the integration upstream of https://github.com/alpaka-group/alpaka/pull/1685 .

fwyzard commented 2 years ago

As a side note, currently specifying more than one device (e.g. alpaka --serial --cuda) runs both sets of modules (the CPU-serial ones and the CUDA ones) on each event.

I think it would be useful to let the framework pick a single different "device" for each event. For example in round robin, or in round robin with a different number of slots per device type, etc.

makortel commented 2 years ago

> I think it would be useful to let the framework pick a single different "device" for each event. For example in round robin, or in round robin with a different number of slots per device type, etc.

I agree this could be an interesting mode of operation to try out at some point. Maybe open an issue about it? I suspect it won't be straightforward, so figuring out a reasonable approach in the mock framework could take some time (and this is something that could be looked at after the Alpaka integration into CMSSW has been finished, for the first round at least, right?).

fwyzard commented 2 years ago

I also thought it would be complicated. Then I slept on it (not much), had an idea for a simple implementation this morning, and it turned out that a static assignment of the "event streams" to different backends is not too bad: https://github.com/fwyzard/pixeltrack-standalone/commit/a40e22221f663c54e61a9ee9384c2ed4b3cc2420 .

Each backend can now take an optional weight, and if more than one backend is specified the number of streams will be split among the backends roughly according to their weights:

./alpaka --maxEvents 10000 --numberOfStreams 16 --numberOfThreads 8 --serial 0.2 --cuda 0.5 --hip 0.3 --validation
Found 1 device:
  - AMD Ryzen 9 5900X 12-Core Processor            
Found 1 device:
  - NVIDIA GeForce GTX 1080 Ti
Found 1 device:
  - Radeon Pro WX 9100
Processing 10000 events, with 16 concurrent events (5 on rocm_async, 3 on serial_sync, 8 on cuda_async) and 8 threads.
CountValidator: all 2997 events passed validation
 Average relative track difference 0.000920349 (all within tolerance)
Processed 10000 events in 9.212266e+00 seconds, throughput 1085.51 events/s, CPU usage per thread: 93.8%

(looks like the CountValidator may need some update)

fwyzard commented 2 years ago

> (looks like the CountValidator may need some update)

Actually the module itself is fine - the reason is that endJob() is only called for "stream 0", and so only one of the different instantiations gets called.
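The effect can be reproduced with a small sketch. It assumes, for illustration only, that each backend-specific instantiation keeps its own counter, that four streams are split evenly between two backends, and that events are handed out round-robin; the names below are hypothetical and this is not the actual CountValidator code.

```cpp
#include <cassert>

namespace example {
  // Hypothetical backend tags.
  struct SerialTag {};
  struct CudaTag {};

  // Each instantiation of the template owns a separate static counter,
  // so events validated on one backend are invisible to the other.
  template <typename TBackend>
  struct CountValidator {
    static int& passed() {
      static int count = 0;
      return count;
    }
    void analyze() { ++passed(); }
    int endJob() const { return passed(); }
  };

  // Simulate 4 streams: streams 0 and 1 run the serial modules, streams 2
  // and 3 the CUDA ones. Events are distributed round-robin (an assumption).
  inline int reportedByStreamZero(int events) {
    CountValidator<SerialTag>::passed() = 0;
    CountValidator<CudaTag>::passed() = 0;
    CountValidator<SerialTag> serialValidator;
    CountValidator<CudaTag> cudaValidator;
    for (int event = 0; event < events; ++event) {
      int const stream = event % 4;
      if (stream < 2) {
        serialValidator.analyze();
      } else {
        cudaValidator.analyze();
      }
    }
    // endJob() is called for stream 0 only, which runs the serial
    // instantiation, so only the serial counter is reported.
    return serialValidator.endJob();
  }
}  // namespace example
```

In this toy setup `example::reportedByStreamZero(100)` returns 50: all 100 events were validated, but the summary only covers the half processed by the instantiation attached to stream 0.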

cms-patatrack / pixeltrack-standalone

[alpaka] Support all alpaka backends at the same time #357