cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Follow up to the Alpaka Implementation of the ECAL local reconstruction #43551

Open fwyzard opened 11 months ago

fwyzard commented 11 months ago

This issue will list possible improvements to the Alpaka implementation of ECAL unpacking and local reconstruction, implemented in https://github.com/cms-sw/cmssw/pull/43257.


Move InputDataHost from ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw to ecal::raw namespace

In principle, InputDataHost could be moved out of ALPAKA_ACCELERATOR_NAMESPACE into namespace ecal::raw, changing the constructor to

    template <typename TQueue>
    explicit InputDataHost(const TQueue& queue, size_t size, size_t nFeds)

to make it work with any kind of queue.

Afterwards it may be useful to add a type alias inside the ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw namespace.
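
A minimal sketch of such an alias, assuming InputDataHost has been moved to ecal::raw with the queue-templated constructor shown above:

    namespace ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw {
      // keep the existing spelling working, while the generic type lives in ::ecal::raw
      using InputDataHost = ::ecal::raw::InputDataHost;
    }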


Improve the Kernel_unpack kernel to work with an arbitrary number of blocks

The ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw::Kernel_unpack kernel assumes that it is launched with one block per FED. To open up more optimisation possibilities, it should be extended to support running with an arbitrary number of blocks (e.g. using for (auto ifed : blocks_with_stride(acc, nfedsWithData))), as sketched below.
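
A minimal sketch of the outer loop, assuming the cms::alpakatools::blocks_with_stride helper from HeterogeneousCore/AlpakaInterface; the argument names are illustrative:

    // let each block pick up FEDs with a grid-sized stride,
    // so the kernel no longer requires exactly one block per FED
    for (auto ifed : cms::alpakatools::blocks_with_stride(acc, nfedsWithData)) {
      // ... unpack the data of FED number ifed, as the current per-block body does ...
    }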


Other improvements to Kernel_unpack


Other potential issues with Kernel_unpack

added on 2024/01/09

There are two potential issues with this kernel:

However, after re-reading the implementation, I think it is correct, and these issues affect at most the potential for optimisations.

The first issue is not problematic, as long as the call to Kernel_unpack (on lines 428..439) uses the correct number of blocks. A more flexible approach would be to pass the number of FEDs as an additional argument and add an outer loop, so that a smaller number of blocks could process more FEDs. However, this is only a potential optimisation that would allow tuning the number of blocks, and it can be left to a later time, if this kernel is ever identified as slow.

Regarding the second issue, the current approach works because each "block" already performs an internal loop over the channels. On the CPU the kernel launch does set the number of elements per thread to 32, but this value is ignored. Taking it into account would result in two nested loops that effectively process the data in the same way.


Improve the block shared memory allocation approach

In

the size, allocation and use of block shared memory are based on non-trivial pointer arithmetic.

This is very similar to what is done inside the constructor of a SoA to split a single memory block into the various scalars and columns.

It would be "safer" to reuse the same mechanism, for example defining a SoA with GENERATE_SOA_LAYOUT, and requesting a small (e.g. 4 or 8 bytes) memory alignment.


Reuse block-definition constants in host and device code

In RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h we have

class Kernel_time_compute_makeratio {
  ...
  public:
    template <typename TAcc, typename = std::enable_if_t<alpaka::isAccelerator<TAcc>>>
    ALPAKA_FN_ACC void operator()(TAcc const& acc, ...) {

      // constants
      constexpr uint32_t nchannels_per_block = 10;
      constexpr auto nthreads_per_channel = nchannels_per_block * (nchannels_per_block - 1) / 2;

In RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalUncalibRecHitMultiFitAlgoPortable.dev.cc we have the same constants:

      constexpr uint32_t nchannels_per_block_makeratio = 10;
      constexpr auto nthreads_per_channel = nchannels_per_block_makeratio * (nchannels_per_block_makeratio - 1) / 2;  // n(n-1)/2

It would be better to reuse the same definition across host and device code. A possibility could be to move their definition to be static constexpr member variables of Kernel_time_compute_makeratio.
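
A minimal sketch of that refactoring, assuming the surrounding code stays as shown above (the ellipses stand for the existing arguments and body):

    class Kernel_time_compute_makeratio {
    public:
      // shared between the device code and the host-side work division
      static constexpr uint32_t nchannels_per_block = 10;
      static constexpr uint32_t nthreads_per_channel =
          nchannels_per_block * (nchannels_per_block - 1) / 2;  // n(n-1)/2

      template <typename TAcc, typename = std::enable_if_t<alpaka::isAccelerator<TAcc>>>
      ALPAKA_FN_ACC void operator()(TAcc const& acc, ...) const {
        // use nchannels_per_block and nthreads_per_channel directly here
      }
    };

    // host side (e.g. EcalUncalibRecHitMultiFitAlgoPortable.dev.cc):
    //   auto const threads = Kernel_time_compute_makeratio::nthreads_per_channel * ...;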


Improve all kernels to use a configurable number of blocks

The kernels


Evaluate the impact of using float instead of double for the intermediate computations

Many places in the minimisation steps use double precisions (double) numbers instead of single precision (float) ones. Since many GPUs have a much smaller performance using double instead of float, it would be interesting to understand what is the impact of using float instead of double in those computations.


Replace the many buffers in RecoLocalCalo/EcalRecProducers/plugins/alpaka/DeclsForKernels.h with a single SoA

RecoLocalCalo/EcalRecProducers/plugins/alpaka/DeclsForKernels.h declares 25 different device memory buffers. Replacing them with one or two SoA data structures should reduce the number of operations involved in the memory allocations and copies.
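
A possible direction, sketched under the assumption that the portable SoA and PortableCollection machinery used elsewhere in CMSSW fits here; the column names are illustrative placeholders for the existing buffers:

    #include "DataFormats/SoATemplate/interface/SoALayout.h"
    #include "DataFormats/Portable/interface/PortableCollection.h"

    // illustrative layout grouping several of the per-channel scratch buffers
    GENERATE_SOA_LAYOUT(EcalMultifitScratchLayoutTemplate,
                        SOA_COLUMN(float, amplitudes),
                        SOA_COLUMN(float, chi2),
                        SOA_COLUMN(uint32_t, flags))

    using EcalMultifitScratchLayout = EcalMultifitScratchLayoutTemplate<>;

    // a single device allocation replaces many individual buffers, e.g.
    //   PortableCollection<EcalMultifitScratchLayout> scratch(nChannels, queue);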


Replace assert() with ALPAKA_ASSERT_ACC() in device code

assert() is expensive in device code, so ALPAKA_ASSERT_ACC() should be used instead: it is selectively disabled when compiling for CUDA and ROCm GPUs, and enabled when compiling for CPU.
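
For example (the variable names are illustrative):

    // checked on the CPU backend, disabled in the CUDA and ROCm builds as described above
    ALPAKA_ASSERT_ACC(ch < nchannels);   // instead of: assert(ch < nchannels);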


Replace reinterpret_cast<Matrix*>(ptr) with Eigen::Map<Matrix>(ptr)

Various parts of the device code (e.g. in RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalUncalibRecHitMultiFitAlgoPortable.dev.cc) use reinterpret_cast<Matrix*>(ptr) to access a pointer to a buffer of float as an Eigen matrix. This works because a matrix with compile-time dimensions is stored as a

struct Matrix {
  T data[Rows*Cols];
};

Eigen offers a less implementation-dependent way to interpret a buffer of float as a matrix: Eigen::Map, https://eigen.tuxfamily.org/dox/group__TutorialMapClass.html.
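
A minimal sketch of that approach; the 10x10 dimension and the function name are illustrative:

    #include <Eigen/Dense>

    using SampleMatrix = Eigen::Matrix<float, 10, 10>;   // fixed-size matrix type

    void useBuffer(float* ptr) {
      // no copy: "matrix" is a view over the existing buffer, usable like a regular Eigen matrix
      Eigen::Map<SampleMatrix> matrix(ptr);
      matrix(0, 0) = 1.f;
    }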

cmsbuild commented 11 months ago

A new Issue was created by @fwyzard Andrea Bocci.

@sextonkennedy, @Dr15Jones, @antoniovilela, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fwyzard commented 11 months ago

assign heterogeneous

fwyzard commented 11 months ago

type ecal

cmsbuild commented 11 months ago

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks