cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Follow up to the Alpaka Implementation of the ECAL local reconstruction #43551

Open fwyzard opened 11 months ago

fwyzard commented 11 months ago

This issue will list possible improvements to the Alpaka implementation of ECAL unpacking and local reconstruction, implemented in https://github.com/cms-sw/cmssw/pull/43257.


Move InputDataHost from ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw to ecal::raw namespace

In principle, InputDataHost could be moved out of ALPAKA_ACCELERATOR_NAMESPACE into namespace ecal::raw, changing the constructor to

    template <typename TQueue>
    explicit InputDataHost(const TQueue& queue, size_t size, size_t nFeds)

to make it work with any kind of queue.

Afterwards it may be useful to add a type alias inside the ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw namespace.
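
A minimal sketch of such an alias, assuming InputDataHost has been moved to ecal::raw with the queue-templated constructor shown above:

    namespace ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw {
      // keep the existing spelling working, while the generic type lives in ::ecal::raw
      using InputDataHost = ::ecal::raw::InputDataHost;
    }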


Improve the Kernel_unpack kernel to work with an arbitrary number of blocks

The ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw::Kernel_unpack kernel assumes that it is launched with one block per FED. To open up more optimisation possibilities, it should be extended to support running with an arbitrary number of blocks (e.g. using for (auto ifed : blocks_with_stride(acc, nfedsWithData))), as sketched below.
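
A minimal sketch of the outer loop, assuming the cms::alpakatools::blocks_with_stride helper from HeterogeneousCore/AlpakaInterface; the argument names are illustrative:

    // let each block pick up FEDs with a grid-sized stride,
    // so the kernel no longer requires exactly one block per FED
    for (auto ifed : cms::alpakatools::blocks_with_stride(acc, nfedsWithData)) {
      // ... unpack the data of FED number ifed, as the current per-block body does ...
    }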


Other improvements to Kernel_unpack


Other potential issues with Kernel_unpack

added on 2024/01/09

There are two potential issues with this kernel:

However, after re-reading the implementation, I think it is correct, and these issues affect at most the potential for optimisations.

The first issue is not problematic, as long as the call to Kernel_unpack (on lines 428..439) uses the correct number of blocks. A more flexible approach would be to pass the number of FEDs as an additional argument and add an outer loop, so that a smaller number of blocks could process more FEDs. However, this is only a potential optimisation that would allow tuning the number of blocks, and it can be left to a later time, if this kernel is ever identified as slow.

Regarding the second issue, the current approach works because each "block" already performs an internal loop over the channels. On the CPU the kernel launch does set the number of elements per thread to 32, but this value is ignored. Taking it into account would result in two nested loops that effectively process the data in the same way.


Improve the block shared memory allocation approach

In

the size, allocation and use of block shared memory are based on non-trivial pointer arithmetic.

This is very similar to what is done inside the constructor of a SoA to split a single memory block into the various scalars and columns.

It would be "safer" to reuse the same mechanism, for example defining a SoA with GENERATE_SOA_LAYOUT, and requesting a small (e.g. 4 or 8 bytes) memory alignment.


Reuse block-definition constants in host and device code

In RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h we have

class Kernel_time_compute_makeratio {
  ...
  public:
    template <typename TAcc, typename = std::enable_if_t<alpaka::isAccelerator<TAcc>>>
    ALPAKA_FN_ACC void operator()(TAcc const& acc, ...) {

      // constants
      constexpr uint32_t nchannels_per_block = 10;
      constexpr auto nthreads_per_channel = nchannels_per_block * (nchannels_per_block - 1) / 2;

In RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalUncalibRecHitMultiFitAlgoPortable.dev.cc we have the same constants:

      constexpr uint32_t nchannels_per_block_makeratio = 10;
      constexpr auto nthreads_per_channel = nchannels_per_block_makeratio * (nchannels_per_block_makeratio - 1) / 2;  // n(n-1)/2

It would be better to reuse the same definition across host and device code. A possibility could be to move their definition to be static constexpr member variables of Kernel_time_compute_makeratio.
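
A minimal sketch of that refactoring, assuming the surrounding code stays as shown above (the ellipses stand for the existing arguments and body):

    class Kernel_time_compute_makeratio {
    public:
      // shared between the device code and the host-side work division
      static constexpr uint32_t nchannels_per_block = 10;
      static constexpr uint32_t nthreads_per_channel =
          nchannels_per_block * (nchannels_per_block - 1) / 2;  // n(n-1)/2

      template <typename TAcc, typename = std::enable_if_t<alpaka::isAccelerator<TAcc>>>
      ALPAKA_FN_ACC void operator()(TAcc const& acc, ...) const {
        // use nchannels_per_block and nthreads_per_channel directly here
      }
    };

    // host side (e.g. EcalUncalibRecHitMultiFitAlgoPortable.dev.cc):
    //   auto const threads = Kernel_time_compute_makeratio::nthreads_per_channel * ...;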


Improve all kernels to use a configurable number of blocks

The kernels


Evaluate the impact of using float instead of double for the intermediate computations

Many places in the minimisation steps use double precisions (double) numbers instead of single precision (float) ones. Since many GPUs have a much smaller performance using double instead of float, it would be interesting to understand what is the impact of using float instead of double in those computations.


Replace the many buffers in RecoLocalCalo/EcalRecProducers/plugins/alpaka/DeclsForKernels.h with a single SoA

RecoLocalCalo/EcalRecProducers/plugins/alpaka/DeclsForKernels.h declares 25 different device memory buffers. Replacing them with one or two SoA data structures should reduce the number of operations involved in the memory allocations and copies.
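
A possible direction, sketched under the assumption that the portable SoA and PortableCollection machinery used elsewhere in CMSSW fits here; the column names are illustrative placeholders for the existing buffers:

    #include "DataFormats/SoATemplate/interface/SoALayout.h"
    #include "DataFormats/Portable/interface/PortableCollection.h"

    // illustrative layout grouping several of the per-channel scratch buffers
    GENERATE_SOA_LAYOUT(EcalMultifitScratchLayoutTemplate,
                        SOA_COLUMN(float, amplitudes),
                        SOA_COLUMN(float, chi2),
                        SOA_COLUMN(uint32_t, flags))

    using EcalMultifitScratchLayout = EcalMultifitScratchLayoutTemplate<>;

    // a single device allocation replaces many individual buffers, e.g.
    //   PortableCollection<EcalMultifitScratchLayout> scratch(nChannels, queue);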


Replace assert() with ALPAKA_ASSERT_ACC() in device code

assert() is expensive in device code, so ALPAKA_ASSERT_ACC() should be used instead: it is selectively disabled when compiling for CUDA and ROCm GPUs, and enabled when compiling for CPU.
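
For example (the variable names are illustrative):

    // checked on the CPU backend, disabled in the CUDA and ROCm builds as described above
    ALPAKA_ASSERT_ACC(ch < nchannels);   // instead of: assert(ch < nchannels);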


Replace reinterpret_cast<Matrix*>(ptr) with Eigen::Map<Matrix>(ptr)

Various parts of the device code (e.g. in RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalUncalibRecHitMultiFitAlgoPortable.dev.cc) use reinterpret_cast<Matrix*>(ptr) to access a pointer to a buffer of float as an Eigen matrix. This works because a matrix with compile-time dimensions is stored as a

struct Matrix {
  T data[Rows*Cols];
};

Eigen offers a less implementation-dependent way to interpret a buffer of float as a matrix: Eigen::Map, https://eigen.tuxfamily.org/dox/group__TutorialMapClass.html.
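
A minimal sketch of that approach; the 10x10 dimension and the function name are illustrative:

    #include <Eigen/Dense>

    using SampleMatrix = Eigen::Matrix<float, 10, 10>;   // fixed-size matrix type

    void useBuffer(float* ptr) {
      // no copy: "matrix" is a view over the existing buffer, usable like a regular Eigen matrix
      Eigen::Map<SampleMatrix> matrix(ptr);
      matrix(0, 0) = 1.f;
    }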

cmsbuild commented 11 months ago

A new Issue was created by @fwyzard Andrea Bocci.

@sextonkennedy, @Dr15Jones, @antoniovilela, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fwyzard commented 11 months ago

assign heterogeneous

fwyzard commented 11 months ago

type ecal

cmsbuild commented 11 months ago

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks