Open fwyzard opened 11 months ago
A new Issue was created by @fwyzard Andrea Bocci.
@sextonkennedy, @Dr15Jones, @antoniovilela, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign heterogeneous
type ecal
New categories assigned: heterogeneous
@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
This issue will list possible improvements to the Alpaka implementation of ECAL unpacking and local reconstruction, implemented in https://github.com/cms-sw/cmssw/pull/43257.
- Move `InputDataHost` from `ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw` to `ecal::raw` namespace
- Improve the `Kernel_unpack` kernel to work with an arbitrary number of blocks
- Other improvements to `Kernel_unpack`
- Other potential issues with `Kernel_unpack`
- Rename `ecal::multifit::entryPoint()` to be more descriptive
- Evaluate the impact of using `float` instead of `double` for the intermediate computations
- Replace the many buffers in `RecoLocalCalo/EcalRecProducers/plugins/alpaka/DeclsForKernels.h` with a single SoA
- Replace `assert()` with `ALPAKA_ASSERT_ACC()` in device code
- Replace `reinterpret_cast<Matrix*>(ptr)` with `Eigen::Map<Matrix>(ptr)`
Move `InputDataHost` from `ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw` to `ecal::raw` namespace

In principle, `InputDataHost` could be moved out of `ALPAKA_ACCELERATOR_NAMESPACE` into `namespace ecal::raw`, changing the constructor to make it work with any kind of queue.
Afterwards it may be useful to add a type alias inside the `ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw` namespace.

Improve the `Kernel_unpack` kernel to work with an arbitrary number of blocks

The `ALPAKA_ACCELERATOR_NAMESPACE::ecal::raw::Kernel_unpack` kernel assumes that it will be called with one block per FED. To leave more room for optimisation, it should support running with an arbitrary number of blocks (e.g. using `for (auto ifed: blocks_with_stride(acc, nfedsWithData))`).

Other improvements to `Kernel_unpack`
- `offsets` and pass `nbytesTotal` as `offsets[nfedsWithData]`
- `buffer`?
- Replace `auto` with the concrete type (e.g. `uint64_t`); the compiler knows, but it takes forever to understand the code
- `iCh` and `ich` in the same function!
- `std::memcpy` in a kernel?

Other potential issues with `Kernel_unpack`
added on 2024/01/09
There are two potential issues with this kernel:
However, after re-reading the implementation, I think the implementation is correct, and these issues affect at most the potential for optimisations.
The first issue is not problematic, as long as the call to `Kernel_unpack` (on lines 428..439) uses the correct number of blocks. A more flexible approach would be to pass the number of FEDs as an additional argument, and add an outer loop so that a smaller number of blocks could process more FEDs. However, this is only a potential optimisation that would allow tuning the number of blocks, and can be left to a later time, if this kernel is ever identified as slow.

Regarding the second issue, the current approach works because each "block" runs an internal loop over the channels anyway. On the CPU the kernel launch does set the number of elements per thread to 32, but this is ignored. Taking it into account would result in two nested loops, but would effectively process the data in the same way.
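The outer loop mentioned above can be illustrated with a plain C++ emulation of a block-strided loop. This is a minimal host-side sketch, not alpaka code: `process_feds_strided`, `launch`, `blockIdx` and `gridBlocks` are hypothetical names, standing in for what a loop over `blocks_with_stride(acc, nfedsWithData)` is assumed to do, and the "kernel" body only counts how often each FED is visited.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-block body of a kernel.
// Block `blockIdx` out of `gridBlocks` handles the FEDs
// blockIdx, blockIdx + gridBlocks, blockIdx + 2 * gridBlocks, ...
void process_feds_strided(std::size_t blockIdx, std::size_t gridBlocks,
                          std::size_t nfedsWithData, std::vector<int>& visits) {
  assert(gridBlocks > 0);  // a zero stride would never terminate
  for (std::size_t ifed = blockIdx; ifed < nfedsWithData; ifed += gridBlocks) {
    ++visits[ifed];  // the real kernel would unpack FED `ifed` here
  }
}

// Emulate a kernel launch by running all blocks one after the other.
std::vector<int> launch(std::size_t gridBlocks, std::size_t nfedsWithData) {
  std::vector<int> visits(nfedsWithData, 0);
  for (std::size_t b = 0; b < gridBlocks; ++b) {
    process_feds_strided(b, gridBlocks, nfedsWithData, visits);
  }
  return visits;
}
```

With this structure every FED is processed exactly once whether the launch uses fewer blocks than FEDs, exactly one block per FED, or more blocks than FEDs, which is what would make the number of blocks a free tuning parameter.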
Improve the block shared memory allocation approach
In

- `Kernel_prep_1d_and_initialize`
- `Kernel_minimize`
- `Kernel_time_compute_nullhypot`
- `Kernel_time_compute_makeratio`
- `Kernel_time_compute_findamplchi2_and_finish`
- `Kernel_time_computation_init`

the size, allocation and use of block shared memory is based on non-trivial pointer arithmetic.
This is very similar to what is done inside the constructor of a SoA to split a single memory block into the various scalars and columns.
It would be "safer" to reuse the same mechanism, for example defining a SoA with `GENERATE_SOA_LAYOUT`, and requesting a small (e.g. 4 or 8 bytes) memory alignment.

Reuse block-definition constants in host and device code
`RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h` defines a set of block-definition constants, and `RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalUncalibRecHitMultiFitAlgoPortable.dev.cc` defines the same constants again.
It would be better to reuse the same definition across host and device code. One possibility is to move their definition to `static constexpr` member variables of `Kernel_time_compute_makeratio`.

Improve all kernels to use a configurable number of blocks
The kernels

- `Kernel_prep_1d_and_initialize`
- `Kernel_prep_2d`
- `Kernel_minimize`
- `Kernel_time_computation_init`
- `Kernel_time_compute_fixMGPAslew`
- `Kernel_time_compute_nullhypot`
- `Kernel_time_compute_makeratio`
- `Kernel_time_compute_findamplchi2_and_finish`
- `Kernel_time_correction_and_finalize`

are launched with enough blocks to cover the whole problem space. Those that use `elements_with_stride` and `blocks_with_stride` should already support running with an arbitrary number of blocks. It should be checked that this is really the case. The others, if any, should be improved to support running with an arbitrary number of blocks. Finally, the number of blocks should be optimised.

Evaluate the impact of using `float` instead of `double` for the intermediate computations

Many places in the minimisation steps use double precision (`double`) numbers instead of single precision (`float`) ones. Since many GPUs have much lower performance with `double` than with `float`, it would be interesting to understand the impact of using `float` instead of `double` in those computations.

Replace the many buffers in `RecoLocalCalo/EcalRecProducers/plugins/alpaka/DeclsForKernels.h` with a single SoA

`RecoLocalCalo/EcalRecProducers/plugins/alpaka/DeclsForKernels.h` declares 25 different device memory buffers. Replacing them with one or two SoA data structures should reduce the number of operations involved in the memory allocation and copies.

Replace `assert()` with `ALPAKA_ASSERT_ACC()` in device code

`assert()` is expensive in device code, so we use `ALPAKA_ASSERT_ACC()` to selectively disable it when compiling for CUDA and ROCm GPUs, and enable it when compiling for CPU.

Replace `reinterpret_cast<Matrix*>(ptr)` with `Eigen::Map<Matrix>(ptr)`
Various parts of the device code (e.g. in `RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalUncalibRecHitMultiFitAlgoPortable.dev.cc`) use `reinterpret_cast<Matrix*>(ptr)` to access a pointer to a buffer of `float` as an Eigen matrix. This works because a matrix with compile-time dimensions is stored as a plain array of its coefficients.

Eigen offers a less implementation-dependent approach to interpret a buffer of `float` as a matrix: `Eigen::Map`, https://eigen.tuxfamily.org/dox/group__TutorialMapClass.html.
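To make the layout assumption explicit without depending on Eigen, the sketch below uses a hypothetical `MatrixView` type (not part of Eigen) that wraps a `float*` and indexes it in column-major order, which is Eigen's default storage order. It only illustrates the idea of a non-owning view over an existing buffer; in the actual code `Eigen::Map` would play this role, and `fill_example` is likewise an illustrative name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning view, a stand-in for illustration; not Eigen::Map itself.
template <std::size_t Rows, std::size_t Cols>
struct MatrixView {
  float* data;  // borrowed buffer holding at least Rows * Cols floats

  // Column-major indexing: element (r, c) lives at data[c * Rows + r].
  float& operator()(std::size_t r, std::size_t c) const {
    assert(r < Rows && c < Cols);
    return data[c * Rows + r];
  }
};

// Fill a 2x3 view over `buffer` so that element (r, c) holds 10 * r + c.
inline void fill_example(float* buffer) {
  MatrixView<2, 3> m{buffer};
  for (std::size_t c = 0; c < 3; ++c) {
    for (std::size_t r = 0; r < 2; ++r) {
      m(r, c) = static_cast<float>(10 * r + c);
    }
  }
}
```

With Eigen itself the equivalent view would be constructed along the lines of `Eigen::Map<Matrix> m(ptr);` for a fixed-size `Matrix` type, after which `m` can be used like a regular matrix without any cast.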