Open VinInn opened 5 months ago
cms-bot internal usage
A new Issue was created by @VinInn.
@smuzaffar, @antoniovilela, @rappoccio, @Dr15Jones, @makortel, @sextonkennedy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction, heterogeneous
New categories assigned: reconstruction,heterogeneous
@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
@cms-sw/ecal-dpg-l2 @thomreis
Looking at the code, the only allocation I see inside the Kernel_time_computation_init
kernel is the shared memory (which on the serial CPU back-end is just plain memory).
@VinInn, as a simple test, could you try adding something like
for (auto i: cms::alpakatools::uniform_elements(acc, 2*elemsPerBlock)) {
shrSampleValues[i] = 0;
}
alpaka::syncBlockThreads(acc);
around line 804, and check if that makes valgrind happy ?
@thomreis, could you check if that has any impact on the results ?
I can try but valgrind is saying
created by a stack allocation
not
heap
Alpaka allocates "shared memory" for the CPU backend as a static
data member of a class templated on the kernel type... so that may count as a stack allocation ?
most probably yes...
one may check @ TaskKernelCpuSerial.hpp:51
zeroing shrSampleValues does not help WAIT, maybe not recompiled....
I confirm that the valgrind issue is still there. I set the memory to "junk"
memset(shrSampleValues,0xa5,2*elemsPerBlock*sizeof(ScalarType));
alpaka::syncBlockThreads(acc);
at least it does not crash....
Alpaka allocates "shared memory" for the CPU backend as a
static
data member of a class templated on the kernel type
@fwyzard Could you give more details? I'm now wondering how a static
data member would work with running multiple instances of the kernel concurrently (naively I'd assume a data race). Poking around I found
https://github.com/alpaka-group/alpaka/blob/1.1.0/include/alpaka/block/shared/dyn/BlockSharedMemDynMember.hpp#L83
as the memory block for both static and dynamic shared memory (that is not class-static
), but I could have easily missed something.
hi @makortel you are right, it's a regular data member, not a static
one (sorry I mixed things up)
I actually overlooked all these other ones
=865737==
==865737== Conditional jump or move depends on uninitialised value(s)
==865737== at 0xBA252001: UnknownInlinedFun (MultifitComputations.h:488)
==865737== by 0xBA252001: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737== Uninitialised value was created by a heap allocation
==865737== at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737== by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737== at 0xBA25200A: UnknownInlinedFun (MultifitComputations.h:490)
==865737== by 0xBA25200A: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737== Uninitialised value was created by a heap allocation
==865737== at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737== by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737== at 0xBA2522B0: UnknownInlinedFun (MultifitComputations.h:428)
==865737== by 0xBA2522B0: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737== Uninitialised value was created by a heap allocation
==865737== at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737== by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737== at 0xBA2522EC: UnknownInlinedFun (MultifitComputations.h:435)
==865737== by 0xBA2522EC: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737== Uninitialised value was created by a heap allocation
==865737== at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737== by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737== at 0xBA252EDA: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:243)
==865737== by 0xBA252EDA: UnknownInlinedFun (invoke.h:61)
--
==865737== Uninitialised value was created by a heap allocation
==865737== at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737== by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737== at 0xBA264B7F: UnknownInlinedFun (TimeComputationKernels.h:1030)
==865737== by 0xBA264B7F: UnknownInlinedFun (invoke.h:61)
--
==865737== Uninitialised value was created by a heap allocation
==865737== at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737== by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737== at 0xBA264C5B: UnknownInlinedFun (TimeComputationKernels.h:1071)
==865737== by 0xBA264C5B: UnknownInlinedFun (invoke.h:61)
--
==865737== Uninitialised value was created by a heap allocation
==865737== at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737== by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737== at 0xA2E1F648: UnknownInlinedFun (EcalRecHit.h:128)
==865737== by 0xA2E1F648: EcalRecHitSimpleAlgo::makeRecHit(EcalUncalibratedRecHit const&, float const&, float const&, unsigned int const&) const (EcalRecHitSimpleAlgo.h:46)
--
==865737== Uninitialised value was created by a heap allocation
==865737== at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737== by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
who may be also related to #44956
looking into the detailed traceback I see this
alpaka_cuda_async::EcalMultifitConditionsHostESProducer::produce(EcalMultifitConditionsRcd const&) (EcalMultifitConditionsHostESProducer.cc:74)
it seems that adding
diff --git a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
index ddba855853c..96a7792e9cb 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
@@ -72,6 +72,10 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
size_t numberOfXtals = pedestalsData.size();
auto product = std::make_unique<EcalMultifitConditionsHost>(numberOfXtals, cms::alpakatools::host());
+ {
+ alpaka::QueueCpuBlocking queue{cms::alpakatools::host()};
+ auto buff = product->buffer(); alpaka::memset(queue,buff, 0xa5);
+ }
auto view = product->view();
// Filling pedestals
it does not crash and all messages from Valgrind mentioned above disappear. Of course this is not necessarily a good news...
Reading the code of the ESProducer, the only field that seems to not be filled is the rawid
.
@VinInn could you check if adding
diff --git a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
index ddba855853c..d0ee97230ec 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
@@ -91,6 +91,7 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
for (unsigned int i = 0; i < barrelSize; ++i) {
auto vi = view[i];
+ vi.rawid() = 0;
vi.pedestals_mean_x12() = pedestalsEB[i].mean_x12;
vi.pedestals_rms_x12() = pedestalsEB[i].rms_x12;
@@ -113,6 +114,7 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
} // end Barrel loop
for (unsigned int i = 0; i < endcapSize; ++i) {
auto vi = view[barrelSize + i];
+ vi.rawid() = 0;
vi.pedestals_mean_x12() = pedestalsEE[i].mean_x12;
vi.pedestals_rms_x12() = pedestalsEE[i].rms_x12;
instead of the memset()
also makes the valgrind messages go away ?
@thomreis should EcalMultifitConditionsHostESProducer
fill the rawid
?
Mhm, grep
does not find any mention of rawid
in RecoLocalCalo/EcalRecProducers/
- so it doesn't seem to be used anywhere.
Then, I don't understand what un-initialised memory valgrind is complaining about :-(
no it is not enough... Is not that there are some "padding space" for alignment and somehow it is accessed?
Yes, there is padding between the columns... but it should not be accessed, unless the indices used to access the SoA are larger than its nominal capacity.
in RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h pedestal can be accessed with the channel index
849 pedestal = conditionsDev.pedestals_mean_x12()[ch];
while everywhere else is accessed with the HashedIndex @thomreis : is this intended or is a bug?
869 auto const mean_x6 = conditionsDev.pedestals_mean_x6()[hashedId];
870 auto const rms_x6 = conditionsDev.pedestals_rms_x6()[hashedId];
871 auto const gain12Over6 = conditionsDev.gain12Over6()[hashedId];
872 sample_value = (static_cast
etc
in addition the hasindex is computed from the digi rawid 833 auto const hashedId = isBarrel ? ecal::reconstruction::hashedIndexEB(did.rawId()) 834 : offsetForHashes + ecal::reconstruction::hashedIndexEE(did.rawId());
and if did.rawId() is junk (as apparently can be, as seen in the companion issue) hashedId can easily be anything....
Very good points.
We have the possibility of enabling runtime checks on the indices, let me dig how to do it. I actually think we should actually enable it by default, if the cost is not too high.
Can you try running with https://github.com/cms-sw/cmssw/pull/44987 ?
what is supposed to happen? anyhow no message
It's supposed to fail an assert or throw an exception if there is an out of bounds access (and if I made the correct changes).
Did https://github.com/cms-sw/cmssw/pull/45210 fix this ?
Debugging #44940 I run
(use the script by @mmusich in #44956 to set up the environment)
and see plenty of report of the type
I use something like this to get the full list
once compiled with -g RecoLocalCalo/EcalRecProducers it becomes
that does not give further hint (it should happen somewhere in https://cmssdt.cern.ch/lxr/source/RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h?v=CMSSW_14_1_X_2024-05-05-2300#0776 If I well understood.
So I'm not able to debug further