cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.3k forks source link

Valgrind reports plenty of "Uninitialised value" created by a stack allocation in an Alpaka-kernel in pluginRecoLocalCaloEcalRecProducersPluginsPortableSerialSync.so #44957

Open VinInn opened 5 months ago

VinInn commented 5 months ago

Debugging #44940 I run

valgrind --tool=memcheck --leak-check=full --track-origins=yes --suppressions=${ROOTSYS}/etc/valgrind-root.supp  cmsRun  hlt_run380466.py

(use the script by @mmusich in #44956 to set up the environment)

and see plenty of report of the type

==855852== Conditional jump or move depends on uninitialised value(s)
==855852==    at 0xA8813D24: void std::__introsort_loop<__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Ite
r_comp_iter<NumericSafeGreaterByPt<reco::PFJet> > >(__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, __gnu_cxx::__normal_iterator<rec
o::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<NumericSafeGreaterByPt<reco::PFJet> >) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/n
week-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0xA8813DC3: void std::__introsort_loop<__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Ite
r_comp_iter<NumericSafeGreaterByPt<reco::PFJet> > >(__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, __gnu_cxx::__normal_iterator<rec
o::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<NumericSafeGreaterByPt<reco::PFJet> >) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/n
week-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0xA880BC90: reco::CorrectedJetProducer<reco::PFJet>::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/e
l9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0x4A8B131: edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/
x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4A8495B: edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nwee
k-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4A11528: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__ex
ception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTra
its<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_
gcc12/libFWCoreFramework.so)
==855852==    by 0x4A1B997: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/
el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4E74F27: tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64
/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreConcurrency.so)
==855852==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:322)
==855852==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:458)
==855852==    by 0x640991A: UnknownInlinedFun (arena.cpp:137)
==855852==    by 0x640991A: tbb::detail::r1::market::process(rml::job&) (market.cpp:599)
==855852==    by 0x640BACD: UnknownInlinedFun (private_server.cpp:271)
==855852==    by 0x640BACD: tbb::detail::r1::rml::private_worker::thread_routine(void*) (private_server.cpp:221)
==855852==    by 0x68C6801: start_thread (in /usr/lib64/libc.so.6)
==855852==  Uninitialised value was created by a stack allocation
==855852==    at 0xBA2A6EDD: alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka_serial_sync::ecal::multifit::Kernel_time_computation_init, Ecal
DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, EcalDigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const
&, EcalMultifitConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, float*, float*, float*, bool*, char*>::operator()() const (in /cvmfs/cms
-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginRecoLocalCaloEcalRecProducersPluginsPortableSerialSync.so)
==855852==
==855852== Conditional jump or move depends on uninitialised value(s)
==855852==    at 0xA8813D24: void std::__introsort_loop<__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Ite
r_comp_iter<NumericSafeGreaterByPt<reco::PFJet> > >(__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, __gnu_cxx::__normal_iterator<rec
o::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<NumericSafeGreaterByPt<reco::PFJet> >) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/n
week-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0xA8813DC3: void std::__introsort_loop<__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Ite
r_comp_iter<NumericSafeGreaterByPt<reco::PFJet> > >(__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, __gnu_cxx::__normal_iterator<rec
o::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<NumericSafeGreaterByPt<reco::PFJet> >) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/n
week-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0xA880BC90: reco::CorrectedJetProducer<reco::PFJet>::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/e
l9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0x4A8B131: edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/
x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4A8495B: edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nwee
k-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4A11528: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__ex
ception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTra
its<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_
gcc12/libFWCoreFramework.so)
==855852==    by 0x4A1B997: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/
el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4E74F27: tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64
/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreConcurrency.so)
==855852==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:322)
==855852==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:458)
==855852==    by 0x640991A: UnknownInlinedFun (arena.cpp:137)
==855852==    by 0x640991A: tbb::detail::r1::market::process(rml::job&) (market.cpp:599)
==855852==    by 0x640BACD: UnknownInlinedFun (private_server.cpp:271)
==855852==    by 0x640BACD: tbb::detail::r1::rml::private_worker::thread_routine(void*) (private_server.cpp:221)
==855852==    by 0x68C6801: start_thread (in /usr/lib64/libc.so.6)
==855852==  Uninitialised value was created by a stack allocation
==855852==    at 0xBA2A6EDD: alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka_serial_sync::ecal::multifit::Kernel_time_computation_init, Ecal
DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, EcalDigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const
&, EcalMultifitConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, float*, float*, float*, bool*, char*>::operator()() const (in /cvmfs/cms
-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginRecoLocalCaloEcalRecProducersPluginsPortableSerialSync.so)
==855852==

I use something like this to get the full list

egrep -A2 "Conditional jump or move depends on uninitialised|Uninitialised value was created" /data/user/innocent/hltBug/valg4.log | less

once compiled with -g RecoLocalCalo/EcalRecProducers it becomes

==865737==  Uninitialised value was created by a stack allocation
==865737==    at 0xBA263EDD: alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka_serial_sync::ecal::multifit::Kernel_time_computation_init, EcalDigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, EcalDigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, EcalMultifitConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, float*, float*, float*, bool*, char*>::operator()() const (TaskKernelCpuSerial.hpp:51)

that does not give further hint (it should happen somewhere in https://cmssdt.cern.ch/lxr/source/RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h?v=CMSSW_14_1_X_2024-05-05-2300#0776 If I well understood.

So I'm not able to debug further

cmsbuild commented 5 months ago

cms-bot internal usage

cmsbuild commented 5 months ago

A new Issue was created by @VinInn.

@smuzaffar, @antoniovilela, @rappoccio, @Dr15Jones, @makortel, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 5 months ago

assign reconstruction, heterogeneous

cmsbuild commented 5 months ago

New categories assigned: reconstruction,heterogeneous

@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 5 months ago

@cms-sw/ecal-dpg-l2 @thomreis

fwyzard commented 5 months ago

Looking at the code, the only allocation I see inside the Kernel_time_computation_init kernel is the shared memory (which on the serial CPU back-end is just plain memory).

@VinInn, as a simple test, could you try adding something like

for (auto i: cms::alpakatools::uniform_elements(acc, 2*elemsPerBlock)) {
  shrSampleValues[i] = 0;
}
alpaka::syncBlockThreads(acc);

around line 804, and check if that makes valgrind happy ?

@thomreis, could you check if that has any impact on the results ?

VinInn commented 5 months ago

I can try but valgrind is saying created by a stack allocation not heap

fwyzard commented 5 months ago

Alpaka allocates "shared memory" for the CPU backend as a static data member of a class templated on the kernel type... so that may count as a stack allocation ?

VinInn commented 5 months ago

most probably yes... one may check @ TaskKernelCpuSerial.hpp:51

VinInn commented 5 months ago

zeroing shrSampleValues does not help WAIT, maybe not recompiled....

VinInn commented 5 months ago

I confirm that the valgrind issue is still there. I set the memory to "junk"

      memset(shrSampleValues,0xa5,2*elemsPerBlock*sizeof(ScalarType));
      alpaka::syncBlockThreads(acc);

at least it does not crash....

makortel commented 5 months ago

Alpaka allocates "shared memory" for the CPU backend as a static data member of a class templated on the kernel type

@fwyzard Could you give more details? I'm now wondering how a static data member would work with running multiple instances of the kernel concurrently (naively I'd assume a data race). Poking around I found https://github.com/alpaka-group/alpaka/blob/1.1.0/include/alpaka/block/shared/dyn/BlockSharedMemDynMember.hpp#L83 as the memory block for both static and dynamic shared memory (that is not class-static), but I could have easily missed something.

fwyzard commented 5 months ago

hi @makortel you are right, it's a regular data member, not a static one (sorry I mixed things up)

VinInn commented 5 months ago

I actually overlooked all these other ones

=865737==
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA252001: UnknownInlinedFun (MultifitComputations.h:488)
==865737==    by 0xBA252001: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA25200A: UnknownInlinedFun (MultifitComputations.h:490)
==865737==    by 0xBA25200A: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA2522B0: UnknownInlinedFun (MultifitComputations.h:428)
==865737==    by 0xBA2522B0: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA2522EC: UnknownInlinedFun (MultifitComputations.h:435)
==865737==    by 0xBA2522EC: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA252EDA: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:243)
==865737==    by 0xBA252EDA: UnknownInlinedFun (invoke.h:61)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA264B7F: UnknownInlinedFun (TimeComputationKernels.h:1030)
==865737==    by 0xBA264B7F: UnknownInlinedFun (invoke.h:61)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA264C5B: UnknownInlinedFun (TimeComputationKernels.h:1071)
==865737==    by 0xBA264C5B: UnknownInlinedFun (invoke.h:61)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xA2E1F648: UnknownInlinedFun (EcalRecHit.h:128)
==865737==    by 0xA2E1F648: EcalRecHitSimpleAlgo::makeRecHit(EcalUncalibratedRecHit const&, float const&, float const&, unsigned int const&) const (EcalRecHitSimpleAlgo.h:46)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--

who may be also related to #44956

VinInn commented 5 months ago

looking into the detailed traceback I see this

alpaka_cuda_async::EcalMultifitConditionsHostESProducer::produce(EcalMultifitConditionsRcd const&) (EcalMultifitConditionsHostESProducer.cc:74)
VinInn commented 5 months ago

it seems that adding

diff --git a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
index ddba855853c..96a7792e9cb 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
@@ -72,6 +72,10 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
       size_t numberOfXtals = pedestalsData.size();

       auto product = std::make_unique<EcalMultifitConditionsHost>(numberOfXtals, cms::alpakatools::host());
+      {
+         alpaka::QueueCpuBlocking queue{cms::alpakatools::host()};
+         auto buff = product->buffer(); alpaka::memset(queue,buff, 0xa5);
+      }
       auto view = product->view();

       // Filling pedestals

it does not crash and all messages from Valgrind mentioned above disappear. Of course this is not necessarily a good news...

fwyzard commented 5 months ago

Reading the code of the ESProducer, the only field that seems to not be filled is the rawid.

@VinInn could you check if adding

diff --git a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
index ddba855853c..d0ee97230ec 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
@@ -91,6 +91,7 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {

       for (unsigned int i = 0; i < barrelSize; ++i) {
         auto vi = view[i];
+        vi.rawid() = 0;

         vi.pedestals_mean_x12() = pedestalsEB[i].mean_x12;
         vi.pedestals_rms_x12() = pedestalsEB[i].rms_x12;
@@ -113,6 +114,7 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
       }  // end Barrel loop
       for (unsigned int i = 0; i < endcapSize; ++i) {
         auto vi = view[barrelSize + i];
+        vi.rawid() = 0;

         vi.pedestals_mean_x12() = pedestalsEE[i].mean_x12;
         vi.pedestals_rms_x12() = pedestalsEE[i].rms_x12;

instead of the memset() also makes the valgrind messages go away ?

@thomreis should EcalMultifitConditionsHostESProducer fill the rawid ?

fwyzard commented 5 months ago

Mhm, grep does not find any mention of rawid in RecoLocalCalo/EcalRecProducers/ - so it doesn't seem to be used anywhere.

Then, I don't understand what un-initialised memory valgrind is complaining about :-(

VinInn commented 5 months ago

no it is not enough... Is not that there are some "padding space" for alignment and somehow it is accessed?

fwyzard commented 5 months ago

Yes, there is padding between the columns... but it should not be accessed, unless the indices used to access the SoA are larger than its nominal capacity.

VinInn commented 5 months ago

in RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h pedestal can be accessed with the channel index

   849            pedestal = conditionsDev.pedestals_mean_x12()[ch];

while everywhere else is accessed with the HashedIndex @thomreis : is this intended or is a bug?

869 auto const mean_x6 = conditionsDev.pedestals_mean_x6()[hashedId]; 870 auto const rms_x6 = conditionsDev.pedestals_rms_x6()[hashedId]; 871 auto const gain12Over6 = conditionsDev.gain12Over6()[hashedId]; 872 sample_value = (static_cast(adc) - mean_x6) gain12Over6; 873 sample_value_error = rms_x6 gain12Over6;

etc

in addition the hasindex is computed from the digi rawid 833 auto const hashedId = isBarrel ? ecal::reconstruction::hashedIndexEB(did.rawId()) 834 : offsetForHashes + ecal::reconstruction::hashedIndexEE(did.rawId());

and if did.rawId() is junk (as apparently can be, as seen in the companion issue) hashedId can easily be anything....

fwyzard commented 5 months ago

Very good points.

We have the possibility of enabling runtime checks on the indices, let me dig how to do it. I actually think we should actually enable it by default, if the cost is not too high.

fwyzard commented 5 months ago

Can you try running with https://github.com/cms-sw/cmssw/pull/44987 ?

VinInn commented 5 months ago

what is supposed to happen? anyhow no message

fwyzard commented 5 months ago

It's supposed to fail an assert or throw an exception if there is an out of bounds access (and if I made the correct changes).

fwyzard commented 4 months ago

Did https://github.com/cms-sw/cmssw/pull/45210 fix this ?