swagata87 closed this issue 2 years ago.
A new Issue was created by @swagata87 Swagata Mukherjee.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign hlt, reconstruction
New categories assigned: hlt,reconstruction
@jpata,@missirol,@clacaputo,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks
@swagata87 could you provide the full stack traces for the job that failed with the segmentation violations?
Three examples are pasted below:
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
Sat Jun 18 18:31:53 CEST 2022
Thread 1 (Thread 0x7fde7a331540 (LWP 194148) "cmsRun"):
#0 0x00007fde7c1d3ddd in poll () from /lib64/libc.so.6
#1 0x00007fde70bf428f in full_read.constprop () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007fde70bf4c1c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007fde70bf756b in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fde7c2366a6 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6 0x00007fddb6e786ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl >, std::less, std::allocator > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map >, std::less, std::allocator > > > > const*&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7 0x00007fddb6e76fab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8 0x00007fde7ec2dd83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007fde7ec16eaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007fde7eb720e5 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007fde7eb723db in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007fde7eb749c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007fde7eab8c45 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00007fde7d2c1b8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7fddb5dd2300, this=0x7fde799da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fde799da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007fde7eae2ac8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x00007fde7eaed8fb in edm::EventProcessor::runToCompletion() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#19 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#20 0x00007fde7d2b015b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#21 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#22 0x000000000040971c in main ()
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: EcalRawToDigi:hltEcalDigisLegacy
Module: none
A fatal system signal has occurred: segmentation violation
[ message truncated - showing only crashed thread ]
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
Tue Jun 14 06:45:22 CEST 2022
Thread 1 (Thread 0x7f1d0ef42540 (LWP 251002) "cmsRun"):
#0 0x00007f1d10de4ddd in poll () from /lib64/libc.so.6
#1 0x00007f1d057f428f in full_read.constprop () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007f1d057f4c1c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007f1d057f756b in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f1d10e45d29 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6 0x00007f1c4b0876ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl >, std::less, std::allocator > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map >, std::less, std::allocator > > > > const*&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7 0x00007f1c4b085fab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8 0x00007f1d1383fd83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007f1d13828eaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007f1d137840e5 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007f1d137843db in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007f1d137869c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007f1d136cac45 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00007f1d11ed3b8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7f1c4a9a1500, this=0x7f1d0e5da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f1d0e5da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007f1d136f4ac8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x00007f1d136ff8fb in edm::EventProcessor::runToCompletion() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#19 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#20 0x00007f1d11ec215b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#21 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#22 0x000000000040971c in main ()
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: HcalHitReconstructor:hltHoreco
Module: HcalHitReconstructor:hltHoreco
A fatal system signal has occurred: segmentation violation
[ message truncated - showing only crashed thread ]
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
Tue Jun 14 06:45:23 CEST 2022
Thread 1 (Thread 0x7f6148fd5540 (LWP 250893) "cmsRun"):
#0 0x00007f614ae77ddd in poll () from /lib64/libc.so.6
#1 0x00007f613f1f228f in full_read.constprop () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007f613f1f2c1c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007f613f1f556b in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f614aed8cb5 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6 0x00007f60850e76ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl >, std::less, std::allocator > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map >, std::less, std::allocator > > > > const*&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7 0x00007f60850e5fab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8 0x00007f614d8d4d83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007f614d8bdeaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007f614d8190e5 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007f614d8193db in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007f614d81b9c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007f614d75fc45 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00007f614bf5fb8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7f60849ad400, this=0x7f61485da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f61485da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007f614d789ac8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x00007f614d7948fb in edm::EventProcessor::runToCompletion() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#19 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#20 0x00007f614bf4e15b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#21 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#22 0x000000000040971c in main ()
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: CAHitNtupletCUDA:hltPixelTracksCPU
Module: none
Module: none
A fatal system signal has occurred: segmentation violation
[ message truncated - showing only crashed thread ]
The full list is here:
Experts are working on providing a recipe to reproduce the crashes offline (tagging @mzarucki and @fwyzard). Once that is available, it can be posted here so that the tracker DPG can have a look. The code that triggered the crashes is under tracker DPG.
Dear tracker DPG, (@cms-sw/trk-dpg-l2)
I managed to reproduce the GPU crash that happened during run 353941 on the machine gputest-milan-01.cms at Point 5.
I used CMSSW_12_3_5; $CMSSW_RELEASE_BASE is /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/.
General instructions to set up a CMSSW area on the online GPU nodes are here: https://twiki.cern.ch/twiki/bin/viewauth/CMS/TriggerDevelopmentWithGPUs
The HLT configuration file is: https://swmukher.web.cern.ch/swmukher/hlt_v5.py
The .raw file I ran on is run353941_ls0019_index000175_fu-c2a02-39-04_pid194400.raw.
This .raw file and all the other .raw files are available on the online machines under /store/error_stream.
I have copied one .raw file here: https://swmukher.web.cern.ch/swmukher/run353941_ls0019_index000175_fu-c2a02-39-04_pid194400.raw
In case it is useful, the HLT configuration file was obtained by the following command:
https_proxy=http://cmsproxy.cms:3128/ hltConfigFromDB --adg --configName /cdaq/physics/firstCollisions22/v2.4/HLT/V5 > hlt_v5.py
Then, at the end, the following block was added:
process.EvFDaqDirector = cms.Service(
"EvFDaqDirector",
runNumber=cms.untracked.uint32(353941), #maybe_replace_me
baseDir=cms.untracked.string("tmp"),
buBaseDir=cms.untracked.string(
"/nfshome0/swmukher/check/CMSSW_12_3_5/src" #replace_me
),
useFileBroker=cms.untracked.bool(False),
fileBrokerKeepAlive=cms.untracked.bool(True),
fileBrokerPort=cms.untracked.string("8080"),
fileBrokerUseLocalLock=cms.untracked.bool(True),
fuLockPollInterval=cms.untracked.uint32(2000),
requireTransfersPSet=cms.untracked.bool(False),
selectedTransferMode=cms.untracked.string(""),
mergingPset=cms.untracked.string(""),
outputAdler32Recheck=cms.untracked.bool(False),
)
process.source.fileNames = cms.untracked.vstring("file:run353941_ls0019_index000175_fu-c2a02-39-04_pid194400.raw") #maybe_replace_me
process.source.fileListMode = True
cmsRun hlt_v5.py reproduces the crash.
It will create a tmp folder (the EvFDaqDirector baseDir). To reproduce the crash again, I had to remove the tmp folder before running cmsRun again.
Let me know if something was unclear.
@swagata87 thank you for providing these instructions!
@tsusa you can use the online GPU machines to reproduce the issue:
ssh gpu-c2a02-39-01.cms
mkdir -p /data/$USER
cd /data/$USER
source /data/cmssw/cmsset_default.sh
cmsrel CMSSW_12_3_5
cd CMSSW_12_3_5
mkdir run
cd run
cp ~hltpro/error/hlt_error_run353941.py .
cmsRun hlt_error_run353941.py
In my test the problem did not happen every time, I had to run the job a few times before it crashed:
while cmsRun hlt_error_run353941.py; do clear; rm -rf output; done
It eventually crashed, though I'm not 100% sure if it was due to the same problem :-/
Yes, looks like the same crash:
#4 <signal handler called>
#5 0x00007fbbf5c9f6a6 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6 0x00007fbb34ed06ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl<SiPixelErrorsSoA, int, SiPixelErrorCompact const*, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7 0x00007fbb34ecefab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8 0x00007fbbf8696d83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007fbbf867feaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
...
As a guess, I think the problem is that an extremely large amount of data is being requested to be copied, which leads to a memory overwrite into protected memory space. This is just based on what edm::Event::emplaceImpl is doing, which is basically calling...
So cms::cuda::SimpleVector does not initialize any of its member data in its constructor
If the first call to SiPixelDigiErrorsSoAFromCUDA::acquire hits this condition https://github.com/cms-sw/cmssw/blob/d573dd29448b13dea818ed927bba7a63814ba29a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc#L54-L55
then this call in produce https://github.com/cms-sw/cmssw/blob/d573dd29448b13dea818ed927bba7a63814ba29a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc#L73
will just copy a random number of bytes from a random memory address.
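For illustration, here is a minimal, self-contained sketch of this hypothesised failure mode (the classes below are simplified stand-ins written for this explanation, not the actual CMSSW code): a SimpleVector-like member that is left default-constructed carries an indeterminate size and data pointer, so a later copy based on them can read a random number of bytes from a random address.

#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for cms::cuda::SimpleVector<T>; like the real class,
// it deliberately does NOT initialize its members in the constructor.
template <typename T>
struct SimpleVectorLike {
  int m_size;  // indeterminate until explicitly set
  T* m_data;   // indeterminate until explicitly set
  int size() const { return m_size; }
  T const* data() const { return m_data; }
};

struct ErrorCompact {
  unsigned int word;
};

// Sketch of the producer pattern described above (names are illustrative).
struct ProducerSketch {
  SimpleVectorLike<ErrorCompact> error_;  // stays uninitialized if acquire() bails out early

  void acquire(bool hasErrorProduct) {
    if (!hasErrorProduct)
      return;  // early return: error_ keeps whatever garbage happens to be in memory
    // ... otherwise error_ would be filled from the GPU product here ...
  }

  std::vector<ErrorCompact> produce() const {
    // Emplacing the output product effectively copies size() elements starting
    // at data(). With an uninitialized error_, both the element count and the
    // source pointer are indeterminate, so this copy can touch an arbitrary
    // amount of memory and segfault, as in the stack traces above.
    std::vector<ErrorCompact> out(static_cast<std::size_t>(error_.size()));
    std::memcpy(out.data(), error_.data(), out.size() * sizeof(ErrorCompact));
    return out;
  }
};

int main() {
  ProducerSketch p;
  p.acquire(false);  // no error product on the first event -> error_ left uninitialized
  // Calling p.produce() here would read indeterminate values; value-initializing
  // error_ (as in the minimal fix proposed below) makes it safely return an empty vector.
  return 0;
}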
@Dr15Jones thanks for investigating the issue.
So cms::cuda::SimpleVector does not initialize any of its member data in its constructor.

This is intended, because a SimpleVector is often allocated by the host in GPU memory, so the constructor cannot be run.
However, this does leave open the possibility of using uninitialised memory :-(
A minimal fix could be
diff --git a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc b/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
index 4037b4d5061..554f1425cef 100644
--- a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
+++ b/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
@@ -28,7 +28,7 @@ private:
edm::EDPutTokenT<SiPixelErrorsSoA> digiErrorPutToken_;
cms::cuda::host::unique_ptr<SiPixelErrorCompact[]> data_;
- cms::cuda::SimpleVector<SiPixelErrorCompact> error_;
+ cms::cuda::SimpleVector<SiPixelErrorCompact> error_ = cms::cuda::make_SimpleVector<SiPixelErrorCompact>(0, nullptr);
const SiPixelFormatterErrors* formatterErrors_ = nullptr;
};
With it I have been able to run over 20 times on the same input as before without triggering any errors.
PRs with this fix:
Hm, looks like I am late to the party... but, if it's any help, here are instructions for the error seen in Run 353744 (AFAICT you have been testing with Run 353941). Running on the Hilton this time:
Input file: file:/nfshome0/hltpro/hilton_c2e36_35_04/hltpro/thiagoScratch/run353744_ls0009.root
CMSSW: CMSSW_12_3_5
GT: 123X_dataRun3_HLT_v7
Menu: /cdaq/physics/firstCollisions22/v2.4/HLT/V2
I also see the same problem: it crashes only every once in a while. It's probably the same bug, but I add it here for completeness.
I also have here the other crash; this one is fully reproducible:
Input file: file:/nfshome0/hltpro/hilton_c2e36_35_04/hltpro/thiagoScratch/run353709_ls0085.root
CMSSW: CMSSW_12_3_5
GT: 123X_dataRun3_HLT_v7
Menu: /cdaq/physics/firstCollisions22/v2.4/HLT/V2
It will always crash on the 52nd event (Run 353709, Event 76567528, LumiSection 85), with the message:
cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_5-slc7_amd64_gcc10/build/CMSSW_12_3_5-build/tmp/BUILDROOT/32f4c0d8c5d5ff0fb0f1b58023d4424d/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:293: void GPUCACell::find_ntuplets(const Hits&, GPUCACell*, GPUCACell::CellTracksVector&, GPUCACell::HitContainer&, cms::cuda::AtomicPairCounter&, GPUCACell::Quality*, GPUCACell::TmpTuple&, unsigned int, bool) const [with int DEPTH = 2; GPUCACell::Hits = TrackingRecHit2DSOAView; GPUCACell::CellTracksVector = cms::cuda::SimpleVector<cms::cuda::VecArray<short unsigned int, 48> >; GPUCACell::HitContainer = cms::cuda::OneToManyAssoc<unsigned int, 32769, 163840>; GPUCACell::Quality = pixelTrack::Quality; GPUCACell::TmpTuple = cms::cuda::VecArray<unsigned int, 6>]: Assertion `tmpNtuplet.size() <= 4' failed.
PS: it's not needed to run on the Hilton at all; I was running in offline-like mode.
@trtomei could you clarify
/store/error_stream/run353709/run353709_ls0085_index000141_fu-c2a05-35-01_pid90386.raw
Running online, I have not been able to reproduce the error using the .raw input file, either with or without GPUs.
@fwyzard To clarify: I ran on /store/error_stream/run353709/run353709_ls0085_index000141_fu-c2a05-35-01_pid90386.raw on the Hilton (hilton-c2e36-35-04), using the process.options = cms.untracked.PSet( accelerators = cms.untracked.vstring( '*' ) ) option, and I see the lines
%MSG-i CUDAService: (NoModuleName) 26-Jun-2022 12:24:47 pre-events
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
%MSG
For me, the error happens consistently.
Maybe we can sit together tomorrow and solve this.
@swagata87 @trtomei
Is this issue still relevant?
Is this issue still relevant?
Actually, yesterday we had a crash which looks like the type 1 crash mentioned in the issue description.
Here is some relevant information on yesterday's crash:
Run number: 360224
StartTime: Oct 12 2022, 02:52
EndTime: Oct 12 2022, 04:36
HLT Menu: /cdaq/physics/Run2022/2e34/v1.4.1/HLT/V1
CMSSW_12_4_9
Crash happened in: fu-c2b05-23-01
The error stream file has been copied to the Hilton, so I think FOG will check whether it is reproducible and will follow up.
cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_9-el8_amd64_gcc10/build/CMSSW_12_4_9-build/tmp/BUILDROOT/dc6747a684df926e1faea7ef7c301e1a/opt/cmssw/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_9/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:293: void GPUCACell::find_ntuplets(const Hits&, GPUCACell*, GPUCACell::CellTracksVector&, GPUCACell::HitContainer&, cms::cuda::AtomicPairCounter&, GPUCACell::Quality*, GPUCACell::TmpTuple&, unsigned int, bool) const [with int DEPTH = 2; GPUCACell::Hits = TrackingRecHit2DSOAView; GPUCACell::CellTracksVector = cms::cuda::SimpleVector<cms::cuda::VecArray; GPUCACell::HitContainer = cms::cuda::OneToManyAssoc; GPUCACell::Quality = pixelTrack::Quality; GPUCACell::TmpTuple = cms::cuda::VecArray]: Assertion `tmpNtuplet.size() <= 4' failed.
The files in ROOT format and the HLT configuration are in: /afs/cern.ch/user/t/tomei/public/issue38453
This is reproducible in the Hilton with GPU:
%MSG-i ThreadStreamSetup: (NoModuleName) 14-Oct-2022 02:05:46 pre-events
setting # threads 4
setting # streams 4
%MSG
%MSG-i CUDAService: (NoModuleName) 14-Oct-2022 02:05:47 pre-events
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
@cms-sw/tracking-pog-l2
In this issue, one HLT crash is not yet solved, and I would say we need help from tracking experts in order to find a fix.
The crash is reproducible offline (see https://github.com/cms-sw/cmssw/issues/38453#issuecomment-1278364360); it comes from the (HLT) pixel reconstruction, and it only happens on CPU, not on GPU (from what we have seen so far).
Removing some assert calls, one can find a tmpNtuplet with size=5, but that's as far as my insight goes.
https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_10_patch2/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h#L293
I have a vague recollection of a comment from @VinInn saying that we should simply remove the assert...
I think now it's OK to have ntuplets with 5 hits, so an alternative could be to change the condition to <= 5?
At least, removing the asserts [1,2] does not lead to any other crashes, fwiw.
And just for my understanding: can it be expected that, for the same event, we do not see an ntuplet with size=5 on GPU?
[1] https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_10_patch2/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h#L293 [2] https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_10_patch2/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h#L334
It does not happen on GPU because the asserts are removed there. This is a sort of sextuplet candidate (rare? impossible?). Anyhow, if it does not cause havoc on GPU, I would either change the condition following @fwyzard's advice or just remove the assert. Mind the assert at the end of the function as well.
It does not happen on GPU because the asserts are removed there.

Okay, thanks, but still I tried to just print the ntuplet size while running on GPU, and I didn't see size=5.
Will try to have a look on CPU at how this track looks (and, if possible, compare to the one on GPU).
Thanks for having a look.
I checked that (unsurprisingly) the HLT runs fine on these 'error events', for both CPU and GPU, after changing the 4 to a 5 in the asserts, so in the meantime I'll open PRs with that change to gain time.
The PRs with the 4 -> 5 change are #39780 (12_6_X), #39781 (12_5_X), and #39782 (12_4_X).
@cms-sw/hlt-l2 (now speaking with the ORM hat, in order to better coordinate the creation of the next patch releases):
will this issue be fully solved after the merge of the backports of https://github.com/cms-sw/cmssw/pull/39780?
Yes, that is my understanding.
is there any other outstanding HLT crash with recent data that still needs to be followed up (outside of this ticket)?
There are two more issues, but those crashes have been rare: #39568, which ECAL has promised to look into, and #38651, which might somehow have been a glitch (seen only once).
FOG (@trtomei) can tell us if there are any new online crashes without a CMSSW issue.
+hlt
So the tuplet in question is joining layer pairs 0,3,10,7,12, i.e. all six layers (BPIX1,2,3 and FPIX1,2,3): geometrically (almost) impossible, but OK. Now, why not on GPU?
how can I run hlt_for_debug.py on GPU and NOT on CPU?
Anyhow, if we "observe" sextuplets we need to allow sextuplets in the code... so the fix of the asserts is OK (the arrays were already over-dimensioned).
The sextuplet is on GPU as well
In case you are interested, here are the coordinates of the hits:
CPU 0,3,10,7,12, r/z: 2.834839/0.714075,6.584036/-13.839200,10.662227/-29.628742,11.603213/-33.272655,13.539767/-40.767979,15.999158/-50.250683,
GPU 0,3,10,7,12, r/z: 2.834839/0.714075,6.584036/-13.839200,10.662227/-29.628742,11.603212/-33.272655,13.539767/-40.767979,15.999158/-50.250683,
how can I run hlt_for_debug.py on GPU and NOT on CPU?
Looks like this was already solved. I add one comment for documentation purposes.
The complication comes from the fact that the HLT menu includes 2 prescaled triggers that run the pixel CPU-only reco (which is why we saw the crash online). To ensure that only the pixel GPU reco is running, one solution is to remove them, but that's tricky to do starting from the full menu [1]; alternatively, one can just run 1 appropriate Path instead of the full menu (most times, this is enough for a reproducer) [2]. In the future, we/HLT should maybe try to build 'minimal' reproducers, e.g. not using the full menu if that's not needed.
[1] Add at the end of hlt_for_debug.py
:
del process.DQM_PixelReconstruction_v4
del process.AlCa_PFJet40_CPUOnly_v1
del process.HLT_PFJet40_GPUvsCPU_v1
process.hltMuonTriggerResultsFilter.triggerConditions = ['FALSE']
del process.PrescaleService
del process.DQMHistograms
dpaths = [foo for foo in process.paths_() if foo.startswith('Dataset_')]
for foo in dpaths: process.__delattr__(foo)
fpaths = [foo for foo in process.finalpaths_()]
for foo in fpaths: process.__delattr__(foo)
[2] In this case, it could have been:
hltGetConfiguration run:360224 \
--data \
--no-prescale \
--no-output \
--globaltag 124X_dataRun3_HLT_v4 \
--paths AlCa_PFJet40_v* \
--max-events -1 \
--input file:run360224_ls0081_file1.root \
> hlt.py
cat <<@EOF >> hlt.py
process.MessageLogger.cerr.FwkReport.limit = 1000
process.options.numberOfThreads = 1
process.options.accelerators = ['cpu']
@EOF
cmsRun hlt.py &> hlt.log
In case you are interested, here are the coordinates of the hits:
If it's not too much trouble to explain, I would be interested to know how to extract the information on layer pairs and r-z coordinates for a given candidate.
I see the crash at Run 360224, Event 82169671, LumiSection 81; if I run CPU-only, I can see tmpNtuplet.size == 5 inside find_ntuplets, but I don't see that if I run the same event on GPU, and this got me confused.
diff --git a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
index 4ec7069ac8e..a33ab98ca09 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
+++ b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
@@ -290,7 +290,19 @@ public:
auto doubletId = this - cells;
tmpNtuplet.push_back_unsafe(doubletId);
- assert(tmpNtuplet.size() <= 4);
+ assert(tmpNtuplet.size() <= 5);
+ if (tmpNtuplet.size()>4) {
+#ifdef __CUDACC__
+ printf("GPU ");
+#else
+ printf("CPU ");
+#endif
+ for (auto c : tmpNtuplet) printf("%d,",cells[c].theLayerPairId_);
+ printf(" r/z: ");
+ for (auto c : tmpNtuplet) printf("%f/%f,", cells[c].theInnerR,cells[c].theInnerZ);
+ auto c = tmpNtuplet[tmpNtuplet.size()-1]; printf("%f/%f,",cells[c].outer_r(hh),cells[c].outer_z(hh));
+ printf("\n");
+ }
bool last = true;
for (unsigned int otherCell : outerNeighbors()) {
@@ -331,7 +343,7 @@ public:
}
}
tmpNtuplet.pop_back();
- assert(tmpNtuplet.size() < 4);
+ assert(tmpNtuplet.size() < 5);
}
// Cell status management
I saw the printout twice, so I added the ifdef part.
Btw, the .back() method of VecArray is badly broken (it does not compile).
Thanks a lot for the info.
(This issue is solved; the rest below is just me trying to learn things.)
With Vincenzo's diff, I get what he wrote: same sextuplet on CPU and GPU.
In my previous attempts, I had additional printouts in GPUCACell::find_ntuplets, and in that case I couldn't see the sextuplet on GPU. I think this is somewhat reproducible: I ran 30 times with this diff [*] and I could see the sextuplet on GPU in the printouts only 2 times (on CPU, I saw it 10 times out of 10).
At least now I see what I was doing differently.
[*] (yes, most of these printouts are pointless)
diff --git a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
index 4ec7069ac8e..bfefdf7ccd6 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
+++ b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
@@ -290,15 +290,37 @@ public:
auto doubletId = this - cells;
tmpNtuplet.push_back_unsafe(doubletId);
- assert(tmpNtuplet.size() <= 4);
+ assert(tmpNtuplet.size() <= 5);
+ if (tmpNtuplet.size()>4) {
+#ifdef __CUDACC__
+ printf("GPU ");
+#else
+ printf("CPU ");
+#endif
+ for (auto c : tmpNtuplet) printf("%d,",cells[c].theLayerPairId_);
+ printf(" r/z: ");
+ for (auto c : tmpNtuplet) printf("%f/%f,", cells[c].theInnerR,cells[c].theInnerZ);
+ auto c = tmpNtuplet[tmpNtuplet.size()-1]; printf("%f/%f,",cells[c].outer_r(hh),cells[c].outer_z(hh));
+ printf("\n");
+ }
bool last = true;
for (unsigned int otherCell : outerNeighbors()) {
if (cells[otherCell].isKilled())
continue; // killed by earlyFishbone
last = false;
+#ifdef __CUDACC__
+ printf("GPU1 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+ printf("CPU1 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
cells[otherCell].find_ntuplets<DEPTH - 1>(
hh, cells, cellTracks, foundNtuplets, apc, quality, tmpNtuplet, minHitsPerNtuplet, startAt0);
+#ifdef __CUDACC__
+ printf("GPU2 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+ printf("CPU2 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
}
if (last) { // if long enough save...
if ((unsigned int)(tmpNtuplet.size()) >= minHitsPerNtuplet - 1) {
@@ -331,7 +353,12 @@ public:
}
}
tmpNtuplet.pop_back();
- assert(tmpNtuplet.size() < 4);
+ assert(tmpNtuplet.size() < 5);
+#ifdef __CUDACC__
+ printf("GPU3 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+ printf("CPU3 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
}
// Cell status management
In my previous attempts, I had additional printouts in GPUCACell::find_ntuplets, and in that case I couldn't see the sextuplet on GPU. I think this is somewhat reproducible: I ran 30 times with this diff [*] and I could see the sextuplet on GPU in the printouts only 2 times (on CPU, I saw it 10 times out of 10).

This is surprising, as we do not expect GPU vs CPU differences at this point of the processing. Will try to investigate more.
@missirol
I'm sorry: running on patatrack02, even changing GPU, I always observed (6 out of 6) both the GPU and the CPU printout.
1) On which machine are you running?
2) How exactly are you switching between CPU and GPU?
(In my case I'm just running cmsRun hlt_for_debug.py from a copy of the issue38453 directory using CMSSW_12_4_10_patch2.)
btw: printf from GPU is not guaranteed to appear if there are too many.
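(Side note for documentation, a hedged sketch not used in this thread: in a standalone CUDA program one can enlarge the device-side printf FIFO from the host before any kernel launch, which makes it less likely that heavy in-kernel printf traffic is silently dropped. Whether this can conveniently be injected into a cmsRun job is a separate question.)

#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // Enlarge the device-side printf FIFO (the default is on the order of 1 MB)
  // so that a large volume of in-kernel printf output is less likely to be lost.
  const std::size_t requested = 64u * 1024u * 1024u;  // 64 MB, an arbitrary example value
  cudaError_t err = cudaDeviceSetLimit(cudaLimitPrintfFifoSize, requested);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::size_t current = 0;
  cudaDeviceGetLimit(&current, cudaLimitPrintfFifoSize);
  std::printf("printf FIFO size is now %zu bytes\n", current);
  return 0;
}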
Running on patatrack02, even changing GPU, I always observed (6 out of 6) both the GPU and the CPU printout.
Sorry for the trouble, then. I tested on gpu-c2a02-39-03; to run CPU-only, I add process.options.accelerators = ['cpu'] to the config [*]. I've been using 12_4_10 with the diff in https://github.com/cms-sw/cmssw/issues/38453#issuecomment-1286135642 (will re-try with 12_4_10_patch2, but that likely makes no difference).
btw: printf from GPU is not guaranteed to appear if there are too many.
Thanks, didn't know, it might explain what I (didn't) see.
[*]
https_proxy=http://cmsproxy.cms:3128 \
hltGetConfiguration run:360224 \
--data \
--no-prescale \
--no-output \
--globaltag 124X_dataRun3_HLT_v4 \
--paths AlCa_PFJet40_v* \
--max-events -1 \
--input file:run360224_ls0081_file1.root \
> hlt.py
cat <<@EOF >> hlt.py
process.MessageLogger.cerr.FwkReport.limit = 1000
process.options.numberOfThreads = 1
#process.options.accelerators = ['cpu']
@EOF
cmsRun hlt.py &> hlt.log
btw: printf from GPU is not guaranteed to appear if there are too many.
I think this is indeed the explanation [*]. Case closed, and sorry again for the noise.
[*] I checked this by keeping the large number of printouts, but also adding
#ifdef __CUDACC__
if (tmpNtuplet.size() > 4) {
__trap();
}
#endif
and the program crashed 10/10 times on GPU (running only on the event in question), meaning each time there was a sextuplet on GPU.
@swagata87 @missirol can this issue be considered concluded, and therefore closed?
In my understanding, yes (I signed it). Swagata can confirm and close.
yes, I am closing this issue. Thanks everyone!
Dear experts,
During the week of June 13-20, the following 3 types of HLT crashes happened in collision runs. HLT was using CMSSW_12_3_5.

1) type 1
This crash happened on June 13th, during stable beams, with collisions at 900 GeV. Run number: 353709. The crash happened on a CPU node (fu-c2a05-35-01). Elog: http://cmsonline.cern.ch/cms-elog/1143438. Full crash report: https://swmukher.web.cern.ch/swmukher/hltcrash_June13_StableBeam.txt
2) type 2
This type of crash happened on GPU nodes (for example: fu-c2a02-35-01). It happened during collision runs when no real collisions were ongoing: on June 14th (run 353744, Pixel subdetector was out), and on June 18th (runs 353932, 353935, 353941, Pixel and tracker subdetectors were out).
3) type 3
This crash happened on fu-c2a02-39-01 (GPU), in collision run 353941 (Pixel and tracker subdetectors were out), while no real collisions were ongoing.
The reasons for crashes (2) and (3) might even be related. Relevant elog on (2) and (3): http://cmsonline.cern.ch/cms-elog/1143515
Regards, Swagata, as HLT DOC during June 13-20.