cms-bot internal usage
A new Issue was created by @wonpoint4.
@makortel, @sextonkennedy, @antoniovilela, @Dr15Jones, @smuzaffar, @rappoccio can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign hlt, heterogeneous
New categories assigned: hlt,heterogeneous
@Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
Executing the reproducer with CUDA_LAUNCH_BLOCKING=1, I see the following stack:
Fri Apr 5 15:43:25 CEST 2024
Thread 12 (Thread 0x7f8afb1ff640 (LWP 1360807) "cmsRun"):
#0 0x00007f8b5934291f in poll () from /lib64/libc.so.6
#1 0x00007f8b52e5a62f in full_read.constprop () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2 0x00007f8b52e0ee3c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3 0x00007f8b52e0f7a0 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f8b592a154c in __pthread_kill_implementation () from /lib64/libc.so.6
#6 0x00007f8b59254d06 in raise () from /lib64/libc.so.6
#7 0x00007f8b592287f3 in abort () from /lib64/libc.so.6
#8 0x00007f8b596aeeea in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:50
#9 0x00007f8b596ace6a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#10 0x00007f8b596abed9 in __cxa_call_terminate (ue_header=0x7f8af8d143c0) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#11 0x00007f8b596ac5f6 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=<optimized out>, context=0x7f8afb1f88e0) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:688
#12 0x00007f8b5a11b864 in _Unwind_RaiseException_Phase2 (exc=0x7f8af8d143c0, context=0x7f8afb1f88e0, frames_p=0x7f8afb1f87e8) at ../../../libgcc/unwind.inc:64
#13 0x00007f8b5a11c2bd in _Unwind_Resume (exc=0x7f8af8d143c0) at ../../../libgcc/unwind.inc:242
#14 0x00007f8af1e2e5aa in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> >::free(void*) [clone .cold] () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#15 0x00007f8af1e3bc68 in std::_Sp_counted_ptr_inplace<alpaka::detail::BufCpuImpl<std::byte, std::integral_constant<unsigned long, 1ul>, unsigned int>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#16 0x00007f8af1e30f17 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#17 0x00007f8af1e500bc in std::_Sp_counted_ptr_inplace<std::tuple<PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >, std::shared_ptr<alpaka_cuda_async::EDMetadata> >, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#18 0x00007f8af1e30f17 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#19 0x00007f8af1e4b4bc in std::any::_Manager_external<std::shared_ptr<std::tuple<PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >, std::shared_ptr<alpaka_cuda_async::EDMetadata> > > >::_S_manage(std::any::_Op, std::any const*, std::any::_Arg*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#20 0x00007f8b5b6186ca in std::_Sp_counted_ptr_inplace<std::any, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreFramework.so
@cms-sw/pf-l2 FYI
From the stack trace, it seems that an exception was thrown while another exception was being handled:
#6 0x00007f8b59254d06 in raise () from /lib64/libc.so.6
while
#12 0x00007f8b5a11b864 in _Unwind_RaiseException_Phase2 (exc=0x7f8af8d143c0, context=0x7f8afb1f88e0, frames_p=0x7f8afb1f87e8) at ../../../libgcc/unwind.inc:64
#13 0x00007f8b5a11c2bd in _Unwind_Resume (exc=0x7f8af8d143c0) at ../../../libgcc/unwind.inc:242
@mmusich, if you have time to look into this further, could you try running with a single stream / single thread, and post the full stack trace ?
sure. Adding to the configuration file
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
I get the stack attached in crash_run378940.log. Right before the stack trace I notice:
At the end of topoClusterContraction, found large *pcrhFracSize = 2220194
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
terminate called after throwing an instance of 'std::runtime_error'
what(): /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(160) 'TApi::eventRecord(event.getNativeHandle(), queue.getNativeHandle())' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
Thanks.
So, "cudaErrorIllegalAddress" is basically the GPU equivalent of "Segmentation violation" :-(
What happens with the stack trace is that once we hit a CUDA error, we raise an exception and start unwinding the stack. While doing that we try to free some CUDA memory, but that call also fails (because the cudaErrorIllegalAddress is still present), which triggers a second exception. The second exception cannot be handled, which causes the abort.
Of course this doesn't explain the reason for the error that we hit in the first place... that will need to be debugged.
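For illustration, here is a minimal standalone C++ sketch (not CMSSW code) of that failure mode: a cleanup step throws while the stack is already being unwound for a first exception, so the runtime calls std::terminate() and abort(), as in frames #7-#10 of the trace above:

#include <stdexcept>

// Minimal reproduction of "exception thrown during stack unwinding".
// The destructor stands in for the CachingAllocator::free() call that
// reports cudaErrorIllegalAddress while the first exception is in flight.
struct Buffer {
  ~Buffer() noexcept(false) {
    throw std::runtime_error("free() failed");  // second exception
  }
};

int main() {
  try {
    Buffer b;
    throw std::runtime_error("illegal memory access");  // first exception
    // Unwinding destroys 'b'; its destructor throws again,
    // so the runtime calls std::terminate() -> abort().
  } catch (std::exception const&) {
    // never reached
  }
}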
Here's a second reproducer (same input events). I see the segfault also when running on CPU only.
#!/bin/bash -ex
# CMSSW_14_0_4
hltGetConfiguration run:378940 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/debug/240405_run378940/files/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.root \
> hlt.py
cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ["*"]
@EOF
CUDA_LAUNCH_BLOCKING=1 \
cmsRun hlt.py &> hlt.log
Stack trace here: hlt.log.
Thread 1 (Thread 0x7f44a0bac640 (LWP 3012403) "cmsRun"):
#0 0x00007f44a1779301 in poll () from /lib64/libc.so.6
#1 0x00007f44967d56af in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2 0x00007f4496789dbc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3 0x00007f449678a720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f445fc94340 in void alpaka_serial_sync::FastCluster::operator()<false, alpaka::AccCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int>, std::enable_if<false, void> >(alpaka::AccCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int> const&, reco::PFRecHitSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusterParamsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFRecHitHCALTopologySoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusteringVarsSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusterSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>, reco::PFRecHitFractionSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>) const [clone .constprop.0] [clone .isra.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#6 0x00007f445fc95904 in alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka_serial_sync::FastCluster, reco::PFRecHitSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFClusterParamsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFRecHitHCALTopologySoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFClusteringVarsSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&, reco::PFClusterSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&, reco::PFRecHitFractionSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&>::operator()() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#7 0x00007f445fc9735f in alpaka_serial_sync::PFClusterProducerKernel::execute(alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&, PortableHostCollection<reco::PFClusterParamsSoALayout<128ul, false> > const&, PortableHostCollection<reco::PFRecHitHCALTopologySoALayout<128ul, false> > const&, PortableHostCollection<reco::PFClusteringVarsSoALayout<128ul, false> >&, PortableHostCollection<reco::PFClusteringEdgeVarsSoALayout<128ul, false> >&, PortableHostCollection<reco::PFRecHitSoALayout<128ul, false> > const&, PortableHostCollection<reco::PFClusterSoALayout<128ul, false> >&, PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#8 0x00007f445fc8ddf8 in alpaka_serial_sync::PFClusterSoAProducer::produce(alpaka_serial_sync::device::Event&, alpaka_serial_sync::device::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#9 0x00007f445fc8c06d in alpaka_serial_sync::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#10 0x00007f44a41d5e91 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#11 0x00007f44a41ba7ae in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#12 0x00007f44a4145669 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x00007f44a4145bd4 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#14 0x00007f44a42fbf28 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#15 0x00007f44a2901281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f449f4d3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#16 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f449f4d3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#17 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#18 0x00007f44a40c8ceb in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007f44a40d265a in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x00007f44a40d2bb1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00007f44a28ed9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#23 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040517c in main ()
Current Modules:
Module: PFClusterSoAProducer@alpaka:hltParticleFlowClusterHBHESoA (crashed)
Module: none
A fatal system signal has occurred: segmentation violation
type pf
Would running in cuda-gdb help to get more info? The last time I used it, it was with CUDBG_USE_LEGACY_DEBUGGER=1 cuda-gdb cmsRun.
The trace was more informative when recompiled with --keep passed to nvcc.
Just to note that (see https://github.com/cms-sw/cmssw/issues/44634#issuecomment-2040080445) I get a crash even on CPU, so I suspect the issue is unrelated to CUDA or GPUs (but it should be double-checked, of course). In that case, the title of the issue should be updated. @wonpoint4
I was wondering if the warning I reported above,
At the end of topoClusterContraction, found large *pcrhFracSize = 2220194
(generated here) might give hints.
It sort of makes sense to me that with a pcrhFracSize this large there would be a crash. The rechit fraction SoA is probably not sized to accommodate this, and some read/write to this SoA is likely causing the segfault and CUDA error.
I am still investigating the PF Alpaka kernel, since this number of rechit fractions seems strangely large when the preceding events look more reasonable.
I'm guessing that pfClusteringVars.pcrhFracSize() is larger than 200000, so at some point we had:
- offsets larger than 200000 (see line 1286)
- pfClusteringVars[rhIdx].seedFracOffsets() larger than 200000 (see line 1289)
- accesses to the fracView SoA with an index larger than 200000 (in many places)

@jsamudio could you check what is the actual SoA size in the event where the crash happens ?
If this overflow is the cause of the crash, what can be done to avoid it? I do not mean in the sense of improving the algorithms, I mean from a technical point of view.
Would it be possible to add a check inside the kernel that computes the offset, and make it fail with an explicit error if the size of the SoA is not large enough, but without crashing or stopping the job, only skipping the offending event?
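As an illustration of that idea, here is a minimal host-side C++ sketch (all names such as FracView, guardedWrite and overflowFlag are invented; the real check would live in the PFCluster alpaka kernels and set an atomic flag in device memory):

#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the rechit-fraction SoA view.
struct FracView {
  std::vector<float> frac;
  std::int32_t capacity() const { return static_cast<std::int32_t>(frac.size()); }
};

// Write only if the index fits; otherwise record the overflow and skip the write.
// In a device kernel the flag would be set atomically in device memory.
inline bool guardedWrite(FracView& view, std::int32_t index, float value,
                         std::atomic<bool>& overflowFlag) {
  if (index < 0 || index >= view.capacity()) {
    overflowFlag.store(true, std::memory_order_relaxed);
    return false;
  }
  view.frac[index] = value;
  return true;
}

int main() {
  FracView view{std::vector<float>(8, 0.f)};
  std::atomic<bool> overflow{false};
  guardedWrite(view, 3, 1.f, overflow);    // in range: written
  guardedWrite(view, 100, 1.f, overflow);  // out of range: flagged, not written
  // After the kernels, the producer could inspect the flag and emit an explicit
  // error (or an empty/flagged product) instead of crashing on the event.
  return overflow.load() ? 1 : 0;
}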
In the event where we see the crash we have 11,244 PF rechits, and the current allocation is nRecHits * 120, so the fraction SoA would have 1,349,280 elements. Here, then, 2,220,194 is obviously outside this.
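For concreteness, the capacity arithmetic quoted above can be checked with a trivial standalone snippet (not CMSSW code; the constants are taken from the numbers in this thread):

#include <cstdio>

int main() {
  constexpr long nRecHits = 11244;                 // PF rechits in the crashing event
  constexpr long perRecHit = 120;                  // current pfRecHitFractionAllocation
  constexpr long capacity = nRecHits * perRecHit;  // 1,349,280 elements
  constexpr long observed = 2220194;               // pcrhFracSize reported in the log
  std::printf("capacity=%ld observed=%ld overflow=%s\n", capacity, observed,
              observed > capacity ? "yes" : "no");
}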
As for adding an error and skipping the event, I understand the idea, but I don't know if I've seen an example of something similar to this before. Perhaps someone else has and could point me to an implementation?
As a quick workaround, would it work to increase the 120 to something like 250 in the HLT menu ?
Not as a long term solution, but to eliminate or at least reduce the online crashes, while a better solution is being investigated.
Would this entail a configuration change or change in the code (new online release)?
I think it's a configuration parameter.
answering myself:
process.hltParticleFlowClusterHBHESoA = cms.EDProducer( "PFClusterSoAProducer@alpaka",
pfRecHits = cms.InputTag( "hltParticleFlowRecHitHBHESoA" ),
pfClusterParams = cms.ESInputTag( "hltESPPFClusterParams","" ),
topology = cms.ESInputTag( "hltESPPFRecHitHCALTopology","" ),
synchronise = cms.bool( False ),
- pfRecHitFractionAllocation = cms.int32( 120 ),
+ pfRecHitFractionAllocation = cms.int32( 250 ),
alpaka = cms.untracked.PSet( backend = cms.untracked.string( "" ) )
)
FTR, I double-checked that https://github.com/cms-sw/cmssw/issues/44634#issuecomment-2041020088 avoids the crash in the reproducer, and the HLT throughput is not affected, so it looks like a good short-term solution.
Two extra notes.
hltParticleFlowClusterHBHESoA and its serial-sync counterpart.

I took a stab at having the error(s) reported properly via exceptions rather than crashes (caused by exceptions being thrown during stack unwinding triggered by an earlier exception). https://github.com/cms-sw/cmssw/pull/44730 should improve the situation (especially when running with CUDA_LAUNCH_BLOCKING=1), although it doesn't completely prevent the crashes (those, at least in the case of the reproducer in this issue, come from direct CUDA code, and might not be worth the effort to address at this point).
While developing the PR I started to wonder whether an Alpaka-specific exception type (or a GPU-runtime-specific one? or a cms::Exception category + exit code?) would be useful to quickly disambiguate GPU-related errors from the rest (although it might be useful to spin off that discussion into its own issue).
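As a purely hypothetical illustration of that idea (this type exists in neither CMSSW nor alpaka; the name is invented), a dedicated exception type for GPU-runtime errors might look something like:

#include <stdexcept>
#include <string>

// Hypothetical exception type so callers (and log scrapers) can tell
// GPU-runtime failures apart from generic std::runtime_error.
class GPURuntimeError : public std::runtime_error {
public:
  GPURuntimeError(std::string const& api, std::string const& what)
      : std::runtime_error("[" + api + "] " + what), api_(api) {}
  std::string const& api() const noexcept { return api_; }

private:
  std::string api_;  // e.g. "CUDA", "HIP"
};

int main() {
  try {
    throw GPURuntimeError("CUDA", "cudaErrorIllegalAddress: an illegal memory access was encountered");
  } catch (GPURuntimeError const&) {
    // A framework-level handler could map this to a dedicated cms::Exception
    // category / exit code, as suggested above.
    return 1;
  }
}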
For the record, this was also tracked at https://its.cern.ch/jira/browse/CMSHLT-3144
Proposed solutions:
- ... (CMSSW_14_2_X)
- ... (CMSSW_14_1_X)

In a CMSSW_14_0_15_patch1 + this commit [1], I've tested that the following script:
#!/bin/bash -ex
#in CMSSW_14_0_15_patch1
hltGetConfiguration run:378940 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/error_stream_root/run378940/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.root > hlt_378940.py
cat <<@EOF >> hlt_378940.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt_378940.py &> hlt_378940.log
was still failing with the following messages:
At the end of topoClusterContraction, found large *pcrhFracSize = 2220194
At the end of topoClusterContraction, found large *pcrhFracSize = 2213019
Out of range index in ViewTemplateFreeParams::operator[]
[...]
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
----- Begin Fatal Exception 07-Oct-2024 10:58:20 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 378940 lumi: 21 event: 5339574 stream: 0
[1] Running path 'DQM_HcalReconstruction_v7'
[2] Calling method for module alpaka_serial_sync::PFClusterSoAProducer/'hltParticleFlowClusterHBHESoACPUSerial'
Exception Message:
A std::exception was thrown.
Out of range index in ViewTemplateFreeParams::operator[]
----- End Fatal Exception -------------------------------------------------
whereas, when cherry-picking the commits from PR https://github.com/cms-sw/cmssw/pull/46136/, the job finishes successfully.
[1]
+1
This issue is fully signed and ready to be closed.
@cmsbuild, please close
Reporting the large number of GPU-related HLT crashes yesterday night (elog).
Here's the recipe to reproduce the crashes (tested with CMSSW_14_0_4 on lxplus8-gpu).
@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI