cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

HLT Farm crashes in run 378940 #44634

Closed: wonpoint4 closed this 1 month ago

wonpoint4 commented 7 months ago

Reporting the large number of GPU-related HLT crashes last night (elog).

Here's the recipe to reproduce the crashes (tested with CMSSW_14_0_4 on lxplus8-gpu):

cmsrel CMSSW_14_0_4
cd CMSSW_14_0_4/src
cmsenv

https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 378940 > hlt_run378940.py
cat <<@EOF >> hlt_run378940.py
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
    buBaseDir = '/eos/cms/store/group/phys_muon/wjun/error_stream',
    runNumber = 378940
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
    fileListMode = True,
    fileNames = (
        '/eos/cms/store/group/phys_muon/wjun/error_stream/run378940/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.raw',
    )
)
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

mkdir run378940
cmsRun hlt_run378940.py &> crash_run378940.log

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI

cmsbuild commented 7 months ago

cms-bot internal usage

cmsbuild commented 7 months ago

A new Issue was created by @wonpoint4.

@makortel, @sextonkennedy, @antoniovilela, @Dr15Jones, @smuzaffar, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 7 months ago

assign hlt, heterogeneous

cmsbuild commented 7 months ago

New categories assigned: hlt,heterogeneous

@Martin-Grunewald, @mmusich, @fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented 7 months ago

Executing the reproducer with CUDA_LAUNCH_BLOCKING=1, I see the following stack:

Fri Apr  5 15:43:25 CEST 2024
Thread 12 (Thread 0x7f8afb1ff640 (LWP 1360807) "cmsRun"):
#0  0x00007f8b5934291f in poll () from /lib64/libc.so.6
#1  0x00007f8b52e5a62f in full_read.constprop () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f8b52e0ee3c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f8b52e0f7a0 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f8b592a154c in __pthread_kill_implementation () from /lib64/libc.so.6
#6  0x00007f8b59254d06 in raise () from /lib64/libc.so.6
#7  0x00007f8b592287f3 in abort () from /lib64/libc.so.6
#8  0x00007f8b596aeeea in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:50
#9  0x00007f8b596ace6a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#10 0x00007f8b596abed9 in __cxa_call_terminate (ue_header=0x7f8af8d143c0) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#11 0x00007f8b596ac5f6 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=<optimized out>, context=0x7f8afb1f88e0) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:688
#12 0x00007f8b5a11b864 in _Unwind_RaiseException_Phase2 (exc=0x7f8af8d143c0, context=0x7f8afb1f88e0, frames_p=0x7f8afb1f87e8) at ../../../libgcc/unwind.inc:64
#13 0x00007f8b5a11c2bd in _Unwind_Resume (exc=0x7f8af8d143c0) at ../../../libgcc/unwind.inc:242
#14 0x00007f8af1e2e5aa in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> >::free(void*) [clone .cold] () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#15 0x00007f8af1e3bc68 in std::_Sp_counted_ptr_inplace<alpaka::detail::BufCpuImpl<std::byte, std::integral_constant<unsigned long, 1ul>, unsigned int>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#16 0x00007f8af1e30f17 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#17 0x00007f8af1e500bc in std::_Sp_counted_ptr_inplace<std::tuple<PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >, std::shared_ptr<alpaka_cuda_async::EDMetadata> >, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#18 0x00007f8af1e30f17 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#19 0x00007f8af1e4b4bc in std::any::_Manager_external<std::shared_ptr<std::tuple<PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >, std::shared_ptr<alpaka_cuda_async::EDMetadata> > > >::_S_manage(std::any::_Op, std::any const*, std::any::_Arg*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#20 0x00007f8b5b6186ca in std::_Sp_counted_ptr_inplace<std::any, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreFramework.so

@cms-sw/pf-l2 FYI

fwyzard commented 7 months ago

From the stack trace, it seems that an exception was thrown while another exception was being handled:

#6  0x00007f8b59254d06 in raise () from /lib64/libc.so.6

while

#12 0x00007f8b5a11b864 in _Unwind_RaiseException_Phase2 (exc=0x7f8af8d143c0, context=0x7f8afb1f88e0, frames_p=0x7f8afb1f87e8) at ../../../libgcc/unwind.inc:64
#13 0x00007f8b5a11c2bd in _Unwind_Resume (exc=0x7f8af8d143c0) at ../../../libgcc/unwind.inc:242

@mmusich, if you have time to look into this further, could you try running with a single stream / single thread, and post the full stack trace?

mmusich commented 7 months ago

if you have time to look into this further, could you try running with a single stream / single thread, and post the full stack trace ?

Sure. Adding to the configuration file:

process.options.numberOfThreads = 1
process.options.numberOfStreams = 1

I get the stack trace attached here: crash_run378940.log. Right before the stack trace, I notice:

At the end of topoClusterContraction, found large *pcrhFracSize = 2220194
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
terminate called after throwing an instance of 'std::runtime_error'
  what():  /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(160) 'TApi::eventRecord(event.getNativeHandle(), queue.getNativeHandle())' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!

fwyzard commented 7 months ago

Thanks.

So, "cudaErrorIllegalAddress" is basically the GPU equivalent of "Segmentation violation" :-(

What happens in the stack trace is that, once we hit a CUDA error, we raise an exception and start unwinding the stack. While doing that, we try to free some CUDA memory, but that call also fails (because the cudaErrorIllegalAddress condition is still present), which triggers a second exception. A second exception thrown during stack unwinding cannot be handled, which causes the abort.

Of course this doesn't explain the reason for the error that we hit in the first place... that will need to be debugged.
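
For illustration, here is a standalone C++ sketch (not CMSSW code; all names are invented) of the failure mode described above: a first exception starts stack unwinding, a destructor that runs during the unwinding throws a second exception, and the runtime responds by calling std::terminate(), which aborts the process.

#include <cstdio>
#include <stdexcept>

// Stand-in for a buffer whose destructor frees device memory. If the
// free itself fails (simulated here by a throw), the exception escapes
// a destructor that runs while the stack is already unwinding.
struct DeviceBuffer {
  ~DeviceBuffer() noexcept(false) {
    // simulates hostFree() reporting that cudaErrorIllegalAddress is still pending
    throw std::runtime_error("free failed: illegal address still pending");
  }
};

int main() {
  try {
    DeviceBuffer buf;
    // simulates the first CUDA error being turned into an exception
    throw std::runtime_error("kernel failed: illegal memory access");
    // while unwinding, ~DeviceBuffer() throws a second exception;
    // the runtime calls std::terminate() and the process aborts,
    // so the catch block below is never reached
  } catch (std::runtime_error const& e) {
    std::printf("caught: %s\n", e.what());
  }
}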

missirol commented 7 months ago

Here's a second reproducer (same input events). I see the segfault also when running on CPU only.

#!/bin/bash -ex

# CMSSW_14_0_4

hltGetConfiguration run:378940 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240405_run378940/files/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

process.options.accelerators = ["*"]
@EOF

CUDA_LAUNCH_BLOCKING=1 \
cmsRun hlt.py &> hlt.log

Stack trace here: hlt.log.

Thread 1 (Thread 0x7f44a0bac640 (LWP 3012403) "cmsRun"):
#0  0x00007f44a1779301 in poll () from /lib64/libc.so.6
#1  0x00007f44967d56af in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f4496789dbc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f449678a720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f445fc94340 in void alpaka_serial_sync::FastCluster::operator()<false, alpaka::AccCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int>, std::enable_if<false, void> >(alpaka::AccCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int> const&, reco::PFRecHitSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusterParamsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFRecHitHCALTopologySoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusteringVarsSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusterSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>, reco::PFRecHitFractionSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>) const [clone .constprop.0] [clone .isra.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#6  0x00007f445fc95904 in alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka_serial_sync::FastCluster, reco::PFRecHitSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFClusterParamsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFRecHitHCALTopologySoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFClusteringVarsSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&, reco::PFClusterSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&, reco::PFRecHitFractionSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&>::operator()() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#7  0x00007f445fc9735f in alpaka_serial_sync::PFClusterProducerKernel::execute(alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&, PortableHostCollection<reco::PFClusterParamsSoALayout<128ul, false> > const&, PortableHostCollection<reco::PFRecHitHCALTopologySoALayout<128ul, false> > const&, PortableHostCollection<reco::PFClusteringVarsSoALayout<128ul, false> >&, PortableHostCollection<reco::PFClusteringEdgeVarsSoALayout<128ul, false> >&, PortableHostCollection<reco::PFRecHitSoALayout<128ul, false> > const&, PortableHostCollection<reco::PFClusterSoALayout<128ul, false> >&, PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#8  0x00007f445fc8ddf8 in alpaka_serial_sync::PFClusterSoAProducer::produce(alpaka_serial_sync::device::Event&, alpaka_serial_sync::device::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#9  0x00007f445fc8c06d in alpaka_serial_sync::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#10 0x00007f44a41d5e91 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#11 0x00007f44a41ba7ae in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#12 0x00007f44a4145669 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x00007f44a4145bd4 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#14 0x00007f44a42fbf28 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#15 0x00007f44a2901281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f449f4d3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#16 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f449f4d3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#17 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#18 0x00007f44a40c8ceb in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007f44a40d265a in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x00007f44a40d2bb1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00007f44a28ed9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#23 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040517c in main ()

Current Modules:

Module: PFClusterSoAProducer@alpaka:hltParticleFlowClusterHBHESoA (crashed)
Module: none

A fatal system signal has occurred: segmentation violation

mmusich commented 7 months ago

type pf

slava77 commented 7 months ago

would running in cuda-gdb help to get more info? The last time I used it, it was with CUDBG_USE_LEGACY_DEBUGGER=1 cuda-gdb cmsRun

slava77 commented 7 months ago

would running in cuda-gdb help to get more info? The last time I used it, it was with CUDBG_USE_LEGACY_DEBUGGER=1 cuda-gdb cmsRun

the trace was more informative when recompiled with --keep passed to nvcc

missirol commented 7 months ago

Just to note that (see https://github.com/cms-sw/cmssw/issues/44634#issuecomment-2040080445) I get a crash even on CPU, so I suspect the issue is unrelated to CUDA or GPUs (but this should be double-checked, of course). If so, the title of the issue should be updated. @wonpoint4

mmusich commented 7 months ago

I get a crash even on CPU, so I suspect the issue is unrelated to CUDA or GPUs (but it should be double-checked, of course).

I was wondering if the warning I reported above

At the end of topoClusterContraction, found large *pcrhFracSize = 2220194

generated here:

https://github.com/cms-sw/cmssw/blob/f5861db9bb835bae35f4c34945f494e64ed06c35/RecoParticleFlow/PFClusterProducer/plugins/alpaka/PFClusterSoAProducerKernel.dev.cc#L1308-L1311

might give hints.

jsamudio commented 7 months ago

It makes sense to me that a pcrhFracSize this large would lead to a crash: the rechit fraction SoA is probably not sized for it, and some read/write beyond its bounds is likely causing both the segfault and the CUDA error.

I am still investigating the PF Alpaka kernel, since this number of rechit fractions seems strangely large, while the preceding events look reasonable.

fwyzard commented 7 months ago

@jsamudio could you check what the actual SoA size is in the event where the crash happens?

If this overflow is the cause of the crash, what can be done to avoid it? I do not mean in the sense of improving the algorithm; I mean from a technical point of view. Would it be possible to add a check inside the kernel that computes the offset, and make it fail with an explicit error if the size of the SoA is not large enough, without crashing or stopping the job, only skipping the offending event?

jsamudio commented 7 months ago

In the event where we see the crash we have 11,244 PF rechits, and the current allocation is nRecHits * 120, so the fraction SoA has 1,349,280 elements; 2,220,194 is obviously beyond this.

As for adding an error and skipping the event: I understand the idea, but I don't think I've seen an example of something like this before. Perhaps someone else has, and could point me to an implementation?
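
One possible shape for such a check, as a minimal sketch under assumed names (FractionView, storeFraction and overflowFlag are invented for illustration and are not the actual CMSSW or Alpaka API): the kernel refuses to write out of bounds and instead raises a flag that the producer inspects once the kernel has completed.

#include <cstdint>

// Hypothetical view of the rechit-fraction SoA.
struct FractionView {
  float* frac;
  int32_t capacity;       // e.g. nRecHits * pfRecHitFractionAllocation
  int32_t* overflowFlag;  // set to 1 if any write would go out of range
};

// Record the overflow and skip the store instead of writing out of
// bounds (in real device code the flag would be set atomically).
inline bool storeFraction(FractionView view, int32_t index, float value) {
  if (index < 0 || index >= view.capacity) {
    *view.overflowFlag = 1;
    return false;
  }
  view.frac[index] = value;
  return true;
}

// Host side, after the kernel has run: if the flag is raised, report an
// explicit, catchable error for this event instead of letting the GPU
// hit an illegal address:
//   if (*overflowFlag != 0) { /* flag or skip the offending event */ }

The write itself then becomes safe; how the framework would skip only the offending event, rather than failing the job, is the part without an obvious precedent.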

fwyzard commented 7 months ago

As a quick workaround, would it work to increase the 120 to something like 250 in the HLT menu?

Not as a long-term solution, but to eliminate, or at least reduce, the online crashes while a better solution is being investigated.

mmusich commented 7 months ago

As a quick workaround, would it work to increase the 120 to something like 250 in the HLT menu?

Would this entail a configuration change, or a change in the code (i.e. a new online release)?

fwyzard commented 7 months ago

I think it's a configuration parameter.

mmusich commented 7 months ago

Would this entail a configuration change, or a change in the code (i.e. a new online release)?

Answering myself:

process.hltParticleFlowClusterHBHESoA = cms.EDProducer( "PFClusterSoAProducer@alpaka",
    pfRecHits = cms.InputTag( "hltParticleFlowRecHitHBHESoA" ),
    pfClusterParams = cms.ESInputTag( "hltESPPFClusterParams","" ),
    topology = cms.ESInputTag( "hltESPPFRecHitHCALTopology","" ),
    synchronise = cms.bool( False ),
-    pfRecHitFractionAllocation = cms.int32( 120 ),
+    pfRecHitFractionAllocation = cms.int32( 250 ),
    alpaka = cms.untracked.PSet(  backend = cms.untracked.string( "" ) )
)

missirol commented 7 months ago

FTR, I double-checked that https://github.com/cms-sw/cmssw/issues/44634#issuecomment-2041020088 avoids the crash in the reproducer, and the HLT throughput is not affected, so it looks like a good short-term solution.

makortel commented 7 months ago

I took a stab at having the error(s) reported properly via exceptions rather than crashes (the crashes being caused by exceptions thrown during a stack unwinding that was itself triggered by an exception). https://github.com/cms-sw/cmssw/pull/44730 should improve the situation (especially when running with CUDA_LAUNCH_BLOCKING=1), although it doesn't completely prevent the crashes (which, at least in the case of the reproducer in this issue, come from direct CUDA code; that is probably not worth the effort to address at this point).

While developing the PR I started to wonder whether an Alpaka-specific exception type (or a GPU-runtime-specific one? or a cms::Exception category plus exit code?) would be useful to quickly disambiguate GPU-related errors from the rest (although it might be best to spin that discussion off into its own issue).
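
As a sketch of that idea (illustrative only; GPURuntimeError is an invented name, not an existing CMSSW or Alpaka type), a dedicated exception class would let callers tell GPU runtime failures apart from everything else at the catch site, for example to map them to a distinct exit code:

#include <stdexcept>
#include <string>

// Hypothetical dedicated type for GPU runtime failures.
class GPURuntimeError : public std::runtime_error {
public:
  GPURuntimeError(std::string const& apiCall, std::string const& message)
      : std::runtime_error(apiCall + ": " + message) {}
};

// A caller could then disambiguate quickly:
//   try { runKernels(); }
//   catch (GPURuntimeError const& e) { /* GPU-specific handling */ }
//   catch (std::exception const& e)  { /* everything else */ }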

mmusich commented 7 months ago

For the record, this was also tracked at https://its.cern.ch/jira/browse/CMSHLT-3144.

mmusich commented 1 month ago

Proposed solutions:

In CMSSW_14_0_15_patch1 plus this commit [1], I've tested that the following script:

#!/bin/bash -ex

#in CMSSW_14_0_15_patch1

hltGetConfiguration run:378940 \
            --globaltag 140X_dataRun3_HLT_v3 \
            --data \
            --no-prescale \
            --no-output \
            --max-events -1 \
            --input /store/group/tsg/FOG/error_stream_root/run378940/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.root > hlt_378940.py

cat <<@EOF >> hlt_378940.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt_378940.py &> hlt_378940.log

was still failing with the following messages:

At the end of topoClusterContraction, found large *pcrhFracSize = 2220194
At the end of topoClusterContraction, found large *pcrhFracSize = 2213019
Out of range index in ViewTemplateFreeParams::operator[]
  [...]
  /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
----- Begin Fatal Exception 07-Oct-2024 10:58:20 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 378940 lumi: 21 event: 5339574 stream: 0
   [1] Running path 'DQM_HcalReconstruction_v7'
   [2] Calling method for module alpaka_serial_sync::PFClusterSoAProducer/'hltParticleFlowClusterHBHESoACPUSerial'
Exception Message:
A std::exception was thrown.
Out of range index in ViewTemplateFreeParams::operator[]
----- End Fatal Exception -------------------------------------------------

whereas, after cherry-picking the commits from PR https://github.com/cms-sw/cmssw/pull/46136/, the job finishes successfully.

[1]

commit e119a60a1e01b4fe2f6444f43787ea92cc4f1911
Author: mmusich
Date:   Mon Oct 7 11:01:42 2024 +0200

    re-introduce customizeHLTfor44591

diff --git a/HLTrigger/Configuration/python/customizeHLTforCMSSW.py b/HLTrigger/Configuration/python/customizeHLTforCMSSW.py
index f44657dfa5f..83e2966d8e0 100644
--- a/HLTrigger/Configuration/python/customizeHLTforCMSSW.py
+++ b/HLTrigger/Configuration/python/customizeHLTforCMSSW.py
@@ -261,6 +261,17 @@ def checkHLTfor43774(process):
     return process

+
+def customizeHLTfor44591(process):
+    """
+    Customisation for running HLT with the updated btag info producers from the PR 44591
+    """
+    for type in ["DeepFlavourTagInfoProducer", "ParticleTransformerAK4TagInfoProducer", "DeepBoostedJetTagInfoProducer"]:
+        for producer in producers_by_type(process, type):
+            if hasattr(producer, 'unsubjet_map'):
+                delattr(producer, 'unsubjet_map')
+    return process
+
 # CMSSW version specific customizations
 def customizeHLTforCMSSW(process, menuType="GRun"):
@@ -270,5 +281,6 @@ def customizeHLTforCMSSW(process, menuType="GRun"):
     # process = customiseFor12718(process)
     process = checkHLTfor43774(process)
-
+    process = customizeHLTfor44591(process)
+
     return process

mmusich commented 1 month ago

+hlt

fwyzard commented 1 month ago

+1

cmsbuild commented 1 month ago

This issue is fully signed and ready to be closed.

makortel commented 1 month ago

@cmsbuild, please close