HLT crash in run 383631 - 'VertexException' and HcalDigisProducerGPU:hltHcalDigisGPU error

jalimena commented 1 month ago

In run 383631 (pp collisions, release CMSSW_14_0_11_MULTIARCHS), we got this error:

An exception of category 'VertexException' occurred while
[0] Processing Event run: 383631 lumi: 444 event: 1064710185 stream: 16
[1] Running path 'HLT_DoublePNetTauhPFJet30_Medium_L2NN_eta2p3_v4'
[2] Calling method for module PrimaryVertexProducer/'hltVerticesPF'
Exception Message:
BasicSingleVertexState::could not invert error matrix 

terminate called after throwing an instance of 'cms::Exception'
what(): An exception of category 'AsyncCallNotAllowed' occurred while
[0] Calling Async::run()
Exception Message:
Framework is shutting down, further run() calls are not allowed

the complete stack trace is attached, the thread that crashed is Thread 5 and the module that crashed is HcalDigisProducerGPU:hltHcalDigisGPU.

Thread 5 (Thread 0x7fde7f85a700 (LWP 679022) "cmsRun"): 
#0  0x00007fdf06c0e0e1 in poll () from /lib64/libc.so.6 
#1  0x00007fdeff3670cf in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so 
#2  0x00007fdeff31b1ec in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so 
#3  0x00007fdeff31b370 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so 
#4   
#5  0x00007fdf06b37aff in raise () from /lib64/libc.so.6 
#6  0x00007fdf06b0aea5 in abort () from /lib64/libc.so.6 
#7  0x00007fdf07514a49 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95 
#8  0x00007fdf0752006a in __cxxabiv1::__terminate (handler=) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48 
#9  0x00007fdf0751f0d9 in __cxa_call_terminate (ue_header=0x7fdd574b88e0) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54 
#10 0x00007fdf0751f7f6 in __cxxabiv1::__gxx_personality_v0 (version=, actions=6, exception_class=5138137972254386944, ue_header=, context=0x7fde7f853500) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:688 
#11 0x00007fdf070e5864 in _Unwind_RaiseException_Phase2 (exc=0x7fdd574b88e0, context=0x7fde7f853500, frames_p=0x7fde7f853408) at ../../../libgcc/unwind.inc:64 
#12 0x00007fdf070e62bd in _Unwind_Resume (exc=0x7fdd574b88e0) at ../../../libgcc/unwind.inc:242 
#13 0x00007fde9ce05966 in cms::cuda::impl::ScopedContextHolderHelper::enqueueCallback(int, CUstream_st*) [clone .cold] () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libHeterogeneousCoreCUDACore.so 
#14 0x00007fde9ce07fef in cms::cuda::ScopedContextAcquire::~ScopedContextAcquire() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libHeterogeneousCoreCUDACore.so 
#15 0x00007fde9829562b in HcalDigisProducerGPU::acquire(edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginEventFilterHcalRawToDigiGPUPlugins.so 
#16 0x00007fdf0966bee9 in edm::stream::doAcquireIfNeeded(edm::stream::impl::ExternalWork*, edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so 
#17 0x00007fdf096726d7 in edm::stream::EDProducerAdaptorBase::doAcquire(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*, edm::WaitingTaskWithArenaHolder&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so 
#18 0x00007fdf09645652 in edm::Worker::runAcquire(edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so 
#19 0x00007fdf096457d6 in edm::Worker::runAcquireAfterAsyncPrefetch(std::__exception_ptr::exception_ptr, edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so 
#20 0x00007fdf095e041f in edm::Worker::AcquireTask, void>::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so 
#21 0x00007fdf0978f1d8 in tbb::detail::d1::function_task::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_11_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreConcurrency.so 
#22 0x00007fdf07d7f95b in tbb::detail::r1::task_dispatcher::local_wait_for_all (t=0x7fdd9ef34200, waiter=..., this=0x7fdf019a0500) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322 
#23 tbb::detail::r1::task_dispatcher::local_wait_for_all (t=0x0, waiter=..., this=0x7fdf019a0500) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458 
#24 tbb::detail::r1::arena::process (tls=..., this=) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:137 
#25 tbb::detail::r1::market::process (this=, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/market.cpp:599 
#26 0x00007fdf07d81b0e in tbb::detail::r1::rml::private_worker::run (this=0x7fdf01995080) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:271 
#27 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fdf01995080) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:221 
#28 0x00007fdf06eb71ca in start_thread () from /lib64/libpthread.so.0 
#29 0x00007fdf06b22e73 in clone () from /lib64/libc.so.6

Complete stack trace should be here: old_hlt_run383631_pid675389.log

I tried to reproduce it, but no success:

#in CMSSW_14_0_11_MULTIARCHS:

hltGetConfiguration run:383631 --globaltag 140X_dataRun3_HLT_v3 --data --no-prescale --no-output --max-events -1 --input 'file:run383631_ls0444_index000423_fu-c2b14-43-01_pid675389.root','file:run383631_ls0445_index000002_fu-c2b14-43-01_pid675389.root','file:run383631_ls0445_index000025_fu-c2b14-43-01_pid675389.root','file:run383631_ls0666_index000027_fu-c2b14-43-01_pid675067.root','file:run383631_ls0666_index000047_fu-c2b14-43-01_pid675067.root','file:run383631_ls0666_index000065_fu-c2b14-43-01_pid675067.root' > hlt_383631.py

cat <@EOF >> hlt_383631.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt_383631.py &> hlt_383631.log

Possibly related to https://github.com/cms-sw/cmssw/issues/41914

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI

Juliette, for FOG

cmsbuild commented 1 month ago

cms-bot internal usage

cmsbuild commented 1 month ago

A new Issue was created by @jalimena.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 1 month ago

For record, also using repacked data with the following script:

#!/bin/bash -ex

# CMSSW_14_0_11_MULTIARCHS

hltGetConfiguration run:383631 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input '/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0444_index000423_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index000025_fu-c2b14
-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000047_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index00
0002_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000027_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_l
s0666_index000065_fu-c2b14-43-01_pid675067.root' > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

doesn't reproduce.

dan131riley commented 1 month ago

@jalimena where can we find the full stack trace for all threads?

jalimena commented 1 month ago

@jalimena where can we find the full stack trace for all threads?

sorry i have uploaded the complete stack trace from the DAQ shifter in the issue description. Let me know if it looks incomplete.

mmusich commented 1 month ago

As the "regular" reproducing script (see https://github.com/cms-sw/cmssw/issues/45555#issuecomment-2249927240) doesn't lead to a crash I have explored the option of running filling with junk the memory on the host and device allocators (cf https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2199627075). Using the following script in CMSSW_14_0_11_MULTIARCHS [1] I get:

----- Begin Fatal Exception 25-Jul-2024 13:09:17 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 383631 lumi: 444 event: 1064704776 stream: 0
   [1] Running path 'DQM_HcalReconstruction_v8'
   [2] Calling method for module PFRecHitSoAProducerHCAL@alpaka/'hltParticleFlowRecHitHBHESoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alp
aka/event/EventUniformCudaHipRt.hpp(160) 'TApi::eventRecord(event.getNativeHandle(), queue.getNativeHandle())' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
----- End Fatal Exception -------------------------------------------------

when doing the same in CMSSW_14_0_12_MULTIARCHS (which contains https://github.com/cms-sw/cmssw/pull/45374 ) I still get:

----- Begin Fatal Exception 25-Jul-2024 13:16:03 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 383631 lumi: 444 event: 1064704776 stream: 0
   [1] Running path 'DQM_HcalReconstruction_v8'
   [2] Calling method for module PFRecHitSoAProducerHCAL@alpaka/'hltParticleFlowRecHitHBHESoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_12_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_12_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alp
aka/event/EventUniformCudaHipRt.hpp(160) 'TApi::eventRecord(event.getNativeHandle(), queue.getNativeHandle())' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
----- End Fatal Exception -------------------------------------------------

Not sure if that's expected.

[1]

Click me

```bash #!/bin/bash -ex # CMSSW_14_0_11_MULTIARCHS hltGetConfiguration run:383631 \ --globaltag 140X_dataRun3_HLT_v3 \ --data \ --no-prescale \ --no-output \ --max-events -1 \ --input '/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0444_index000423_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index000025_fu-c2b14 -43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000047_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index00 0002_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000027_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_l s0666_index000065_fu-c2b14-43-01_pid675067.root' > hlt.py cat <<@EOF >> hlt.py del process.MessageLogger process.load('FWCore.MessageService.MessageLogger_cfi') process.MessageLogger.CUDAService = {} process.MessageLogger.AlpakaService = {} process.load('HeterogeneousCore.CUDAServices.CUDAService_cfi') from HeterogeneousCore.AlpakaServices.AlpakaServiceCudaAsync_cfi import AlpakaServiceCudaAsync as _AlpakaServiceCudaAsync process.AlpakaServiceCudaAsync = _AlpakaServiceCudaAsync.clone( verbose = True, hostAllocator = dict( binGrowth = 2, minBin = 8, # 256 bytes maxBin = 30, # 1 GB maxCachedBytes = 64*1024*1024*1024, # 64 GB maxCachedFraction = 0.8, # or 80%, whatever is less fillAllocations = True, fillAllocationValue = 0xA5, fillReallocations = True, fillReallocationValue = 0x69, fillDeallocations = True, fillDeallocationValue = 0x5A, fillCaches = True, fillCacheValue = 0x96 ), deviceAllocator = dict( binGrowth = 2, minBin = 8, # 256 bytes maxBin = 30, # 1 GB maxCachedBytes = 8*1024*1024*1024, # 8 GB maxCachedFraction = 0.8, # or 80%, whatever is less fillAllocations = True, fillAllocationValue = 0xA5, fillReallocations = True, fillReallocationValue = 0x69, fillDeallocations = True, fillDeallocationValue = 0x5A, fillCaches = True, fillCacheValue = 0x96 ) ) process.options.wantSummary = True process.options.numberOfThreads = 1 process.options.numberOfStreams = 0 @EOF cmsRun hlt.py &> hlt.log ```

dan131riley commented 1 month ago

sorry i have uploaded the complete stack trace from the DAQ shifter in the issue description. Let me know if it looks incomplete.

Thanks. There's a possibly relevant error from the PrimaryVertexProducer, and I guess a question for physics is whether having the BasicSingleVertexState throw on invalid errors is the desired behavior:

%MSG-e UnusableBeamSpot:  PrimaryVertexProducer:hltVerticesPF  25-Jul-2024 08:49:19 CEST Run: 383631 Event: 1064710185
Beamspot with invalid errors [            0           0           0
             0           0           0
             0           0           0 ]
%MSG
----- Begin Fatal Exception 25-Jul-2024 08:49:19 CEST-----------------------
An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 383631 lumi: 444 event: 1064710185 stream: 16
   [1] Running path 'HLT_DoublePNetTauhPFJet30_Medium_L2NN_eta2p3_v4'
   [2] Calling method for module PrimaryVertexProducer/'hltVerticesPF'
Exception Message:
BasicSingleVertexState::could not invert error matrix 
----- End Fatal Exception -------------------------------------------------

The detailed stack trace shows that stack trace interpretation is getting complicated by the async GPU calls--the sig_dostack_then_abort identified as crashed looks to be an innocent bystander due to async ops getting triggered while the framework is shutting down. Tagging framework team, @Dr15Jones @makortel can we avoid throwing in these cases where the framework is already shutting down due to an exception?

mmusich commented 1 month ago

I didn't notice at first glance:

%MSG-e UnusableBeamSpot:  PrimaryVertexProducer:hltVerticesPF  25-Jul-2024 08:49:19 CEST Run: 383631 Event: 1064710185
Beamspot with invalid errors [            0           0           0
             0           0           0
             0           0           0 ]
%MSG

So the plot thickens and looks to go in the direction of the already observed https://github.com/cms-sw/cmssw/issues/41914 in which we suspected beamspot issues. @francescobrivio FYI

mmusich commented 1 month ago

assign hlt, reconstruction, alca

mmusich commented 1 month ago

type tracking

mmusich commented 1 month ago

@cms-sw/tracking-pog-l2 FYI

cmsbuild commented 1 month ago

New categories assigned: hlt,reconstruction,alca

@Martin-Grunewald,@mmusich,@jfernan2,@mandrenguyen,@saumyaphor4252,@perrotta,@consuegs you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented 1 month ago

spoon-feeding a null-covariance matrix fake BeamSpot to the HLT with:

diff --git a/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc b/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc
index 83aa832cfa5..56f0c8948b6 100644
--- a/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc
+++ b/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc
@@ -108,7 +108,7 @@ void BeamSpotOnlineProducer::produce(Event& iEvent, const EventSetup& iSetup) {
         edm::LogWarning("BeamSpotFromDB")
             << "Online Beam Spot producer falls back to DB value because the ESProducer returned a fake beamspot ";
       }
-      fallBackToDB = true;
+      //fallBackToDB = true;
     } else {
       // translate from BeamSpotObjects to reco::BeamSpot
       // in case we need to switch to LHC reference frame
diff --git a/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc b/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc
index 0b8c4233c7c..2c42e2b6d63 100644
--- a/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc
+++ b/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc
@@ -55,13 +55,21 @@ OnlineBeamSpotESProducer::OnlineBeamSpotESProducer(const edm::ParameterSet& p)
   fakeBS_.setPosition(0.0001, 0.0001, 0.0001);
   fakeBS_.setType(-1);
   // Set diagonal covariance, i.e. errors on the parameters
-  fakeBS_.setCovariance(0, 0, 5e-10);
-  fakeBS_.setCovariance(1, 1, 5e-10);
-  fakeBS_.setCovariance(2, 2, 0.002);
-  fakeBS_.setCovariance(3, 3, 0.002);
-  fakeBS_.setCovariance(4, 4, 5e-11);
-  fakeBS_.setCovariance(5, 5, 5e-11);
-  fakeBS_.setCovariance(6, 6, 1e-09);
+  fakeBS_.setCovariance(0, 0, 0.);
+  fakeBS_.setCovariance(1, 1, 0.);
+  fakeBS_.setCovariance(2, 2, 0.);
+  fakeBS_.setCovariance(3, 3, 0.);
+  fakeBS_.setCovariance(4, 4, 0.);
+  fakeBS_.setCovariance(5, 5, 0.);
+  fakeBS_.setCovariance(6, 6, 0.);

   bsHLTToken_ = cc.consumesFrom<BeamSpotOnlineObjects, BeamSpotOnlineHLTObjectsRcd>();
   bsLegacyToken_ = cc.consumesFrom<BeamSpotOnlineObjects, BeamSpotOnlineLegacyObjectsRcd>();
@@ -179,7 +187,7 @@ std::shared_ptr<const BeamSpotObjects> OnlineBeamSpotESProducer::produce(const B
     return std::shared_ptr<const BeamSpotObjects>(best, edm::do_nothing_deleter());
   }
   edm::LogWarning("OnlineBeamSpotESProducer")
-      << "None of the Online BeamSpots in the ES is suitable, \n returning a fake one(fallback to PCL).";
+      << "None of the Online BeamSpots in the ES is suitable, \n returning a fake one (fallback to PCL).";
   return std::shared_ptr<const BeamSpotObjects>(&fakeBS_, edm::do_nothing_deleter());
 }

I sort of reproduce the crash:

 %MSG-e UnusableBeamSpot:  PrimaryVertexProducer:hltVerticesPF  25-Jul-2024 14:08:04 CEST Run: 383631 Event: 1064704786
Beamspot with invalid errors [            0           0           0
             0           0           0
             0           0           0 ]
%MSG
----- Begin Fatal Exception 25-Jul-2024 14:08:04 CEST-----------------------
An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 383631 lumi: 444 event: 1064704786 stream: 0
   [1] Running path 'HLT_ZeroBias_Beamspot_v14'
   [2] Calling method for module PrimaryVertexProducer/'hltVerticesPF'
Exception Message:
BasicSingleVertexState::could not invert error matrix 
----- End Fatal Exception -------------------------------------------------

the questions are:

how comes the HLT consumed such a beam spot (I don't see mentions of fallbacks to the PCL in the log, so it means it was consuming the one coming from one of the DQM online clients)
more importantly why when re-running a posteriori there's no crash.

mmusich commented 1 month ago

more importantly why when re-running a posteriori there's no crash.

there's no such payload in the DB for run 383631:

BeamSpotOnlineLegacy:

Screenshot from 2024-07-25 14-57-46

BeamSpotOnlineHLT:

Screenshot from 2024-07-25 14-57-36

slava77 commented 1 month ago

there's no such payload in the DB for run 383631:

does it look like there was a glitch in the DB access or did the zeroes show up from another part of the logic in the BS producer?

mmusich commented 1 month ago

does it look like there was a glitch in the DB access

Possible, but it's not clear from where it would read it then (I would have expected a framework exception). Notice that even if it had have fallen back to the PCL (the logs do not mention it) at the time of running the last open IoV would have been this one:

did the zeroes show up from another part of the logic in the BS producer?

By looking at the code it's not clear to me from where it could come.

makortel commented 1 month ago

Tagging framework team, @Dr15Jones @makortel can we avoid throwing in these cases where the framework is already shutting down due to an exception?

It could indeed make sense to demote the exception

----- Begin Fatal Exception 25-Jul-2024 08:49:19 CEST-----------------------
An exception of category 'AsyncCallNotAllowed' occurred while
   [0] Calling Async::run()
Exception Message:
Framework is shutting down, further run() calls are not allowed
----- End Fatal Exception -------------------------------------------------
%MSG-w FastMonitoringService:  PostProcessPath 25-Jul-2024 08:49:19 CEST  Run: 383631 Event: 1064835836
 STREAM 9 earlyTermination -: ID:run: 383631 lumi: 445 event: 1064835836 LS:445  FromThisContext
%MSG
terminate called after throwing an instance of 'cms::Exception'
  what():  An exception of category 'AsyncCallNotAllowed' occurred while
   [0] Calling Async::run()
Exception Message:
Framework is shutting down, further run() calls are not allowed

to LogWarning (even if the "Alpaka framework" should avoid double exceptions in this particular case where the "CUDA framework" causes them).

makortel commented 1 month ago

Tagging framework team, @Dr15Jones @makortel can we avoid throwing in these cases where the framework is already shutting down due to an exception?

It could indeed make sense to demote the exceptionFramework is shutting down, further run() calls are not allowed to LogWarning (even if the "Alpaka framework" should avoid double exceptions in this particular case where the "CUDA framework" causes them).

After are more careful look I convinced myself this case has to stay as an exception. Generally we want to stop processing as soon as possible when this situation happens (e.g. in this case to prevent calling HcalDigisProducerGPU::produce() when HcalDigisProducerGPU::acquire() did not synchronize properly), and exception is (by far) the easiest way to communicate that. As I mentioned above, in the "Alpaka framework" addresses the "double exception" deficiency of the "CUDA framework", and at this point I think addressing the latter would not be worth of the effort (because it will be phased out "soon").

Modules in different streams throwing exceptions should be fine, and two independent modules processed by one stream throwing exceptions should also be fine. From the (core) framework perspective the std::terminate() being called here is an (unfortunate) property of the HcalDigisProducerGPU module (really the "CUDA framework").

I would add tests for the "Alpaka framework" that an Alpaka module throwing an exception in acquire() doesn't lead to std::terminate() even then the framework is shutting down at the time because of an exception being thrown elsewhere.

makortel commented 1 month ago

Written that, I wonder what the other exception HcalDigisProducerGPU::acquire() is. Oh wait, the destructor of ScopedContextAcquire is not marked as noexcept(false) https://github.com/cms-sw/cmssw/blob/0efbba9e38c6e067e9327a6aac257083586cbaf5/HeterogeneousCore/CUDACore/interface/ScopedContext.h#L123 which I'd bet is the cause for std::terminate() here 🤦 . I'll fix that.

francescobrivio commented 1 month ago

even if it had have fallen back to the PCL (the logs do not mention it)

Just to add some information, indeed the hltd log doesn't seem to mention any fallback to PCL, but the online DQM monitoring under HLT/BeamSpotMonitor/Validation/bsChoice, which should exactly tell which of the BeamSpot is being consumed, seem to indicate that exactly at LSs 441-444 the "fallback to PCL mechanism" kicked in:

But in any case, as Marco mentioned above, the payload consumed from PCL would have been OK.

mmusich commented 1 month ago

indeed the hltd log doesn't seem to mention any fallback to PCL,

It's not clear why e.g. at least this:

https://github.com/cms-sw/cmssw/blob/4c86b77b1a317283036b4535bacc12c510d53e9d/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc#L181-L182

is not in the logs, but for sure this:

https://github.com/cms-sw/cmssw/blob/4c86b77b1a317283036b4535bacc12c510d53e9d/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc#L108-L109

is manually suppressed in the HLT menu itself via the following snippet:

process.MessageLogger = cms.Service( "MessageLogger",
    suppressWarning = cms.untracked.vstring( 'hltL3MuonsIOHit',
      'hltL3MuonsOIHit',
      'hltL3MuonsOIState',
      'hltLightPFTracks',
      'hltOnlineBeamSpot',
      'hltPixelTracks',
      'hltPixelTracksForHighMult',
      'hltSiPixelClusters',
      'hltSiPixelDigis' ),
    suppressFwkInfo = cms.untracked.vstring(  ),
    suppressInfo = cms.untracked.vstring(  ),
    suppressDebug = cms.untracked.vstring(  ),
    debugModules = cms.untracked.vstring(  ),
    cerr = cms.untracked.PSet( 
      INFO = cms.untracked.PSet(  limit = cms.untracked.int32( 0 ) ),
      noTimeStamps = cms.untracked.bool( False ),
      FwkReport = cms.untracked.PSet( 
        reportEvery = cms.untracked.int32( 1 ),
        limit = cms.untracked.int32( 0 )
      ),
      default = cms.untracked.PSet(  limit = cms.untracked.int32( 10000000 ) ),
      Root_NoDictionary = cms.untracked.PSet(  limit = cms.untracked.int32( 0 ) ),
      FwkJob = cms.untracked.PSet(  limit = cms.untracked.int32( 0 ) ),
      FwkSummary = cms.untracked.PSet( 
        reportEvery = cms.untracked.int32( 1 ),
        limit = cms.untracked.int32( 10000000 )
      ),
      threshold = cms.untracked.string( "INFO" ),
      enableStatistics = cms.untracked.bool( False )
    ),
    suppressError = cms.untracked.vstring( 'hltL3TkTracksFromL2IOHit',
      'hltL3TkTracksFromL2OIHit',
      'hltL3TkTracksFromL2OIState',
      'hltOnlineBeamSpot' )
)

I propose to lift the suppression at https://its.cern.ch/jira/browse/CMSHLT-3298

Dr15Jones commented 1 month ago

@mmusich one possibility for a missing message is if the MessageLogger gets too many request for messages at the same time it will drop some of them in order to keep up with the requests.

mmusich commented 1 month ago

one possibility for a missing message is if the MessageLogger gets too many request for messages at the same time it will drop some of them in order to keep up with the requests.

that's a possibility, but I am not very convinced as there are other warnings from the same event that crashed (just before the crash).

mmusich commented 1 month ago

which should exactly tell which of the BeamSpot is being consumed, seem to indicate that exactly at LSs 441-444 the "fallback to PCL mechanism" kicked in.

@cms-sw/db-l2 @PonIlya is it possible to have an audit of the P5 frontier squid at the time of the crash ( 25-Jul-2024 08:49:19 )?

PonIlya commented 4 weeks ago

which should exactly tell which of the BeamSpot is being consumed, seem to indicate that exactly at LSs 441-444 the "fallback to PCL mechanism" kicked in.

@cms-sw/db-l2 @PonIlya is it possible to have an audit of the P5 frontier squid at the time of the crash ( 25-Jul-2024 08:49:19 )? Do you need log analysis for this time period? What exactly are we expecting to find? I think I need to create a ticket for Frontier support team.

mmusich commented 4 weeks ago

@PonIlya , thank you for the reply.

Do you need log analysis for this time period? What exactly are we expecting to find? I think I need to create a ticket for Frontier support team.

to be perfectly honest I am not sure, but there are indications that somehow the connection to the DB "glitched" during this period of time. As a DB expert you are perhaps in a better position to judge if there was any anomaly during that time.

PonIlya commented 3 weeks ago

@mmusich Unfortunately, the logs are no longer available for audit. Let me know if a similar issue arises again so that I can respond more quickly. The logs are deleted once they reach a certain size. For example, as of today, only logs from August 7th are available.

mmusich commented 3 weeks ago

@PonIlya thanks.

Let me know if a similar issue arises again so that I can respond more quickly.

Sure, will do. In the meanwhile perhaps you could take a look to the related issue https://github.com/cms-sw/cmssw/issues/45714

cms-sw / cmssw

HLT crash in run 383631 - 'VertexException' and HcalDigisProducerGPU:hltHcalDigisGPU error #45555