cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

HLT crash after V1.4.1 menu deploy - 'cudaErrorIllegalAddress'/alpaka_serial_sync::PFRecHitSoAProducerHCAL (run 383830) #45595

Closed: vince502 closed this issue 1 week ago

vince502 commented 1 month ago

During the recent pp physics fills 9945 and 9947, we deployed the new HLT menu /cdaq/physics/Run2024/2e34/v1.4.1/HLT/V2, and we started to see crashes in processes during the run (elog). The crashes come from an illegal memory access.

In particular, we quote the errors from run 383830 in this issue.

So far, crashes have also been reproduced on the offline side using the same error stream files.

From f3mon, the crashes in the process look like:

An exception of category 'StdException' occurred while
[0] Processing Event run: 383830 lumi: 83 event: 113500824 stream: 22
[1] Running path 'DST_PFScouting_DoubleMuon_v5'
[2] Calling method for module HcalDigisSoAProducer@alpaka/'hltHcalDigisSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_12_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_12_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(143) 'ret = TApi::eventQuery(event.getNativeHandle())' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
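
Note that the module named in the stack (hltHcalDigisSoA) is not necessarily the one at fault: an illegal memory access inside any CUDA kernel is reported asynchronously and surfaces at the next runtime API call on the same context, here the alpaka event query. A minimal standalone sketch of this behavior (not CMSSW code; plain C++ against the CUDA runtime, compiled with nvcc; the oversized index is just an arbitrary way to trigger the fault):

```cpp
// Minimal sketch: an illegal access in one kernel is only reported by a
// *later* CUDA call, here cudaEventQuery, mirroring the alpaka
// EventUniformCudaHipRt wrapper quoted in the exception message above.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void outOfBounds(int* buf) {
  buf[1 << 28] = 42;  // writes ~1 GB past a 4-byte allocation: illegal address
}

int main() {
  int* d_buf = nullptr;
  cudaMalloc(&d_buf, sizeof(int));

  cudaEvent_t ev;
  cudaEventCreate(&ev);

  outOfBounds<<<1, 1>>>(d_buf);  // the launch itself returns no error
  cudaEventRecord(ev);

  // The fault only materializes when the runtime is next polled:
  cudaError_t ret;
  while ((ret = cudaEventQuery(ev)) == cudaErrorNotReady) {
  }
  if (ret != cudaSuccess)
    std::printf("eventQuery returned: %s\n", cudaGetErrorString(ret));  // cudaErrorIllegalAddress
  return 0;
}
```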

Setup to reproduce crashes (not necessarily the same exact problem) from the streamer files:

# I first copied files from /eos/cms/store/group/tsg/FOG/error_stream_root/run383830/
# using gpu-c2a02-39-04
cmsrel CMSSW_14_0_12_MULTIARCHS
cd CMSSW_14_0_12_MULTIARCHS/src/
cmsenv
mkdir crashAlpakaFiles
scp lxplus.cern.ch:/eos/cms/store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root crashAlpakaFiles

hltGetConfiguration run:383830 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'file:crashAlpakaFiles/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root' \
> hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1 # 8 for method 2 below
process.options.numberOfStreams = 0

@EOF

In hlt.py I then tried several settings to reproduce the same CUDA memory access error:

  1. Plain run

     cmsRun hlt.py 2>&1 | tee log_method1.txt

     This results in a crash in Module: alpaka_serial_sync::PFRecHitSoAProducerHCAL:hltParticleFlowRecHitHBHESoASerialSync (crashed). Output: log_method1.txt

  2. Force-enable the CUDA modules

     echo "process.options.accelerators = ['gpu-nvidia']" >> hlt.py

     AND remove the following paths from the process.schedule = cms.Schedule( ... ) block in hlt.py:

     process.DQM_PixelReconstruction_v11,
     process.DQM_EcalReconstruction_v11,
     process.DQM_HcalReconstruction_v9,
     process.Dataset_DQMGPUvsCPU

     then run cmsRun hlt.py 2>&1 | tee logForce_method2.txt

     Output: logForce_method2.txt

cmsbuild commented 1 month ago

cms-bot internal usage

cmsbuild commented 1 month ago

A new Issue was created by @vince502.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 1 month ago

assign hlt, heterogeneous, reconstruction

cmsbuild commented 1 month ago

New categories assigned: hlt,heterogeneous,reconstruction

@Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

jsamudio commented 1 month ago

Investigating from PF side

mmusich commented 1 month ago

@cms-sw/pf-l2 FYI

mmusich commented 1 month ago

type pf

jsamudio commented 1 month ago

For the serial sync crash, with gdb I see:

Thread 1 "cmsRun" received signal SIGSEGV, Segmentation fault.
0x00007fff3463d43a in alpaka_serial_sync::PFRecHitProducerKernelConstruct<alpaka_serial_sync::particleFlowRecHitProducer::HCAL>::applyCuts (rh=..., params=..., topology=...)
    at src/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc:63
63            threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];

And printing out the detId of the hit, I see that it is detId == 0, meaning denseId == HCAL::kInvalidDenseId, which is std::numeric_limits<uint32_t>::max(), so the resulting out-of-bounds lookup gives the segfault here.

I will check with the method that produces the CUDA errors on GPU, but generally I would suspect the same thing if there is a rechit with detId == 0.
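
For illustration, a self-contained distillation of this failure mode (the names and the mapping below are made up, not the actual CMSSW values): an unrecognized detId maps to the invalid dense index, which then indexes far outside the threshold array.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

constexpr uint32_t kInvalidDenseId = std::numeric_limits<uint32_t>::max();

uint32_t detId2denseId(uint32_t detId) {
  // The real code maps valid HCAL detIds into a compact range; anything
  // unrecognized (such as detId == 0) yields the invalid marker.
  return detId == 0 ? kInvalidDenseId : detId % 16000;  // illustrative mapping
}

float lookupThreshold(const std::vector<float>& noiseThreshold, uint32_t detId) {
  const uint32_t denseId = detId2denseId(detId);
  // Without this guard, denseId == 0xffffffff reads ~4 G elements past the
  // buffer: on the serial backend that is a plain SIGSEGV, on the GPU it
  // becomes the cudaErrorIllegalAddress reported above.
  if (denseId == kInvalidDenseId)
    return -1.f;  // caller should reject the hit, as in the proposed fix
  return noiseThreshold[denseId];
}
```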

mmusich commented 1 month ago

@kakwok FYI

mmusich commented 1 month ago

also @abdoulline @cms-sw/hcal-dpg-l2

mmusich commented 1 month ago

I will check with the method that produces the CUDA errors on GPU, but generally I would suspect the same thing if there is a rechit with detId == 0.

FWIW I confirm that with this:

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..cd7f215abf1 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,6 +59,12 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     const uint32_t detId = rh.detId();
     const uint32_t depth = HCAL::getDepth(detId);
     const uint32_t subdet = getSubdet(detId);
+
+    if (detId == 0) {
+      printf("Rechit with detId %u has subdetector %u and depth %u ! \n", detId, subdet, depth);
+      return false;
+    }
+
     if (topology.cutsFromDB()) {
       threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];
     } else {

the reproducer runs to completion, both when forcing the backend to be serial [1] and when forcing it to be gpu [2]:

[1] CPU only

```bash
#!/bin/bash -ex

# CMSSW_14_0_12_MULTIARCHS

hltGetConfiguration run:383830 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root > hlt.py

cat <<@EOF >> hlt.py
try:
    del process.MessageLogger
    process.load('FWCore.MessageLogger.MessageLogger_cfi')
    process.MessageLogger.cerr.enableStatistics = False
except:
    pass

process.source.skipEvents = cms.untracked.uint32( 74 )
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ['cpu']
@EOF

cmsRun hlt.py &> hlt.log
```
[2] GPU only

```bash
#!/bin/bash -ex

# CMSSW_14_0_12_MULTIARCHS

hltGetConfiguration run:383830 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root > hlt2.py

cat <<@EOF >> hlt2.py
try:
    del process.MessageLogger
    process.load('FWCore.MessageLogger.MessageLogger_cfi')
    process.MessageLogger.cerr.enableStatistics = False
except:
    pass

process.source.skipEvents = cms.untracked.uint32( 74 )
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ['gpu-nvidia']

rmPaths = set()
for pathName in process.paths_():
    if 'DQM' in pathName:
        rmPaths.add(pathName)
    elif 'CPUOnly' in pathName:
        rmPaths.add(pathName)

for rmPath in rmPaths:
    process.__delattr__(rmPath)

process.hltAlCaPFJet40GPUxorCPUFilter.triggerConditions = cms.vstring( 'AlCa_PFJet40_v31' )
@EOF

cmsRun hlt2.py &> hlt2.log
```
mmusich commented 1 month ago

A slightly more elegant version is:

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..576342bc16a 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,8 +59,14 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     const uint32_t detId = rh.detId();
     const uint32_t depth = HCAL::getDepth(detId);
     const uint32_t subdet = getSubdet(detId);
+
     if (topology.cutsFromDB()) {
-      threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];
+      const auto& denseId = HCAL::detId2denseId(detId);
+      if (denseId != HCAL::kInvalidDenseId) {
+        threshold = topology.noiseThreshold()[denseId];
+      } else {
+        return false;
+      }
     } else {
       if (subdet == HcalBarrel) {
         threshold = params.energyThresholds()[depth - 1];

Of course, the origin of such rechits associated to detId = 0 remains to be understood. By the way, I guess it would help if we emitted different printouts here

https://github.com/cms-sw/cmssw/blob/3a385433ede8b150a8e01c1f8abb219dfdb8a50f/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/CalorimeterDefinitions.h#L114

vs here:

https://github.com/cms-sw/cmssw/blob/3a385433ede8b150a8e01c1f8abb219dfdb8a50f/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/CalorimeterDefinitions.h#L200
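
Something along these lines could do it (a sketch only; the actual contents of the two CalorimeterDefinitions.h call sites are assumed here): route both checks through one helper that tags the printout with its call site, so the two paths become distinguishable in the logs.

```cpp
#include <cstdint>
#include <cstdio>
#include <limits>

constexpr uint32_t kInvalidDenseId = std::numeric_limits<uint32_t>::max();

// Emit a call-site-specific message when the dense-id lookup fails.
inline uint32_t checkedDenseId(uint32_t denseId, uint32_t detId, const char* site) {
  if (denseId == kInvalidDenseId)
    std::printf("PFRecHitProducer: invalid HCAL detId %u in %s\n", detId, site);
  return denseId;
}

// Hypothetical call sites, mirroring the two links above:
//   checkedDenseId(HCAL::detId2denseId(detId), detId, "detId2denseId");
//   checkedDenseId(HCAL::detId2denseId(detId), detId, "getPositionInfo");
```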

missirol commented 1 month ago

@kakwok @jsamudio

One question (maybe just for my own understanding).

In HcalRecHitSoAToLegacy, hits in SoA format corresponding to "bad channels" (chi2 < 0) are skipped (not converted to legacy).

Should they also be skipped (and if so, are they?) in the PFRecHit+Cluster reconstruction in Alpaka (which starts from the HBHE RecHits in SoA format)?

kakwok commented 1 month ago

If I understand correctly, yes; those rechits with chi2 < 0 will be skipped in the subsequent PF reconstruction. And it seems to make sense to me to skip those rechits.

jsamudio commented 1 month ago

If those bad hits exist in the SoA that is passed to the PF RecHit producer, then I see no explicit skip over chi2 < 0 on our side; our first check is just the energy threshold.

fwyzard commented 1 month ago

As a quick workaround for the crashes, would it make sense to add back the conversion to legacy, and from legacy to SoA ?

This should filter away the bad hits and prevent the crashes, and could be implemented as a configuration-only change, while a better fix is worked on.

missirol commented 1 month ago

Maybe the question is not addressed to me, but I think it's a good idea (I would prefer to fix the release without touching the menu, but in this case this should give the correct results, and reduce pressure on different fronts...).

It's implemented in the menu below, and the latter does not crash on the reproducer of this issue.

/cdaq/test/missirol/dev/CMSSW_14_0_0/tmp/240731_cmssw45595/Test01/HLT/V2

@cms-sw/hlt-l2, if you agree, I will follow up with FOG in a CMSHLT ticket (and I will ask you there to double-check the menu).

mmusich commented 1 month ago

if you agree, I will follow up with FOG in a CMSHLT ticket

is it costless?

missirol commented 1 month ago

I will check that in parallel. :)

mmusich commented 1 month ago

I will check that in parallel. :)

OK. For posterity here's the diff of the proposed menu w.r.t. the latest online one.

mmusich commented 1 month ago

i would prefer to fix the release without touching the menu,

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..39f1948f73c 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,8 +59,17 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     const uint32_t detId = rh.detId();
     const uint32_t depth = HCAL::getDepth(detId);
     const uint32_t subdet = getSubdet(detId);
+
+    if (rh.chi2() < 0)
+      return false;
+
     if (topology.cutsFromDB()) {

this also prevents the crash in the reproducers https://github.com/cms-sw/cmssw/issues/45595#issuecomment-2259202142

Martin-Grunewald commented 1 month ago

I guess the config workaround is fine (for online/FOG) while the C++ fix should be used as soon as possible after.

fwyzard commented 1 month ago

i would prefer to fix the release without touching the menu,

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..39f1948f73c 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,8 +59,17 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     const uint32_t detId = rh.detId();
     const uint32_t depth = HCAL::getDepth(detId);
     const uint32_t subdet = getSubdet(detId);
+
+    if (rh.chi2() < 0)
+      return false;
+
     if (topology.cutsFromDB()) {

this also prevents the crash in the reproducers #45595 (comment)

To me the main question is: if the rechits with negative chi2 are invalid and should never be used by the downstream consumers, can we find a way to not produce them in the first place ?
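
Purely for the sake of discussion, the simplest shape of "not producing them in the first place" would be to drop a bad channel before it is written to the output collection, rather than publishing it flagged with chi2 < 0 (a hypothetical sketch; the real HBHE SoA producer is of course more involved):

```cpp
#include <cstdint>
#include <vector>

struct RecHit {
  uint32_t detId;
  float energy;
  float chi2;  // negative chi2 is the "bad channel" marker discussed above
};

// Reject bad channels at production time, so downstream consumers never see them.
std::vector<RecHit> buildRecHits(const std::vector<RecHit>& candidates) {
  std::vector<RecHit> out;
  out.reserve(candidates.size());
  for (const auto& rh : candidates) {
    if (rh.detId == 0 || rh.chi2 < 0)
      continue;  // drop instead of publish-and-flag
    out.push_back(rh);
  }
  return out;
}
```

On a GPU backend the equivalent is a stream compaction (e.g. an atomic counter indexing into the output SoA) rather than a push_back, which is presumably part of why the flag-and-skip convention was chosen originally.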

mmusich commented 1 month ago

OK. For posterity here's the diff of the proposed menu w.r.t. the latest online one.

FWIW, I checked that this script (using @missirol's menu) runs OK in CMSSW_14_0_12_MULTIARCHS as-is:

#!/bin/bash -ex

# CMSSW_14_0_12_MULTIARCHS

# Directory name
dir="run000000"

# Check if the directory does not exist
if [ ! -d "$dir" ]; then
    # Create the directory
    mkdir "$dir"
    echo "Directory $dir created."
else
    echo "Directory $dir already exists."
fi

hltConfigFromDB --adg \
        --configName /cdaq/test/missirol/dev/CMSSW_14_0_0/tmp/240731_cmssw45595/Test01/HLT/V2 \
        --nooutput \
        --input /store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root > hlt.py

cat <<@EOF >> hlt.py
try:
    del process.MessageLogger
    process.load('FWCore.MessageLogger.MessageLogger_cfi')
    process.MessageLogger.cerr.enableStatistics = False
except:
    pass

process.source.skipEvents = cms.untracked.uint32( 74 )
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
#process.options.accelerators = ['cpu']
#process.options.accelerators = ['gpu-nvidia']
@EOF

cmsRun hlt.py &> hlt.log

I'll proceed testing over all the available error stream files.

missirol commented 1 month ago

The ticket to patch the online HLT menus to avoid these crashes is CMSHLT-3302 (thanks Andrea for the suggestion, yesterday it did not cross my mind).

To me the main question is: if the rechits with negative chi2 are invalid and should never be used by the downstream consumers, can we find a way to not produce them in the first place ?

Since I don't know how long this could take, my 2 cents is to still include https://github.com/cms-sw/cmssw/issues/45595#issuecomment-2259775122 in a (patch) release that we can realistically deploy by next week, so we can undo CMSHLT-3302 by then.

mmusich commented 1 month ago

Since I don't know how long this could take, my 2 cents is to still include https://github.com/cms-sw/cmssw/issues/45595#issuecomment-2259775122 in a (patch) release that we can realistically deploy by next week, so we can undo CMSHLT-3302 by then.

OK, here's a draft https://github.com/cms-sw/cmssw/pull/45604 (we can close it if something better comes in the meanwhile).

jsamudio commented 1 month ago

Thanks @mmusich. IMO these protections against bad channels and invalid detIds should have been in place anyhow; the logic for the non-DB thresholds would have skipped such rechits as well. I apologize that I didn't catch it sooner.

missirol commented 1 month ago

Thanks @mmusich @jsamudio !

missirol commented 1 month ago

As a side note, for the one event that I checked amongst those causing this crash, one gets the following warning when running the legacy HBHERecHit producer (using an older HLT menu).

Begin processing the 1st record. Run 383830, Event 113486368, LumiSection 83 on stream 0 at 31-Jul-2024 19:21:24.279 CEST
%MSG-w HBHEDigi:  HBHEPhase1Reconstructor:hltHbherecoLegacy  31-Jul-2024 19:21:24 CEST Run: 383830 Event: 113486368
 bad SOI/maxTS in cell (HB -8,59,1)
 expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
 got maxTS = 8, SOI = -1
%MSG

I don't know if this is the same hit as the one leading to the crash, but, just for my understanding, could HCAL experts explain the origin and implications of this warning?

@cms-sw/hcal-dpg-l2

abdoulline commented 1 month ago

Hi @missirol (not trying to wear HCAL expert's hat)

kakwok commented 1 month ago

I printed out the input digis to the SoA converter from here and the digi data does NOT seem corrupted for that channel

Begin processing the 1st record. Run 383830, Event 113486368, LumiSection 83 on stream 0 at 31-Jul-2024 22:32:04.622 CEST
[Alpaka digi input] (HB -8,59,1) digi = DetID=(HB -8,59,1) flavor=3 8 samples
  ADC=9 TDC=3 CAPID=1
  ADC=10 TDC=3 CAPID=2
  ADC=21 TDC=3 CAPID=3
  ADC=239 TDC=1 CAPID=0
  ADC=12 TDC=3 CAPID=1
  ADC=16 TDC=3 CAPID=2
  ADC=12 TDC=3 CAPID=3
  ADC=17 TDC=3 CAPID=0

abdoulline commented 1 month ago

I printed out the input digis to the SoA converter from here and the digi data does NOT seem corrupted for that channel

The printout shows the ADC/TDC pattern, which indeed seems to be OK. It is missing the info about the SOI (= digi.presamples()). But if the SOI is incorrect (-1, as in the warning above provided by Marion), then the algo skips this hit the same way it does with bad channels listed in the DB (in HcalChannelQuality).

This protection in the legacy producer code above was introduced in 2021 in https://github.com/cms-sw/cmssw/pull/35944, when an occurrence of such a bad SOI crashed the Prompt reco...

abdoulline commented 1 month ago

Let me explicitly involve @mariadalfonso and @igv4321 in the discussion from HCAL side.

kakwok commented 1 month ago

ah, good catch! Indeed the SOI is missing in the printout, which means this line's sample.soi() check is never true: https://github.com/cms-sw/cmssw/blob/master/DataFormats/HcalDigi/interface/QIE11DataFrame.h#L44
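
For reference, a simplified model of that lookup (patterned after QIE11DataFrame::presamples(); the bit layout below is an assumption, not the real data format): the frame is scanned for the sample whose SOI flag is set, and if no sample carries the flag the scan falls through and returns -1, exactly the "SOI = -1" in the HBHEPhase1Reconstructor warning above.

```cpp
#include <array>
#include <cstdint>

struct Sample {
  uint16_t word;
  bool soi() const { return word & 0x2000; }  // assumed position of the SOI bit
};

int presamples(const std::array<Sample, 8>& frame) {
  for (int i = 0; i < static_cast<int>(frame.size()); ++i)
    if (frame[i].soi())
      return i;  // nominally 3 for HB/HE (8-TS digis with SOI = TS3)
  return -1;  // no sample flagged: corrupted digi, to be rejected by local reco
}
```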

missirol commented 1 month ago

  • looks like a rare local (RM=readout module or fiber) data corruption, as HB/HE are configured to have SOI=3 (Sample Of Interest, the trigger TS) in the HCAL HB/HE Digi array of 8TS
  • is not expected to happen
  • skipping this hit is a necessity and it's done in the HCAL local reco

Thanks for the explanations, @abdoulline .

mmusich commented 1 month ago

proposed fixes:

mmusich commented 1 month ago

+hlt

mmusich commented 3 weeks ago

@cms-sw/reconstruction-l2 @cms-sw/heterogeneous-l2 please consider signing this if there is no other follow up from your area, such that we could close this issue.

fwyzard commented 3 weeks ago

I would like to avoid producing the invalid channels at all, but since this is now tracked in https://github.com/cms-sw/cmssw/issues/45651, we can close this issue.

fwyzard commented 3 weeks ago

+heterogeneous

jfernan2 commented 1 week ago

+1

cmsbuild commented 1 week ago

This issue is fully signed and ready to be closed.

makortel commented 1 week ago

@cmsbuild, please close