vince502 closed this issue 1 week ago.
cms-bot internal usage
A new Issue was created by @vince502.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign hlt, heterogeneous, reconstruction
New categories assigned: hlt,heterogeneous,reconstruction
@Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
Investigating from PF side
@cms-sw/pf-l2 FYI
type pf
For the serial sync crash, with gdb I see:
Thread 1 "cmsRun" received signal SIGSEGV, Segmentation fault.
0x00007fff3463d43a in alpaka_serial_sync::PFRecHitProducerKernelConstruct<alpaka_serial_sync::particleFlowRecHitProducer::HCAL>::applyCuts (rh=..., params=..., topology=...)
at src/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc:63
63 threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];
And printing out the detId of the hit, I see that it is detId == 0, meaning denseId == HCAL::kInvalidDenseId, which is std::numeric_limits<uint32_t>::max(), thus giving the segfault here.
I will check with the method that produces the CUDA errors on GPU, but generally I would suspect the same thing if there is a rechit with detId == 0.
@kakwok FYI
also @abdoulline @cms-sw/hcal-dpg-l2
I will check with the method that produces the CUDA errors on GPU, but generally I would suspect the same thing if there is a rechit with detId == 0.
FWIW I confirm that with this:
diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..cd7f215abf1 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,6 +59,12 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
const uint32_t detId = rh.detId();
const uint32_t depth = HCAL::getDepth(detId);
const uint32_t subdet = getSubdet(detId);
+
+ if (detId == 0) {
+ printf("Rechit with detId %u has subdetector %u and depth %u ! \n", detId, subdet, depth);
+ return false;
+ }
+
if (topology.cutsFromDB()) {
threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];
} else {
the reproducer runs to completion both forcing the backend to be serial [1] or gpu [2]
A slightly more elegant version is:
diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..576342bc16a 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,8 +59,14 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
const uint32_t detId = rh.detId();
const uint32_t depth = HCAL::getDepth(detId);
const uint32_t subdet = getSubdet(detId);
+
if (topology.cutsFromDB()) {
- threshold = topology.noiseThreshold()[HCAL::detId2denseId(detId)];
+ const auto& denseId = HCAL::detId2denseId(detId);
+ if (denseId != HCAL::kInvalidDenseId) {
+ threshold = topology.noiseThreshold()[denseId];
+ } else {
+ return false;
+ }
} else {
if (subdet == HcalBarrel) {
threshold = params.energyThresholds()[depth - 1];
Of course, the origin of such rechits associated to detId = 0 remains to be understood. By the way, I guess it would help if we emitted different printouts here
vs here:
@kakwok @jsamudio
One question (maybe just for my own understanding).
In HcalRecHitSoAToLegacy, hits in SoA format corresponding to "bad channels" (chi2 < 0) are skipped (not converted to legacy).
Should they also be skipped (and if so, are they) in the PFRecHit+Cluster reconstruction in Alpaka (which starts from the HBHE RecHits in SoA format)?
If I understand correctly, yes: those rechits with chi2 < 0 will be skipped in the subsequent PF reconstruction. And it seems to make sense to me to skip those rechits.
If those bad hits exist in the SoA that is passed to PFRecHit, then I see no explicit skip over chi2 < 0 on our side, and our first check is just the energy threshold.
As a quick workaround for the crashes, would it make sense to add back the conversion to legacy, and from legacy to SoA ?
This should filter away the bad hits and prevent the crashes, and could be implemented as a configuration-only change, while a better fix is worked on.
Maybe the question is not to me, but I think it's a good idea (I would prefer to fix the release without touching the menu, but in this case this should give the correct results and reduce pressure on different fronts..).
It's implemented in the menu below, and the latter does not crash on the reproducer of this issue.
/cdaq/test/missirol/dev/CMSSW_14_0_0/tmp/240731_cmssw45595/Test01/HLT/V2
@cms-sw/hlt-l2, if you agree, I will follow up with FOG in a CMSHLT ticket (and I will ask you there to double-check the menu).
if you agree, I will follow up with FOG in a CMSHLT ticket
is it costless?
I will check that in parallel. :)
I will check that in parallel. :)
OK. For posterity here's the diff of the proposed menu w.r.t. the latest online one.
i would prefer to fix the release without touching the menu,
diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
index 40d7b2315c8..39f1948f73c 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
@@ -59,8 +59,17 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
const uint32_t detId = rh.detId();
const uint32_t depth = HCAL::getDepth(detId);
const uint32_t subdet = getSubdet(detId);
+
+ if (rh.chi2() < 0)
+ return false;
+
if (topology.cutsFromDB()) {
this also prevents the crash in the reproducers https://github.com/cms-sw/cmssw/issues/45595#issuecomment-2259202142
I guess the config workaround is fine (for online/FOG) while the C++ fix should be used as soon as possible after.
To me the main question is: if the rechits with negative chi2 are invalid and should never be used by the downstream consumers, can we find a way to not produce them in the first place ?
OK. For posterity here's the diff of the proposed menu w.r.t. the latest online one.
FWIW, I checked that this script (using @missirol's menu) runs OK in CMSSW_14_0_12_MULTIARCHS (tout court):
#!/bin/bash -ex
# CMSSW_14_0_12_MULTIARCHS
# Directory name
dir="run000000"
# Check if the directory does not exist
if [ ! -d "$dir" ]; then
# Create the directory
mkdir "$dir"
echo "Directory $dir created."
else
echo "Directory $dir already exists."
fi
hltConfigFromDB --adg \
--configName /cdaq/test/missirol/dev/CMSSW_14_0_0/tmp/240731_cmssw45595/Test01/HLT/V2 \
--nooutput \
--input /store/group/tsg/FOG/error_stream_root/run383830/run383830_ls0083_index000316_fu-c2b01-26-01_pid4060272.root > hlt.py
cat <<@EOF >> hlt.py
try:
del process.MessageLogger
process.load('FWCore.MessageLogger.MessageLogger_cfi')
process.MessageLogger.cerr.enableStatistics = False
except:
pass
process.source.skipEvents = cms.untracked.uint32( 74 )
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
#process.options.accelerators = ['cpu']
#process.options.accelerators = ['gpu-nvidia']
@EOF
cmsRun hlt.py &> hlt.log
I'll proceed testing over all the available error stream files.
The ticket to patch the online HLT menus to avoid these crashes is CMSHLT-3302 (thanks Andrea for the suggestion, yesterday it did not cross my mind).
To me the main question is: if the rechits with negative chi2 are invalid and should never be used by the downstream consumers, can we find a way to not produce them in the first place ?
Since I don't know how long this could take, my 2 cents is to still include https://github.com/cms-sw/cmssw/issues/45595#issuecomment-2259775122 in a (patch) release that we can realistically deploy by next week, so we can undo CMSHLT-3302 by then.
Since I don't know how long this could take, my 2 cents is to still include https://github.com/cms-sw/cmssw/issues/45595#issuecomment-2259775122 in a (patch) release that we can realistically deploy by next week, so we can undo CMSHLT-3302 by then.
OK, here's a draft https://github.com/cms-sw/cmssw/pull/45604 (we can close it if something better comes in the meanwhile).
Thanks @mmusich, IMO these protections against bad channels and invalid detId should have been in place anyhow. The logic for the non-DB thresholds would have skipped such rechits as well. I apologize that I didn't catch it sooner.
Thanks @mmusich @jsamudio !
As a side note, for the one event that I checked amongst those causing this crash, one gets the following warning when running the legacy HBHERecHit producer (using an older HLT menu).
Begin processing the 1st record. Run 383830, Event 113486368, LumiSection 83 on stream 0 at 31-Jul-2024 19:21:24.279 CEST
%MSG-w HBHEDigi: HBHEPhase1Reconstructor:hltHbherecoLegacy 31-Jul-2024 19:21:24 CEST Run: 383830 Event: 113486368
bad SOI/maxTS in cell (HB -8,59,1)
expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
got maxTS = 8, SOI = -1
%MSG
I don't know if this is the same hit as the one leading to the crash, but, just for my understanding, could HCAL experts explain
@cms-sw/hcal-dpg-l2
Hi @missirol (not trying to wear HCAL expert's hat)
I printed out the input digis to the SoA converter from here and the digi data does NOT seem corrupted for that channel
Begin processing the 1st record. Run 383830, Event 113486368, LumiSection 83 on stream 0 at 31-Jul-2024 22:32:04.622 CEST
[Alpaka digi input] (HB -8,59,1) digi = DetID=(HB -8,59,1) flavor=3 8 samples
ADC=9 TDC=3 CAPID=1
ADC=10 TDC=3 CAPID=2
ADC=21 TDC=3 CAPID=3
ADC=239 TDC=1 CAPID=0
ADC=12 TDC=3 CAPID=1
ADC=16 TDC=3 CAPID=2
ADC=12 TDC=3 CAPID=3
ADC=17 TDC=3 CAPID=0
I printed out the input digis to the SoA converter from here and the digi data does NOT seem corrupted for that channel
The printout shows the ADC/TDC pattern, which seems to be OK, indeed. It misses info about the SOI (= digi.presamples()). But if the SOI is incorrect (-1, as in the warning above, provided by Marino), then the algo skips this hit the same way it does with bad channels listed in the DB (in HcalChannelQuality).
This protection in the legacy producer code above was introduced in 2021 (https://github.com/cms-sw/cmssw/pull/35944), when there was an occurrence of a bad SOI which crashed the Prompt reco...
Let me explicitly involve @mariadalfonso and @igv4321 in the discussion from HCAL side.
Ah, good catch! Indeed the SOI is missing in the printout, which means this line (sample.soi()) is never true:
https://github.com/cms-sw/cmssw/blob/master/DataFormats/HcalDigi/interface/QIE11DataFrame.h#L44
- looks like a rare local (RM=readout module or fiber) data corruption, as HB/HE are configured to have SOI=3 (Sample Of Interest, the trigger TS) in the HCAL HB/HE Digi array of 8TS
- is not expected to happen
- skipping this hit is a necessity and it's done in the HCAL local reco
Thanks for the explanations, @abdoulline .
proposed fixes: CMSSW_14_0_13_patch1

+hlt

CMSSW_14_0_13_patch1 (containing the fix) was deployed online on Aug 5th, 2024, together with new HLT menus (reverting the workaround CMSHLT-3302; collisions menu version v1.4.3 and circulating, cosmics v1.5.2), see the HLT report in the daily run meeting report of Aug 6th, 2024. No further crashes related to this issue have been observed.

@cms-sw/reconstruction-l2 @cms-sw/heterogeneous-l2 please consider signing this if there is no other follow-up from your area, so that we can close this issue.
I would like to avoid producing the invalid channels at all, but since this is now tracked in https://github.com/cms-sw/cmssw/issues/45651, we can close this issue.
+heterogeneous
+1
This issue is fully signed and ready to be closed.
@cmsbuild, please close
During the recent pp physics fills 9945 and 9947, we deployed the new HLT menu /cdaq/physics/Run2024/2e34/v1.4.1/HLT/V2, and we started to see crashes in processes during the run (elog). The crashes come from an illegal memory access.
In particular, we quote the errors from run 383830 in this issue.
So far on the offline side crashes were observed using the same error stream files.
From f3mon, crashes in processes show up like this.

Settings to reproduce crashes (not necessarily the same exact problem) in the streamers:
In the hlt.py I have tried several settings to reproduce the same CUDA memory access error. This will result in a crash by
Module: alpaka_serial_sync::PFRecHitSoAProducerHCAL:hltParticleFlowRecHitHBHESoASerialSync (crashed)
Output: log_method1.txt

AND removed the following paths from the process.schedule = cms.Schedule( ... ) block in hlt.py
and run
cmsRun hlt.py 2>&1 | tee logForce_method2.txt
Output: logForce_method2.txt