cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

HLT crashes in Run 382461 #45312

Open trtomei opened 1 week ago

trtomei commented 1 week ago

Crashes observed in collisions Run 382461. Error message:

----- Begin Fatal Exception 26-Jun-2024 14:33:41 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 382461 lumi: 2 event: 4698821 stream: 0
   [1] Running path 'DQM_EcalReconstruction_v10'
   [2] Calling method for module EcalUncalibRecHitProducerPortable@alpaka/'hltEcalUncalibRecHitSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_9_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_9_MULTIARCHS-b\
uild/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/kernel/TaskKernelGpuUni\
formCudaHipRt.hpp(259) 'TApi::setDevice(queue.m_spQueueImpl->m_dev.getNativeHandle())' A previous API call (not th\
is one) set the error  : 'cudaErrorInvalidConfiguration': 'invalid configuration argument'!
----- End Fatal Exception -------------------------------------------------

Reproducer:

#!/bin/bash -ex

# CMSSW_14_0_9_patch1_MULTIARCHS

hltGetConfiguration run:382461 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input \
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000928.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000929.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000930.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000931.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000932.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000933.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000934.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000935.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000936.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000937.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000938.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000939.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000940.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000941.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000942.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000943.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000944.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000945.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000946.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000947.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000948.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000949.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000950.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000951.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000952.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000953.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000954.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000955.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000956.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000957.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000958.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000959.root > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Note that this run had no ECAL barrel, only part of the endcap. @fwyzard has pointed out that this is probably related: the protection we implemented for empty ECAL events checks the total size, but one kernel is barrel-only.
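A minimal sketch of the suspected failure mode, in plain Python rather than alpaka/CUDA. All function names, kernel names, and the block size of 32 are hypothetical and are not taken from the ECAL code or from the linked PRs. It assumes only that a launch with a zero-sized grid is rejected, which is what CUDA reports as 'invalid configuration argument'.

```python
def blocks_for(n_items, threads_per_block=32):
    """Number of blocks needed to cover n_items (integer ceiling division)."""
    return (n_items + threads_per_block - 1) // threads_per_block


def launch(name, n_items, threads_per_block=32):
    """Stand-in for a kernel launch: rejects a zero-sized grid."""
    blocks = blocks_for(n_items, threads_per_block)
    if blocks == 0:
        # Models cudaErrorInvalidConfiguration for a grid with 0 blocks.
        raise ValueError(f"{name}: invalid configuration argument (0 blocks)")
    return blocks


def run_with_total_size_guard(n_barrel, n_endcap):
    """Guard on the total size only: the barrel-only launch is unprotected."""
    launched = []
    if n_barrel + n_endcap == 0:
        return launched  # fully empty event, nothing to do
    # Hypothetical barrel-only kernel: fails when n_barrel == 0.
    launched.append(("barrel_only", launch("barrel_only", n_barrel)))
    launched.append(("all_channels", launch("all_channels", n_barrel + n_endcap)))
    return launched


def run_with_per_kernel_guard(n_barrel, n_endcap):
    """Guard each launch on the size that kernel actually iterates over."""
    launched = []
    if n_barrel > 0:
        launched.append(("barrel_only", launch("barrel_only", n_barrel)))
    if n_barrel + n_endcap > 0:
        launched.append(("all_channels", launch("all_channels", n_barrel + n_endcap)))
    return launched
```

With an endcap-only event, e.g. `run_with_total_size_guard(0, 100)`, the total-size check passes and the barrel-only launch raises, while `run_with_per_kernel_guard(0, 100)` simply skips that launch.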

Best regards, Thiago (for FOG)

cmsbuild commented 1 week ago

A new Issue was created by @trtomei.

@Dr15Jones, @antoniovilela, @makortel, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 1 week ago

assign hlt, reconstruction, heterogeneous

mmusich commented 1 week ago

type ecal

mmusich commented 1 week ago

@cms-sw/ecal-dpg-l2 FYI

cmsbuild commented 1 week ago

New categories assigned: hlt,reconstruction,heterogeneous

@Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

fwyzard commented 1 week ago

Should be fixed by #45311 (14.1.x) / #45313 (14.0.x) / #45314 (14.0.9-patchX).

mmusich commented 1 week ago

FWIW, I confirm that this recipe:

cmsrel CMSSW_14_0_9_patch1_MULTIARCHS
cd CMSSW_14_0_9_patch1_MULTIARCHS/src/
git cms-init
cmsenv
git cms-addpkg RecoLocalCalo/EcalRecProducers
git remote add fwyzard git@github.com:fwyzard/cmssw.git; git fetch fwyzard
git cherry-pick d0f844fb548ac5bd7f8ee6b5daa6476809cb4033
scram b -j 20

tested with the reproducer at https://github.com/cms-sw/cmssw/issues/45312#issue-2375234733, leads to no crashes.
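For reference, one way to check a log such as the reproducer's hlt.log for crashes of this kind is sketched below. It is a hypothetical helper, not part of CMSSW, and it assumes the log uses the same "Begin Fatal Exception" / "End Fatal Exception" markers and "Calling method for module" line as the error message quoted in this issue.

```python
import re

# Patterns based on the fatal-exception block quoted in this issue.
BEGIN = re.compile(r"^-+ Begin Fatal Exception")
MODULE = re.compile(r"Calling method for module (\S+)")


def fatal_exceptions(log_text):
    """Return one entry per fatal-exception block: the module name, or None."""
    found = []
    in_block = False
    for line in log_text.splitlines():
        if BEGIN.match(line):
            in_block = True
            found.append(None)  # filled in if a module line follows
        elif "End Fatal Exception" in line:
            in_block = False
        elif in_block:
            match = MODULE.search(line)
            if match:
                found[-1] = match.group(1)
    return found
```

Calling `fatal_exceptions(open("hlt.log").read())` would return an empty list for a crash-free run, under the assumptions above.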

mmusich commented 5 days ago

All proposed solutions have been merged.

mmusich commented 5 days ago

+hlt

jfernan2 commented 1 day ago

+1