cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

HLT crash in run-367906 (`sistrip::FEDBuffer::findChannels()`) #41786

Open missirol opened 1 year ago

missirol commented 1 year ago

In run-367906 (pp collisions), DAQ reported 1 CMSSW crash at HLT (release: CMSSW_13_0_6) [link to HLT elog].

The stack trace is attached (f3mon_run367906.txt). The part of the stack trace that is possibly relevant is in [1].

The corresponding error-stream files are available, but first attempts to reproduce the crashes offline failed (tried on "Hilton" HLT node).

The recipe used for those failed attempts is adapted in [2] to be valid for lxplus and lxplus-gpu.

FYI: @cms-sw/hlt-l2 @silviodonato @fwyzard @mzarucki @trtomei

[1]

msgtime:2023-05-24 22:37:12
doc_type:cmsswlog
date:2023-05-24T20:37:12.106Z
run:367906
host:fu-c2b03-18-01
pid:2793118
doctype:stacktrace
severity:FATAL
severityVal:4
instance:global
lexicalId:549852445
message:A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
Wed May 24 22:36:52 CEST 2023

(..)

Thread 6 (Thread 0x7fe97ea4f700 (LWP 2794125) "cmsRun"):
#0  0x00007fe9f3d60a71 in poll () from /lib64/libc.so.6
#1  0x00007fe9eac9846f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2  0x00007fe9eac63b6c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  0x00007fe9eac6433b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fe990ee5092 in sistrip::FEDBuffer::findChannels() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libEventFilterSiStripRawToDigi.so
#6  0x00007fe990f5a21e in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripClusterizerPlugins.so
#7  0x00007fe9940a04bd in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8  0x00007fe9940a08a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#9  0x00007fe9940a30f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x00007fe99400e347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#11 0x00007fe8f21a01b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007fe8f219338d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#13 0x00007fe8f2196846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#14 0x00007fe8f2150263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#15 0x00007fe8f2151ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00007fe9f67ad95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#17 0x00007fe9f6794072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#18 0x00007fe9f67206da in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#19 0x00007fe9f6720b88 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#20 0x00007fe9f6475f79 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreConcurrency.so
#21 0x00007fe9f4ef2304 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7fe82e94ab00, waiter=..., this=0x7fe9efd53780) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#22 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7fe9efd53780) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#23 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/arena.cpp:137
#24 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/market.cpp:599
#25 0x00007fe9f4ef44c6 in tbb::detail::r1::rml::private_worker::run (this=0x7fe9efd30100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#26 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fe9efd30100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#27 0x00007fe9f403e17a in start_thread () from /lib64/libpthread.so.0
#28 0x00007fe9f3d6bdf3 in clone () from /lib64/libc.so.6

(..)

Current Modules:
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates (crashed)
Module: CkfTrackCandidateMaker:hltMuCkfTrackCandidates
Module: PFBlockProducer:hltParticleFlowBlockForDisplTaus
Module: PFBlockProducer:hltParticleFlowBlock
Module: CkfTrackCandidateMaker:hltIter0IterL3FromL1MuonCkfTrackCandidates
Module: PFClusterProducer:hltParticleFlowClusterHBHE
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates
Module: HcalDigisProducerGPU:hltHcalDigisGPU
Module: none
Module: BeamSpotToCUDA:hltOnlineBeamSpotToGPU
Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates
Module: none
Module: PFMultiDepthClusterProducer:hltParticleFlowClusterHCAL
Module: none
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates
Module: HcalCPURecHitsProducer:hltHbherecoFromGPU
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: PFRecHitProducer:hltParticleFlowRecHitPSUnseeded
Module: PixelTrackProducerFromSoAPhase1:hltPixelTracks
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: none
Module: none
Module: SiPixelRecHitCUDAPhase1:hltSiPixelRecHitsGPU
Module: SiPixelRecHitFromCUDAPhase1:hltSiPixelRecHitsFromGPU
Module: HBHERecHitProducerGPU:hltHbherecoGPU
Module: EcalUncalibRecHitProducerGPU:hltEcalUncalibRecHitGPU
Module: FastjetJetProducer:hltAK4CaloJets
Module: CAHitNtupletCUDAPhase1:hltPixelTracksGPU
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx
Module: SiPixelDigisSoAFromCUDA:hltSiPixelDigisSoA
Module: PFBlockProducer:hltParticleFlowBlockCPUOnly
A fatal system signal has occurred: segmentation violation

[2]

#!/bin/bash

# cmsrel CMSSW_13_0_6
# cd CMSSW_13_0_6/src
# cmsenv
# # save this file as test.sh
# chmod u+x test.sh
# ./test.sh 367906 4 # runNumber nThreads

[ $# -eq 2 ] || exit 1

RUNNUM="${1}"
NUMTHREADS="${2}"

ERRDIR=/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream
RUNDIR="${ERRDIR}"/run"${RUNNUM}"

for dirPath in $(ls -d "${RUNDIR}"*); do
  # require at least one non-empty FRD file
  [ $(cd "${dirPath}" ; find -maxdepth 1 -size +0 | grep .raw | wc -l) -gt 0 ] || continue
  runNumber="${dirPath: -6}"
  JOBTAG=test_run"${runNumber}"
  HLTMENU="--runNumber ${runNumber}"
  hltConfigFromDB ${HLTMENU} > "${JOBTAG}".py
  cat <<EOF >> "${JOBTAG}".py
process.options.numberOfThreads = ${NUMTHREADS}
process.options.numberOfStreams = 0
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)
del process.PrescaleService
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
import os
import glob
process.source.fileListMode = True
process.source.fileNames = sorted([foo for foo in glob.glob("${dirPath}/*raw") if os.path.getsize(foo) > 0])
process.EvFDaqDirector.buBaseDir = "${ERRDIR}"
process.EvFDaqDirector.runNumber = ${runNumber}
process.hltDQMFileSaverPB.runNumber = ${runNumber}
# remove paths containing OutputModules
streamPaths = [pathName for pathName in process.finalpaths_()]
for foo in streamPaths:
    process.__delattr__(foo)
EOF
  rm -rf run"${runNumber}"
  mkdir run"${runNumber}"
  echo "run${runNumber} .."
  cmsRun "${JOBTAG}".py &> "${JOBTAG}".log
  echo "run${runNumber} .. done (exit code: $?)"
  unset runNumber
done
unset dirPath

cmsbuild commented 1 year ago

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

missirol commented 1 year ago

assign hlt

(I let others assign to other groups, if needed.)

cmsbuild commented 1 year ago

New categories assigned: hlt

@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

missirol commented 1 year ago

The corresponding error-stream files are available, but first attempts to reproduce the crashes offline failed (tried on Hilton machine).

This is another instance of recent HLT crashes that I can't reproduce offline (see for example #40174, #41741 and #41742).

This time I can also include the full log of the CMSSW job that crashed (see [1]), but I don't know if that helps.

@smorovic, is it possible to draw any conclusions by comparing the log of the CMSSW job [1] with the content of the error-stream files [2]?

[1] old_hlt_run367906_pid2793118.log

[2] /eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run367906/

smorovic commented 1 year ago

Event IDs in two raw files:

run367906_ls0056_index000213_fu-c2b03-18-01_pid2793118.raw
128082587 - 128091658

run367906_ls0056_index000236_fu-c2b03-18-01_pid2793118.raw
128183442 - 128186805

The last message in the log is from an earlier event (in a previous file):

%MSG-e TrajectoryNotPosDef:   TrackProducer:hltL3NoFiltersTkTracksFromL2IOHitNoVtx 24-May-2023 22:36:51 CEST  Run: 367906 Event:  127979616
Trajectory covariance is not positive-definite
%MSG

Timestamps of the last few files (last 3) appearing locally at hltd for that process:

INFO:2023-05-24 22:36:49 - processIndexFile - RUN:367906 - run367906_ls0056_index000189_pid2793118.jsn

INFO:2023-05-24 22:36:51 - processIndexFile - RUN:367906 - run367906_ls0056_index000213_pid2793118.jsn
INFO:2023-05-24 22:36:52 - processIndexFile - RUN:367906 - run367906_ls0056_index000236_pid2793118.jsn
INFO:2023-05-24 22:37:04 - processCRASHfile - RUN:367906 - 'run367906_ls0000_crash_pid2793118.jsn' with errcode: -11
INFO:2023-05-24 22:37:04 - processCRASHFile - RUN:367906 - inputFileList: run367906_ls0056_index000213_fu-c2b03-18-01_pid2793118.raw,run367906_ls0056_index000236_fu-c2b03-18-01_pid2793118.raw

However, this looks OK. The last two files opened by the process were also saved; older ones had already been handled and closed. The source keeps up to 2 files open and buffered at a time.

For a crash, the event ID is not available (it is only known for an Exception).
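As a quick cross-check of the numbers above, the following minimal Python sketch tests whether the last logged event falls inside either saved raw file. The helper name `find_file` and the dictionary layout are illustrative only (they are not part of any CMS tool), and the sketch assumes the quoted IDs are the inclusive lowest and highest event numbers of each file.

```python
# Event-ID ranges quoted above for the two saved raw files (assumed inclusive).
RAW_FILE_RANGES = {
    "run367906_ls0056_index000213_fu-c2b03-18-01_pid2793118.raw": (128082587, 128091658),
    "run367906_ls0056_index000236_fu-c2b03-18-01_pid2793118.raw": (128183442, 128186805),
}


def find_file(event_id, ranges=RAW_FILE_RANGES):
    """Return the raw file whose ID range contains event_id, or None if there is none."""
    for name, (first, last) in ranges.items():
        if first <= event_id <= last:
            return name
    return None


# Event number taken from the TrajectoryNotPosDef message above.
LAST_LOGGED_EVENT = 127979616

print(find_file(LAST_LOGGED_EVENT))  # prints None
```

The result is `None`, consistent with the statement that the last log message belongs to an event from an earlier file, so it does not identify the crashing event.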

makortel commented 1 year ago

assign reconstruction

FYI @cms-sw/tracking-pog-l2

cmsbuild commented 1 year ago

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

Possibly incidental, but two other threads were in StMeasurementDetSet::getDetSet(int) at the time of the crash:

Thread 36 (Thread 0x7fe8a65ff700 (LWP 2794392) "cmsRun"):
#2  0x00007fe9eac60ed0 in sig_pause_for_stacktrace () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fe990f58e90 in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripClusterizerPlugins.so
#5  0x00007fe9940a04bd in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#6  0x00007fe9940a08a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#7  0x00007fe9940a30f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8  0x00007fe99400e347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#9  0x00007fe8f21a01b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#10 0x00007fe8f219338d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#11 0x00007fe8f2196846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007fe8f2150263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#13 0x00007fe8f2151ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#14 0x00007fe9f67ad95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#15 0x00007fe9f6794072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 21 (Thread 0x7fe91dbfe700 (LWP 2794140) "cmsRun"):
#2  0x00007fe9eac60ed0 in sig_pause_for_stacktrace () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fe9940a0480 in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#5  0x00007fe9940a08a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#6  0x00007fe9940a30f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#7  0x00007fe99400e347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#8  0x00007fe8f21a01b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#9  0x00007fe8f219338d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#10 0x00007fe8f2196846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#11 0x00007fe8f2150263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#12 0x00007fe8f2151ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#13 0x00007fe9f67ad95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#14 0x00007fe9f6794072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so

makortel commented 1 year ago

So threads 36 and 6 (the crashing one) are at the same point in StMeasurementDetSet::getDetSet() (code address 0x00007fe9940a04bd) and are presumably operating on the same StMeasurementDetSet object. The code of StMeasurementDetSet::detSet() and StMeasurementDetSet::getDetSet() is technically not thread safe:
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h#L207-L211
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h#L230-L241
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h#L250-L256

I'm assuming that detIndex_ does not change during event processing, but the elements of empty_ and ready_ are accessed and modified without any protection.

On a cursory look, edmNew::DetSet<SiStripCluster>::set() (called on line 232 above) looks like it would be thread safe. Both threads end up calling ClusterFiller::fill(), but possibly for different values of i.

Another possible thread-safety problem is in edmNew::DetSetVector<T>::update()
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/DataFormats/Common/interface/DetSetVectorNew.h#L634-L649
Here the m_getter is defined as
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/DataFormats/Common/interface/DetSetVectorNew.h#L88
but in practice it is used as a pointer to Getter, which is defined as
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/DataFormats/Common/interface/DetSetVectorNew.h#L164
and the LazyGetter<T>::fill() is not defined as const!
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/DataFormats/Common/interface/DetSetVectorNew.h#L608-L614
So if the concrete LazyGetter<T>::fill() is not thread-safe, it could cause problems. In this case the concrete LazyGetter<T> is ClusterFiller
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/RecoLocalTracker/SiStripClusterizer/plugins/ClustersFromRawProducer.cc#L327
(which I haven't digested yet).

Note that despite all I wrote above, I can't tell from the stack trace whether the problem is really in thread safety or something else.
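To make the check-then-fill concern concrete, here is a minimal Python analogy. It is not the CMSSW C++ code: the class `LazyDetSets`, its members, and `run_demo` are hypothetical names that only loosely mirror the ready_ flags discussed above. It sketches an on-demand container whose elements are filled on first access, with a per-element lock so that concurrent callers trigger the fill once per element.

```python
import threading


class LazyDetSets:
    """Toy on-demand container: element i is filled the first time it is requested."""

    def __init__(self, n, filler):
        self._filler = filler            # hypothetical callback producing element i
        self._ready = [False] * n        # loosely mirrors per-element "ready" flags
        self._data = [None] * n
        self._locks = [threading.Lock() for _ in range(n)]
        self.fill_calls = [0] * n        # bookkeeping for this demonstration only

    def get(self, i):
        # Without the lock, two threads could both observe _ready[i] == False
        # and both run the filler (the unprotected check-then-act pattern).
        with self._locks[i]:
            if not self._ready[i]:
                self._data[i] = self._filler(i)
                self.fill_calls[i] += 1
                self._ready[i] = True
        return self._data[i]


def run_demo(n_elements=4, n_threads=8):
    """Access every element from several threads; return the fill count per element."""
    sets = LazyDetSets(n_elements, filler=lambda i: [i] * 3)

    def worker():
        for i in range(n_elements):
            sets.get(i)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sets.fill_calls


print(run_demo())  # prints [1, 1, 1, 1]
```

With the lock in place each element is filled exactly once regardless of how many threads request it. Whether the crash discussed here is caused by the absence of such protection is, as said above, not something the stack trace can confirm.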

makortel commented 1 year ago

> the LazyGetter<T>::fill() is not defined as const!

This part is now addressed in https://github.com/cms-sw/cmssw/pull/41853 . It helped me reach the conclusion that ClusterFiller (https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/RecoLocalTracker/SiStripClusterizer/plugins/ClustersFromRawProducer.cc#L327) looks like it would be thread safe.

makortel commented 1 year ago

> The code of StMeasurementDetSet::detSet() and StMeasurementDetSet::getDetSet() is technically not thread safe

The race condition mentioned above is fixed in https://github.com/cms-sw/cmssw/pull/41872. I'm not convinced, though, that it is the full cause of the crash. Ideally the race condition would only cause edmNew::DetSet<SiStripCluster>::set() to be called more often than needed, but strictly speaking a race condition leads to undefined behavior, so who knows.

missirol commented 1 year ago

Thanks for the suggested fix, @makortel !

makortel commented 1 year ago

> Thanks for the suggested fix

@missirol Do you want it backported to 13_0_X? (I ask since it is unclear whether it plays a role in the crash.)

missirol commented 1 year ago

If it's clear that it is a fix (even partial), I would be in favor of backporting it, since we will still use 13_0_X online for a while. If it helps, I can prepare the backports.

makortel commented 1 year ago

> If it's clear that it is a fix (even partial), I would be in favor of backporting it, since we will still use 13_0_X online for a while.

Thanks, I'll prepare the backports after the review of https://github.com/cms-sw/cmssw/pull/41872 completes (in the current form it is easily cherry-pickable).

dan131riley commented 1 year ago

As long as we're looking at DetSetNew: we're getting DetSetNew assertion failures on aarch64 with some frequency

/data/cmsbld/jenkins_b/workspace/build-any-ib/w/tmp/BUILDROOT/95e24eec79ed42decc0c70dcac7a0f7d/opt/cmssw/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/src/DataFormats/Common/interface/DetSetNew.h:86: const data_type* edmNew::DetSet<T>::data() const [with T = SiStripCluster; edmNew::DetSet<T>::data_type = SiStripCluster]: Assertion `m_data' failed.

from here: https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/DataFormats/Common/interface/DetSetNew.h#L84-L88

The test at line 85 looks to be wrong--using a bitwise OR instead of logical, and m_offset is initialized to -1. There's probably also a race condition, but I haven't stared at it long enough yet.
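To illustrate why such a test can misfire, here is a small Python sketch. It assumes (this is not copied from DetSetNew.h) that the guard has the form `m_offset | m_size` and that an unfilled set carries m_offset = -1 and m_size = 0; the function names are hypothetical.

```python
UNSET_OFFSET = -1  # sentinel value mentioned above for a set that was never filled


def guard_bitwise(offset, size):
    """Assumed form of the current test: a nonzero result means 'data is expected'."""
    return (offset | size) != 0


def guard_logical(offset, size):
    """Same test with a logical OR instead of the bitwise one."""
    return bool(offset or size)


def guard_sentinel(offset):
    """One possible alternative: compare the offset against the -1 sentinel."""
    return offset != UNSET_OFFSET


# Unfilled set: offset is the sentinel, size is zero.
print(guard_bitwise(UNSET_OFFSET, 0))  # True: data would be expected although nothing was filled
print(guard_logical(UNSET_OFFSET, 0))  # True: switching to a logical OR alone does not change this
print(guard_sentinel(UNSET_OFFSET))    # False: the set is recognised as unfilled
```

Under these assumptions, a set that was never filled still passes the guard because -1 is nonzero, so the assertion on the data pointer would be evaluated for it.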

Stack trace:

Thread 3 (Thread 0x400086359260 (LWP 2601823) "cmsRun"):
#8  0x00004000385bfc18 in __assert_fail () from /lib64/libc.so.6
#9  0x0000400063adda58 in edmNew::DetSet<SiStripCluster>::data() const [clone .part.0] [clone .lto_priv.0] () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x0000400063ae790c in TkStripMeasurementDet::recHits(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, std::vector<float, std::allocator<float> >&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#11 0x0000400063ae7c38 in TkStripMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#12 0x0000400063b8723c in LayerMeasurements::measurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-05-28-2300/lib/el8_aarch64_gcc11/libTrackingToolsMeasurementDet.so
#13 0x00004000a5fa57bc in MuonCkfTrajectoryBuilder::collectMeasurement(DetLayer const*, std::vector<DetLayer const*, std::allocator<DetLayer const*> > const&, TrajectoryStateOnSurface const&, std::vector<TrajectoryMeasurement, std::allocator<TrajectoryMeasurement> >&, int&, Propagator const*) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoMuonL3TrackFinder.so
#14 0x00004000a5fa743c in MuonCkfTrajectoryBuilder::findCompatibleMeasurements(TrajectorySeed const&, TempTrajectory const&, std::vector<TrajectoryMeasurement, std::allocator<TrajectoryMeasurement> >&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoMuonL3TrackFinder.so
#15 0x00004000a5f380cc in CkfTrajectoryBuilder::limitedCandidates(std::shared_ptr<TrajectorySeed const> const&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<Trajectory, std::allocator<Trajectory> >&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00004000a5f392a8 in CkfTrajectoryBuilder::limitedCandidates(TrajectorySeed const&, TempTrajectory&, std::vector<Trajectory, std::allocator<Trajectory> >&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoTrackerCkfPattern.so
#17 0x00004000a5f394dc in CkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoTrackerCkfPattern.so
#18 0x00004000a5f2d7fc in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoTrackerCkfPattern.so
#19 0x00004000a5f2edc4 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoTrackerCkfPattern.so

makortel commented 1 year ago

> The test at line 85 looks to be wrong--using a bitwise OR instead of logical, and m_offset is initialized to -1

I agree (especially that the m_offset check should be against -1). Could you make a PR?

> There's probably also a race condition

At least the code has
https://github.com/cms-sw/cmssw/blob/8617c803c50fc6d37fc77b50282db56e2ce1db1d/RecoTracker/MeasurementDet/plugins/TkStripMeasurementDet.cc#L39-L43
https://github.com/cms-sw/cmssw/blob/8617c803c50fc6d37fc77b50282db56e2ce1db1d/RecoTracker/MeasurementDet/plugins/TkStripMeasurementDet.h#L94
which ends up calling
https://github.com/cms-sw/cmssw/blob/8617c803c50fc6d37fc77b50282db56e2ce1db1d/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h#L178
which is part of the race condition I'm trying to fix in https://github.com/cms-sw/cmssw/pull/41872 (assuming the stack trace is from an HLT job that does the on-demand strip unpacking and clustering; if not, the cause is likely something else).

missirol commented 1 year ago

> (assuming the stack trace is from an HLT job that does the on-demand strip unpacking and clustering

I think this is the case, as the config had

process.hltSiStripRawToClustersFacility = cms.EDProducer( "SiStripClusterizerFromRaw",
    onDemand = cms.bool( True ),
[..]
makortel commented 1 year ago

(assuming the stack trace is from an HLT job that does the on-demand strip unpacking and clustering

I think this is the case, as the config had

I meant Dan's stack trace on the assertion failure on aarch64 (sorry for being unclear).

makortel commented 1 year ago

If it's clear that it is a fix (even partial), I would be in favor of backporting it, since we will still use 13_0_X online for a while.

Thanks, I'll prepare the backports after the review of #41872 completes (in the current form it is easily cherry-pickable).

The backports are in https://github.com/cms-sw/cmssw/pull/41909 (13_1_X) and https://github.com/cms-sw/cmssw/pull/41910 (13_0_X)

missirol commented 1 year ago

Reporting another HLT crash which may be related to this issue.

(..)

Current Modules: Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates (crashed) Module: CkfTrackCandidateMaker:hltIter2PFlowCkfTrackCandidatesForDisplaced Module: HcalRawToDigi:hltHcalDigis Module: RecoTauProducer:hltHpsCombinatoricRecoTaus Module: CkfTrackCandidateMaker:hltIterL3OIGlbDisplacedTrackCandidates Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracksForDisplaced Module: L2MuonProducer:hltL2CosmicMuons Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoubletsUnseeded Module: EcalUncalibRecHitProducer:hltEcalUncalibRecHitCPUOnly Module: PFClusterProducer:hltParticleFlowClusterPSUnseeded Module: PFBlockProducer:hltParticleFlowBlockForTaus Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks Module: CkfTrackCandidateMaker:hltMuCkfTrackCandidates Module: HLTL1TSeed:hltL1sDoubleEGXer1p2dRMaxY Module: LightPFTrackProducer:hltLightPFTracks Module: PFClusterProducer:hltParticleFlowClusterPSUnseeded Module: HitPairEDProducer:hltElePixelHitDoubletsUnseeded Module: none Module: FastjetJetProducer:hltAK4CaloJetsPF Module: PathStatusInserter:HLT_CaloMET350_NotCleaned_v8 Module: PFBlockProducer:hltParticleFlowBlockForDisplTaus Module: CAHitNtupletCUDAPhase1:hltPixelTracksCPUOnly Module: CkfTrackCandidateMaker:hltIter0IterL3FromL1MuonCkfTrackCandidates Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoublets Module: PFClusterProducer:hltParticleFlowClusterHBHE Module: CSCRecHitDProducer:hltCsc2DRecHits Module: PFBlockProducer:hltParticleFlowBlock Module: RecoTauJetRegionProducer:hltTauPFJets08Region Module: none Module: SiPixelClusterProducer:hltSiPixelClustersRegForDisplaced Module: PFClusterProducer:hltParticleFlowClusterHBHE A fatal system signal has occurred: segmentation violation

missirol commented 1 year ago

Reporting another HLT crash which may be related to this issue.

(..)

Current Modules: Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidatesCPUOnly (crashed) Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks Module: CorrectedECALPFClusterProducer:hltParticleFlowClusterECALUnseeded Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoubletsUnseeded Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoubletsUnseeded Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoublets Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoublets Module: CkfTrajectoryMaker:hltL3TrackCandidateFromL2IOHit Module: MuonIdProducer:hltGlbTrkMuonsLowPtIter01Merge Module: CkfTrackCandidateMaker:hltIter0IterL3FromL1MuonCkfTrackCandidates Module: FastjetJetProducer:hltAK4PixelOnlyPFJets Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx Module: TrackProducer:hltIter0IterL3FromL1MuonCtfWithMaterialTracks Module: HLTL1TSeed:hltL1sTripleMuOpen53p52UpsilonMuon Module: DeepTauId:hltHpsPFTauDeepTauProducerForVBFIsoTau Module: HLTL1TSeed:hltL1VBFIsoEG Module: SeedCombiner:hltElePixelSeedsCombined Module: CorrectedCaloJetProducer:hltAK4CaloJetsCorrected Module: MuonIdProducer:hltMuonsForDisplTau Module: GlobalEvFOutputModule:hltOutputCalibration Module: CkfTrackCandidateMaker:hltIter0IterL3FromL1MuonCkfTrackCandidates Module: FastjetJetProducer:hltAK4CaloJets Module: FastjetJetProducer:hltAK8CaloJets Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoublets Module: CaloTowersCreator:hltTowerMakerForAll Module: none Module: PFMultiDepthClusterProducer:hltParticleFlowClusterHCAL Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidatesCPUOnly Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates A fatal system signal has occurred: segmentation violation

makortel commented 1 year ago

Extracting more stack trace from https://github.com/cms-sw/cmssw/issues/41786#issuecomment-1586317712

Thread 21 (Thread 0x7f9f003fc700 (LWP 1659889) "cmsRun"):
#2  0x00007f9fd21eded0 in sig_pause_for_stacktrace () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f9f794fc41a in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#5  0x00007f9f794fc8a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#6  0x00007f9f794ff0f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#7  0x00007f9f7946a347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#8  0x00007f9ed7df21b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#9  0x00007f9ed7de538d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#10 0x00007f9ed7de8846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#11 0x00007f9ed7da2263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#12 0x00007f9ed7da3ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#13 0x00007f9fdbbd095d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
#14 0x00007f9fdbbb7072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 14 (Thread 0x7f9f03bff700 (LWP 1659882) "cmsRun"):
#3  0x00007f9fd21f133b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f9e26dcff20 in ?? ()
#6  0x00007f9f763b6216 in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripClusterizerPlugins.so
#7  0x00007f9f794fc4bd in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8  0x00007f9f7950eb28 in TkStripMeasurementDet::recHits(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, std::vector<float, std::allocator<float> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#9  0x00007f9f7950ef0d in TkStripMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x00007f9f7946a347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#11 0x00007f9ed7df21b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007f9ed7de538d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#13 0x00007f9ed7de8846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#14 0x00007f9ed7da2263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#15 0x00007f9ed7da3ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00007f9fdbbd095d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
#17 0x00007f9fdbbb7072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
makortel commented 1 year ago

In https://github.com/cms-sw/cmssw/issues/41786#issuecomment-1586320435

only one thread was in StMeasurementDetSet::getDetSet(), making the stack trace different from the earlier ones. Under the "race condition somewhere in call chain" hypothesis, the closest match would be

Thread 36 (Thread 0x7f3a813ff700 (LWP 2036162) "cmsRun"):
#2  0x00007f3bc7f2ded0 in sig_pause_for_stacktrace () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f3b6f22418d in SiStripRecHit2D& std::vector<SiStripRecHit2D, std::allocator<SiStripRecHit2D> >::emplace_back<Point3DBase<float, LocalTag> const&, LocalError const&, GeomDet const&, edm::Ref<edmNew::DetSetVector<SiStripCluster>, SiStripCluster, edmNew::DetSetVector<SiStripCluster>::FindForDetSetVector> const&>(Point3DBase<float, LocalTag> const&, LocalError const&, GeomDet const&, edm::Ref<edmNew::DetSetVector<SiStripCluster>, SiStripCluster, edmNew::DetSetVector<SiStripCluster>::FindForDetSetVector> const&) [clone .isra.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#5  0x00007f3b6f224c74 in bool TkStripMeasurementDet::filteredRecHits<edm::Ref<edmNew::DetSetVector<SiStripCluster>, SiStripCluster, edmNew::DetSetVector<SiStripCluster>::FindForDetSetVector> >(edm::Ref<edmNew::DetSetVector<SiStripCluster>, SiStripCluster, edmNew::DetSetVector<SiStripCluster>::FindForDetSetVector> const&, StripCPE::AlgoParam const&, TrajectoryStateOnSurface const&, MeasurementEstimator const&, std::vector<bool, std::allocator<bool> > const&, std::vector<SiStripRecHit2D, std::allocator<SiStripRecHit2D> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#6  0x00007f3b6f22de80 in TkStripMeasurementDet::simpleRecHits(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, std::vector<SiStripRecHit2D, std::allocator<SiStripRecHit2D> >&) const [clone .isra.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#7  0x00007f3b6f21f15d in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8  0x00007f3b6f18a347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#9  0x00007f3acd30f1b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#10 0x00007f3acd30238d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#11 0x00007f3acd305846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007f3acd2bf263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#13 0x00007f3acd2c0ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#14 0x00007f3bd192d95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
#15 0x00007f3bd1914072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 16 (Thread 0x7f3af7dfc700 (LWP 2035972) "cmsRun"):
#3  0x00007f3bc7f3133b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f3b6c0610f1 in sistrip::FEDBuffer::findChannels() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libEventFilterSiStripRawToDigi.so
#6  0x00007f3b6c0d621e in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripClusterizerPlugins.so
#7  0x00007f3b6f21c4bd in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8  0x00007f3b6f21c8a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#9  0x00007f3b6f21f0f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x00007f3b6f18a347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#11 0x00007f3acd30f1b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007f3acd30238d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#13 0x00007f3acd305846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#14 0x00007f3acd2bf263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#15 0x00007f3acd2c0ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00007f3bd192d95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
#17 0x00007f3bd1914072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so

On the other hand, this observation supports my earlier hunch of the race condition in StMeasurementDetSet not being the full cause of the crash in sistrip::FEDBuffer::findChannels() (https://github.com/cms-sw/cmssw/issues/41786#issuecomment-1576615773).

makortel commented 1 year ago

@Dr15Jones pointed out that after https://github.com/cms-sw/cmssw/pull/41872 the StMeasurementDetSet::getSet() still has a race condition in the assignment https://github.com/cms-sw/cmssw/blob/e18c96dcf1c5d76ebcb075dc7b3446e92796ad6c/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h#L240-L241

makortel commented 1 year ago

StMeasurementDetSet::getSet() still has a race condition in the assignment

Fix proposed in https://github.com/cms-sw/cmssw/pull/41936 (to be backported to 13_0_X as well)
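As an illustration of why a plain assignment to shared state can still race after the fill itself is protected, here is a hedged sketch of one common alternative: build the result privately, then publish it with a single atomic compare-and-swap so that readers either see nothing or a fully constructed object. This is not a description of #41936; `Clusters` and `PublishedOnce` are hypothetical names invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Hypothetical payload standing in for "the clusters of one detector".
struct Clusters {
  int n = 0;
};

class PublishedOnce {
public:
  ~PublishedOnce() { delete ptr_.load(std::memory_order_acquire); }

  // Returns the published payload, creating it from n if nobody has published yet.
  const Clusters& get(int n) {
    Clusters* seen = ptr_.load(std::memory_order_acquire);
    if (seen != nullptr) {
      return *seen;  // already published, and fully constructed
    }
    // Build privately: no other thread can observe this object yet.
    auto fresh = std::make_unique<Clusters>();
    fresh->n = n;
    Clusters* expected = nullptr;
    if (ptr_.compare_exchange_strong(expected, fresh.get(),
                                     std::memory_order_acq_rel,
                                     std::memory_order_acquire)) {
      return *fresh.release();  // we won: ownership moves to ptr_
    }
    return *expected;  // another thread won; our copy is destroyed on return
  }

private:
  std::atomic<Clusters*> ptr_{nullptr};
};
```

The cost of this pattern is that two threads may both do the work and one result is thrown away, so it only fits cases where a duplicated fill is harmless.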

missirol commented 1 year ago

The fixes in https://github.com/cms-sw/cmssw/pull/41872 and https://github.com/cms-sw/cmssw/pull/41936 were integrated and backported, and CMSSW_13_0_9 includes both. (Thanks for that !)

After HLT deployed CMSSW_13_0_9 online, we saw a runtime crash which looks similar to the ones discussed in this issue. We can share the corresponding error-stream file once available, if that helps.

(..)

Thread 10 (Thread 0x7f5b987fe700 (LWP 3825123) "cmsRun"):
#0  0x00007f5c10ae3a71 in poll () from /lib64/libc.so.6
#1  0x00007f5c079d846f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2  0x00007f5c079a3b6c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  0x00007f5c079a433b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f5a566562b0 in ?? ()
#6  0x00007f5baee67026 in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripClusterizerPlugins.so
#7  0x00007f5bb1fae355 in StMeasurementDetSet::detSet(int) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8  0x00007f5bb1fc024c in TkStripMeasurementDet::recHits(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, std::vector<float, std::allocator<float> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#9  0x00007f5bb1fc091d in TkStripMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x00007f5bb1f1c347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#11 0x00007f5b38da01b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007f5b38d9338d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#13 0x00007f5b38d96846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#14 0x00007f5b38d50263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#15 0x00007f5b38d51ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00007f5c1353095d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreFramework.so
#17 0x00007f5c13517072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreFramework.so
#18 0x00007f5c134a36da in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreFramework.so
#19 0x00007f5c134a3b88 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreFramework.so
#20 0x00007f5c131f8f79 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreConcurrency.so
#21 0x00007f5c11c75304 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7f5abcf5af00, waiter=..., this=0x7f5c0b9f3a00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#22 tbb::detail::r1::task_dispatcher::local_wait_for_all (t=0x0, waiter=..., this=0x7f5c0b9f3a00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#23 tbb::detail::r1::arena::process (tls=..., this=) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/arena.cpp:137
#24 tbb::detail::r1::market::process (this=, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/market.cpp:599
#25 0x00007f5c11c774c6 in tbb::detail::r1::rml::private_worker::run (this=0x7f5c0b9e7d80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#26 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f5c0b9e7d80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#27 0x00007f5c10dc117a in start_thread () from /lib64/libpthread.so.0
#28 0x00007f5c10aeedf3 in clone () from /lib64/libc.so.6

(..)

Current Modules: Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidatesCPUOnly (crashed) Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx Module: PFBlockProducer:hltParticleFlowBlock Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracksCPUOnly Module: PFClusterProducer:hltParticleFlowClusterHBHE Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks Module: GlobalEvFOutputModule:hltOutputPhysicsHLTPhysics2 Module: CorrectedECALPFClusterProducer:hltParticleFlowClusterECALUnseeded Module: ElectronNHitSeedProducer:hltEgammaElectronPixelSeedsUnseeded Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoubletsUnseeded Module: RecoTauProducer:hltHpsCombinatoricRecoTausDispl Module: MuonHLTSeedMVAClassifier:hltIter0IterL3FromL1MuonPixelSeedsFromPixelTracksFiltered Module: TrackProducer:hltIter0IterL3FromL1MuonCtfWithMaterialTracks Module: PFMultiDepthClusterProducer:hltParticleFlowClusterHCAL Module: none Module: PFClusterProducer:hltParticleFlowClusterHBHE Module: PFRecHitProducer:hltParticleFlowRecHitHF Module: GlobalEvFOutputModule:hltOutputParkingDoubleMuonLowMass3 Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoublets Module: PixelTrackProducerFromSoAPhase1:hltPixelTracksFromSoACPUOnly Module: HLTRegionalEcalResonanceFilter:hltAlCaPi0RecHitsFilterEBonlyRegional Module: PFBlockProducer:hltParticleFlowBlockForTaus Module: AlcaPCCEventProducer:hltAlcaPixelClusterCounts Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates Module: HitPairEDProducer:hltElePixelHitDoubletsForTripletsUnseeded Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks Module: GsfTrackProducer:hltEgammaGsfTracksUnseeded Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates Module: MuonHLTSeedMVAClassifier:hltIter0IterL3MuonPixelSeedsFromPixelTracksFiltered Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidatesCPUOnly A fatal system signal has occurred: segmentation violation

slava77 commented 1 year ago

type tracking

makortel commented 1 year ago

Thanks @missirol for reporting the new stack trace. I didn't see any obviously related activity in the other threads. I suppose further investigation should focus on the contents of the fill() function itself (as I also suspected earlier) https://github.com/cms-sw/cmssw/blob/127b308a7399436720edaa0b06fa727c9cf1a5a9/RecoLocalTracker/SiStripClusterizer/plugins/ClustersFromRawProducer.cc#L328

missirol commented 1 year ago

@dan131riley , would it be useful to backport #42194 to 13_0_X (and 13_1_X) as part of debugging these online crashes ?

dan131riley commented 1 year ago

@dan131riley , would it be useful to backport #42194 to 13_0_X (and 13_1_X) as part of debugging these online crashes ?

That PR is entirely about reducing false positives, it wouldn't help with the HLT crashes.

dan131riley commented 1 year ago

Naive question: are there circumstances where the FEDRawDataCollection could get released while the event is still in progress? Currently the on-demand getter holds a reference to the FEDRawDataCollection--should it be keeping a Handle to the FEDRawDataCollection instead?

Dr15Jones commented 1 year ago

@dan131riley it is possible to tell the framework to delete a data product early. See process.options.canDeleteEarly for the list of data products that a configuration has marked to be allowed to delete early. I would not expect FEDRawDataCollection to be on that list since it has to remain in the event until the OutputModule.

If FEDRawDataCollection is marked for early deletion, one must also declare every data product that references it (say, by holding pointers or even an edm::Ref to it) in the configuration parameter

process.options.holdsReferencesToDeleteEarly

fwyzard commented 1 year ago

As far as I can see from a recent configuration (attached: hlt.py.gz), HLT does not perform any early deletion.
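
A minimal sketch of how such a check could be scripted rather than done by eye. It relies only on the two parameter names mentioned above (`process.options.canDeleteEarly` and `process.options.holdsReferencesToDeleteEarly`) and assumes both can be iterated like lists. The helper names and the example branch name are hypothetical, and the stand-in objects below only mimic a loaded configuration so that the snippet runs without CMSSW.

```python
from types import SimpleNamespace


def early_deletion_settings(process):
    """Return (canDeleteEarly, holdsReferencesToDeleteEarly) as plain lists.

    Parameters that are missing from the configuration are reported as
    empty lists.
    """
    options = getattr(process, "options", None)
    can_delete = list(getattr(options, "canDeleteEarly", []))
    holds_refs = list(getattr(options, "holdsReferencesToDeleteEarly", []))
    return can_delete, holds_refs


def marks_for_early_deletion(process, type_name="FEDRawDataCollection"):
    """True if any canDeleteEarly entry mentions the given type name."""
    can_delete, _ = early_deletion_settings(process)
    return any(type_name in str(entry) for entry in can_delete)


if __name__ == "__main__":
    # Stand-in for a configuration without early deletion.
    no_early_deletion = SimpleNamespace(
        options=SimpleNamespace(canDeleteEarly=[], holdsReferencesToDeleteEarly=[])
    )
    # Stand-in with a hypothetical branch name, for illustration only.
    with_early_deletion = SimpleNamespace(
        options=SimpleNamespace(
            canDeleteEarly=["FEDRawDataCollection_rawDataCollector__HLT"],
            holdsReferencesToDeleteEarly=[],
        )
    )
    print(marks_for_early_deletion(no_early_deletion))
    print(marks_for_early_deletion(with_early_deletion))
```

In a CMSSW environment the same helpers could be applied to the `process` object defined by the attached hlt.py once it has been loaded.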

dan131riley commented 1 year ago

> As far as I can see from a recent configuration (attached: hlt.py.gz), HLT does not perform any early deletion.

Thanks, that all makes sense. I'm having trouble constructing scenarios that could account for the crashes in sistrip::FEDBuffer::findChannels(), so there's some clutching at straws in effect trying to eliminate possibilities.

missirol commented 1 year ago

Adding a belated summary of recent online crashes which might be related to this issue. All the runs below are 2023 pp-collisions runs after run-369870. The CMSSW release used in these runs was CMSSW_13_0_N with N >= 9. So far, these crashes were not reproduced offline. A recipe to try and reproduce is in [*].

Legend: run number, [total number of online crashes] number of crashes possibly related to this issue (based on my naive reading of the attached stack traces).

[*] Recipe tested on lxplus-gpu: https://gist.github.com/missirol/45e9626c967e415ca39d2e86c7d26a4b

# example to run on files from run-370560 with 32 threads and 24 streams
./rerun_hlt_on_error_stream.sh -t 32 -s 24 \
 -i /eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream \
 -r 370560 -o tmp

fwyzard commented 1 year ago

If all the crashes are there since CMSSW_13_0_9, maybe #42033 is related ?

mmusich commented 1 year ago

> If all the crashes are there since CMSSW_13_0_9, maybe https://github.com/cms-sw/cmssw/pull/42033 is related ?

I doubt it, since the first report is from May 28th (CMSSW_13_0_6): https://github.com/cms-sw/cmssw/issues/41786#issue-1729457647

fwyzard commented 1 year ago

Ah OK, thanks for pointing this out.