Open missirol opened 1 year ago
A new Issue was created by @missirol Marino Missiroli.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign hlt
(I let others assign to other groups, if needed.)
New categories assigned: hlt
@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks
The corresponding error-stream files are available, but first attempts to reproduce the crashes offline failed (tried on Hilton machine).
This is another instance of recent HLT crashes that I can't reproduce offline (see for example #40174, #41741 and #41742).
This time I can also include the full log of the CMSSW job that crashed (see [1]), but I don't know if that helps.
@smorovic , is it possible to draw any conclusions comparing the log of the CMSSW job [1] and the content of the error-stream files [2] ?
[1] old_hlt_run367906_pid2793118.log
[2] /eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run367906/
Event IDs in two raw files:
run367906_ls0056_index000213_fu-c2b03-18-01_pid2793118.raw
128082587 - 128091658
run367906_ls0056_index000236_fu-c2b03-18-01_pid2793118.raw
128183442 - 128186805
The last message in the log is from an event in one of the previous files:
%MSG-e TrajectoryNotPosDef: TrackProducer:hltL3NoFiltersTkTracksFromL2IOHitNoVtx 24-May-2023 22:36:51 CEST Run: 367906 Event: 127979616
Trajectory covariance is not positive-definite
%MSG
Timestamps of last few files appearing locally at hltd for that process (last 3).
INFO:2023-05-24 22:36:49 - processIndexFile - RUN:367906 - run367906_ls0056_index000189_pid2793118.jsn
INFO:2023-05-24 22:36:51 - processIndexFile - RUN:367906 - run367906_ls0056_index000213_pid2793118.jsn
INFO:2023-05-24 22:36:52 - processIndexFile - RUN:367906 - run367906_ls0056_index000236_pid2793118.jsn
INFO:2023-05-24 22:37:04 - processCRASHfile - RUN:367906 - 'run367906_ls0000_crash_pid2793118.jsn' with errcode: -11
INFO:2023-05-24 22:37:04 - processCRASHFile - RUN:367906 - inputFileList: run367906_ls0056_index000213_fu-c2b03-18-01_pid2793118.raw,run367906_ls0056_index000236_fu-c2b03-18-01_pid2793118.raw
However, this looks ok. The last two files opened by the process were also saved; older ones were already handled and closed. The source keeps up to 2 files open and buffered at a time.
For a crash, there is no event ID information (this is known only for an Exception).
assign reconstruction
FYI @cms-sw/tracking-pog-l2
New categories assigned: reconstruction
@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks
Possibly incidental, but there are two other threads in StMeasurementDetSet::getDetSet(int) at the time of the crash
Thread 36 (Thread 0x7fe8a65ff700 (LWP 2794392) "cmsRun"):
#2 0x00007fe9eac60ed0 in sig_pause_for_stacktrace () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007fe990f58e90 in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripClusterizerPlugins.so
#5 0x00007fe9940a04bd in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#6 0x00007fe9940a08a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#7 0x00007fe9940a30f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8 0x00007fe99400e347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#9 0x00007fe8f21a01b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#10 0x00007fe8f219338d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#11 0x00007fe8f2196846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007fe8f2150263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#13 0x00007fe8f2151ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#14 0x00007fe9f67ad95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#15 0x00007fe9f6794072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
Thread 21 (Thread 0x7fe91dbfe700 (LWP 2794140) "cmsRun"):
#2 0x00007fe9eac60ed0 in sig_pause_for_stacktrace () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007fe9940a0480 in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#5 0x00007fe9940a08a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#6 0x00007fe9940a30f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#7 0x00007fe99400e347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#8 0x00007fe8f21a01b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#9 0x00007fe8f219338d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#10 0x00007fe8f2196846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#11 0x00007fe8f2150263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#12 0x00007fe8f2151ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#13 0x00007fe9f67ad95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#14 0x00007fe9f6794072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
So threads 36 and 6 (crashing one) are operating on the same StMeasurementDetSet object (address 0x00007fe9940a04bd). The code of StMeasurementDetSet::detSet() and StMeasurementDetSet::getDetSet() is technically not thread safe
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h#L207-L211
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h#L230-L241
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h#L250-L256
I'm assuming that detIndex_ does not change during event processing, but the elements of empty_ and ready_ are accessed and modified without any protection.
On a cursory look the edmNew::DetSet<SiStripCluster>::set() (called on line 232 above) looks like it would be thread safe. Both threads end up calling ClusterFiller::fill(), but it could be for different values of i.
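To make the concern about unprotected per-element flags concrete, here is a minimal sketch of one way to guard a lazily filled element with an atomic state. It is not the code in TkMeasurementDetSet.h and not necessarily what any fix does; `LazyElements`, `get()`, `fill()` and the three states are hypothetical names chosen for illustration only.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a collection of lazily filled elements;
// none of these names come from CMSSW.
class LazyElements {
public:
  explicit LazyElements(std::size_t n) : state_(n), value_(n, 0) {}

  // Return the value of element i, filling it on first use.
  int get(std::size_t i) {
    // Fast path: the acquire load pairs with the release store below,
    // so a thread that observes kReady also observes the filled value.
    if (state_[i].load(std::memory_order_acquire) == kReady)
      return value_[i];

    int expected = kEmpty;
    if (state_[i].compare_exchange_strong(expected, kFilling, std::memory_order_acq_rel)) {
      // Exactly one thread wins the exchange and performs the fill.
      value_[i] = fill(i);
      fills_.fetch_add(1, std::memory_order_relaxed);
      state_[i].store(kReady, std::memory_order_release);
    } else {
      // Another thread is filling (or has filled) this element: wait for it.
      while (state_[i].load(std::memory_order_acquire) != kReady)
        std::this_thread::yield();
    }
    return value_[i];
  }

  int fillCount() const { return fills_.load(std::memory_order_relaxed); }

private:
  enum : int { kEmpty = 0, kFilling = 1, kReady = 2 };

  // Placeholder for the expensive on-demand work (e.g. unpacking).
  static int fill(std::size_t i) { return static_cast<int>(i) * 2; }

  std::vector<std::atomic<int>> state_;  // value-initialized, i.e. kEmpty
  std::vector<int> value_;
  std::atomic<int> fills_{0};
};
```

With plain (non-atomic) flags in place of `state_`, two threads could both see "not ready" and both run the fill, and a reader could see the flag set before the value is visible. The sketch trades that for a short spin while another thread is filling.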
Another possible thread-safety problem is in edmNew::DetSetVector<T>::update()
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/DataFormats/Common/interface/DetSetVectorNew.h#L634-L649
Here the m_getter is defined as
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/DataFormats/Common/interface/DetSetVectorNew.h#L88
but in practice is used as a pointer to Getter, which is defined as
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/DataFormats/Common/interface/DetSetVectorNew.h#L164
and the LazyGetter<T>::fill() is not defined as const!
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/DataFormats/Common/interface/DetSetVectorNew.h#L608-L614
So if the concrete LazyGetter<T>::fill() is not thread-safe, it could cause problems. In this case the concrete LazyGetter<T> is ClusterFiller
https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/RecoLocalTracker/SiStripClusterizer/plugins/ClustersFromRawProducer.cc#L327
(which I haven't digested yet)
Note that despite all I wrote above, I can't tell from the stack trace whether the problem is really thread safety or something else.
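As an illustration of why a non-const fill() reached through a stored pointer is worth flagging, here is a minimal sketch of the general C++ hazard. It does not reproduce the declarations in DetSetVectorNew.h; `Getter`, `CountingGetter`, `Container` and `update()` are hypothetical names, and the only point is that constness of the container does not extend to the object its smart pointer refers to.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical getter interface with a non-const fill().
struct Getter {
  virtual ~Getter() = default;
  virtual int fill(int id) = 0;  // non-const: implementations may mutate state
};

// Hypothetical implementation that mutates a member on every call.
struct CountingGetter : Getter {
  int calls = 0;
  int fill(int id) override {
    ++calls;  // unsynchronized write: a data race if two threads call fill()
    return id * 10;
  }
};

// Hypothetical container that holds the getter through a shared_ptr.
class Container {
public:
  explicit Container(std::shared_ptr<Getter> g) : getter_(std::move(g)) {}

  // Declared const, yet it calls the non-const fill(): a const
  // shared_ptr<Getter> still hands out a non-const Getter.
  int update(int id) const { return getter_->fill(id); }

private:
  std::shared_ptr<Getter> getter_;
};
```

Because `update()` is const, callers may reasonably assume it is safe to call concurrently on a shared container, while the getter behind it is free to modify its own state. Declaring `fill()` const in the interface makes the compiler reject such hidden mutation unless it is explicitly marked.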
> the LazyGetter<T>::fill() is not defined as const!
This part is now addressed in https://github.com/cms-sw/cmssw/pull/41853 . It helped me conclude that the ClusterFiller in https://github.com/cms-sw/cmssw/blob/1bce7ade2a2d213cf85ce748dc16cd34ed104c0c/RecoLocalTracker/SiStripClusterizer/plugins/ClustersFromRawProducer.cc#L327 looks like it would be thread safe.
> The code of StMeasurementDetSet::detSet() and StMeasurementDetSet::getDetSet() are technically not thread safe
The race condition mentioned above is fixed in https://github.com/cms-sw/cmssw/pull/41872. I'm not convinced, though, that it is the full cause of the crash. Ideally the race condition would only lead to edmNew::DetSet<SiStripCluster>::set() being called more often than needed, but strictly speaking a race condition leads to undefined behavior, so who knows.
Thanks for the suggested fix, @makortel !
> Thanks for the suggested fix
@missirol Do you want it backported to 13_0_X? (since it is unclear whether it plays a role in the crash)
If it's clear that it is a fix (even partial), I would be in favor of backporting it, since we will still use 13_0_X online for a while. If it helps, I can prepare the backports.
> If it's clear that it is a fix (even partial), I would be in favor of backporting it, since we will still use 13_0_X online for a while.
Thanks, I'll prepare the backports after the review of https://github.com/cms-sw/cmssw/pull/41872 completes (in the current form it is easily cherry-pickable).
As long as we're looking at DetSetNew: we're getting DetSetNew assertion failures on aarch64 with some frequency
/data/cmsbld/jenkins_b/workspace/build-any-ib/w/tmp/BUILDROOT/95e24eec79ed42decc0c70dcac7a0f7d/opt/cmssw/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/src/DataFormats/Common/interface/DetSetNew.h:86: const data_type* edmNew::DetSet<T>::data() const [with T = SiStripCluster; edmNew::DetSet<T>::data_type = SiStripCluster]: Assertion `m_data' failed.
The test at line 85 looks to be wrong--using a bitwise OR instead of logical, and m_offset is initialized to -1. There's probably also a race condition, but I haven't stared at it long enough yet.
Stack trace:
Thread 3 (Thread 0x400086359260 (LWP 2601823) "cmsRun"):
#8 0x00004000385bfc18 in __assert_fail () from /lib64/libc.so.6
#9 0x0000400063adda58 in edmNew::DetSet<SiStripCluster>::data() const [clone .part.0] [clone .lto_priv.0] () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x0000400063ae790c in TkStripMeasurementDet::recHits(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, std::vector<float, std::allocator<float> >&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#11 0x0000400063ae7c38 in TkStripMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#12 0x0000400063b8723c in LayerMeasurements::measurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-05-28-2300/lib/el8_aarch64_gcc11/libTrackingToolsMeasurementDet.so
#13 0x00004000a5fa57bc in MuonCkfTrajectoryBuilder::collectMeasurement(DetLayer const*, std::vector<DetLayer const*, std::allocator<DetLayer const*> > const&, TrajectoryStateOnSurface const&, std::vector<TrajectoryMeasurement, std::allocator<TrajectoryMeasurement> >&, int&, Propagator const*) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoMuonL3TrackFinder.so
#14 0x00004000a5fa743c in MuonCkfTrajectoryBuilder::findCompatibleMeasurements(TrajectorySeed const&, TempTrajectory const&, std::vector<TrajectoryMeasurement, std::allocator<TrajectoryMeasurement> >&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoMuonL3TrackFinder.so
#15 0x00004000a5f380cc in CkfTrajectoryBuilder::limitedCandidates(std::shared_ptr<TrajectorySeed const> const&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<Trajectory, std::allocator<Trajectory> >&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00004000a5f392a8 in CkfTrajectoryBuilder::limitedCandidates(TrajectorySeed const&, TempTrajectory&, std::vector<Trajectory, std::allocator<Trajectory> >&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoTrackerCkfPattern.so
#17 0x00004000a5f394dc in CkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoTrackerCkfPattern.so
#18 0x00004000a5f2d7fc in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoTrackerCkfPattern.so
#19 0x00004000a5f2edc4 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02787/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-01-2300/lib/el8_aarch64_gcc11/libRecoTrackerCkfPattern.so
> The test at line 85 looks to be wrong--using a bitwise OR instead of logical, and m_offset is initialized to -1
I agree (especially that the m_offset check should be against -1). Could you make a PR?
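To illustrate the two points about the check (bitwise vs. logical test, and a sentinel of -1), here is a minimal sketch. The exact expression at line 85 of DetSetNew.h is not reproduced in this thread, so the form of the "bitwise" test below is an assumption; `Slot`, `expectsDataBitwise()` and `expectsDataExplicit()` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical element descriptor: an unfilled element is marked by
// offset == -1 and size == 0 (assumed convention for this sketch).
struct Slot {
  int offset = -1;
  int size = 0;
};

// Assumed shape of a bitwise test: true whenever any bit is set in either
// field. The sentinel -1 has all bits set, so an unfilled slot passes.
inline bool expectsDataBitwise(const Slot& s) { return (s.offset | s.size) != 0; }

// Explicit test: compare the offset against the sentinel and require a
// non-zero size, combined with a logical AND.
inline bool expectsDataExplicit(const Slot& s) { return s.offset != -1 && s.size != 0; }
```

For an unfilled `Slot{}` the bitwise form reports that data is expected, so a following `assert(m_data)`-style check would fire even though nothing was ever filled; the explicit form does not. For a filled slot such as `Slot{4, 2}` both forms agree.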
> There's probably also a race condition
At least the code has https://github.com/cms-sw/cmssw/blob/8617c803c50fc6d37fc77b50282db56e2ce1db1d/RecoTracker/MeasurementDet/plugins/TkStripMeasurementDet.cc#L39-L43 https://github.com/cms-sw/cmssw/blob/8617c803c50fc6d37fc77b50282db56e2ce1db1d/RecoTracker/MeasurementDet/plugins/TkStripMeasurementDet.h#L94 which ends up calling https://github.com/cms-sw/cmssw/blob/8617c803c50fc6d37fc77b50282db56e2ce1db1d/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h#L178 which is part of the race condition I'm trying to fix in https://github.com/cms-sw/cmssw/pull/41872 (assuming the stack trace is from an HLT job that does the on-demand strip unpacking and clustering; if not, the cause is likely something else)
> (assuming the stack trace is from an HLT job that does the on-demand strip unpacking and clustering
I think this is the case, as the config had
process.hltSiStripRawToClustersFacility = cms.EDProducer( "SiStripClusterizerFromRaw",
onDemand = cms.bool( True ),
[..]
> (assuming the stack trace is from an HLT job that does the on-demand strip unpacking and clustering
> I think this is the case, as the config had
I meant Dan's stack trace on the assertion failure on aarch64 (sorry for being unclear).
> If it's clear that it is a fix (even partial), I would be in favor of backporting it, since we will still use 13_0_X online for a while.
> Thanks, I'll prepare the backports after the review of #41872 completes (in the current form it is easily cherry-pickable).
The backports are in https://github.com/cms-sw/cmssw/pull/41909 (13_1_X) and https://github.com/cms-sw/cmssw/pull/41910 (13_0_X)
Reporting another HLT crash which may be related to this issue.
CMSSW_13_0_7
#3 0x00007f9fd21f133b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f9e26dcff20 in ?? ()
#6 0x00007f9f763b6216 in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripClusterizerPlugins.so
#7 0x00007f9f794fc4bd in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8 0x00007f9f7950eb28 in TkStripMeasurementDet::recHits(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, std::vector<float, std::allocator<float> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#9 0x00007f9f7950ef0d in TkStripMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x00007f9f7946a347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#11 0x00007f9ed7df21b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007f9ed7de538d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#13 0x00007f9ed7de8846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#14 0x00007f9ed7da2263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#15 0x00007f9ed7da3ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00007f9fdbbd095d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
#17 0x00007f9fdbbb7072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
(..)
Current Modules:
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates (crashed)
Module: CkfTrackCandidateMaker:hltIter2PFlowCkfTrackCandidatesForDisplaced
Module: HcalRawToDigi:hltHcalDigis
Module: RecoTauProducer:hltHpsCombinatoricRecoTaus
Module: CkfTrackCandidateMaker:hltIterL3OIGlbDisplacedTrackCandidates
Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracksForDisplaced
Module: L2MuonProducer:hltL2CosmicMuons
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates
Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoubletsUnseeded
Module: EcalUncalibRecHitProducer:hltEcalUncalibRecHitCPUOnly
Module: PFClusterProducer:hltParticleFlowClusterPSUnseeded
Module: PFBlockProducer:hltParticleFlowBlockForTaus
Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks
Module: CkfTrackCandidateMaker:hltMuCkfTrackCandidates
Module: HLTL1TSeed:hltL1sDoubleEGXer1p2dRMaxY
Module: LightPFTrackProducer:hltLightPFTracks
Module: PFClusterProducer:hltParticleFlowClusterPSUnseeded
Module: HitPairEDProducer:hltElePixelHitDoubletsUnseeded
Module: none
Module: FastjetJetProducer:hltAK4CaloJetsPF
Module: PathStatusInserter:HLT_CaloMET350_NotCleaned_v8
Module: PFBlockProducer:hltParticleFlowBlockForDisplTaus
Module: CAHitNtupletCUDAPhase1:hltPixelTracksCPUOnly
Module: CkfTrackCandidateMaker:hltIter0IterL3FromL1MuonCkfTrackCandidates
Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoublets
Module: PFClusterProducer:hltParticleFlowClusterHBHE
Module: CSCRecHitDProducer:hltCsc2DRecHits
Module: PFBlockProducer:hltParticleFlowBlock
Module: RecoTauJetRegionProducer:hltTauPFJets08Region
Module: none
Module: SiPixelClusterProducer:hltSiPixelClustersRegForDisplaced
Module: PFClusterProducer:hltParticleFlowClusterHBHE
A fatal system signal has occurred: segmentation violation
Reporting another HLT crash which may be related to this issue.
CMSSW_13_0_7
#3 0x00007f3bc7f3133b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f3b6c0610f1 in sistrip::FEDBuffer::findChannels() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libEventFilterSiStripRawToDigi.so
#6 0x00007f3b6c0d621e in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripClusterizerPlugins.so
#7 0x00007f3b6f21c4bd in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8 0x00007f3b6f21c8a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#9 0x00007f3b6f21f0f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x00007f3b6f18a347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#11 0x00007f3acd30f1b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007f3acd30238d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#13 0x00007f3acd305846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#14 0x00007f3acd2bf263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#15 0x00007f3acd2c0ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00007f3bd192d95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
#17 0x00007f3bd1914072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
(..)
Current Modules:
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidatesCPUOnly (crashed)
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx
Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks
Module: CorrectedECALPFClusterProducer:hltParticleFlowClusterECALUnseeded
Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoubletsUnseeded
Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoubletsUnseeded
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx
Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoublets
Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoublets
Module: CkfTrajectoryMaker:hltL3TrackCandidateFromL2IOHit
Module: MuonIdProducer:hltGlbTrkMuonsLowPtIter01Merge
Module: CkfTrackCandidateMaker:hltIter0IterL3FromL1MuonCkfTrackCandidates
Module: FastjetJetProducer:hltAK4PixelOnlyPFJets
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx
Module: TrackProducer:hltIter0IterL3FromL1MuonCtfWithMaterialTracks
Module: HLTL1TSeed:hltL1sTripleMuOpen53p52UpsilonMuon
Module: DeepTauId:hltHpsPFTauDeepTauProducerForVBFIsoTau
Module: HLTL1TSeed:hltL1VBFIsoEG
Module: SeedCombiner:hltElePixelSeedsCombined
Module: CorrectedCaloJetProducer:hltAK4CaloJetsCorrected
Module: MuonIdProducer:hltMuonsForDisplTau
Module: GlobalEvFOutputModule:hltOutputCalibration
Module: CkfTrackCandidateMaker:hltIter0IterL3FromL1MuonCkfTrackCandidates
Module: FastjetJetProducer:hltAK4CaloJets
Module: FastjetJetProducer:hltAK8CaloJets
Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoublets
Module: CaloTowersCreator:hltTowerMakerForAll
Module: none
Module: PFMultiDepthClusterProducer:hltParticleFlowClusterHCAL
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidatesCPUOnly
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates
A fatal system signal has occurred: segmentation violation
Extracting more stack trace from https://github.com/cms-sw/cmssw/issues/41786#issuecomment-1586317712
- Full log from DAQ: f3mon_run368566.txt (1st crash in the log)
Thread 21 (Thread 0x7f9f003fc700 (LWP 1659889) "cmsRun"):
#2 0x00007f9fd21eded0 in sig_pause_for_stacktrace () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007f9f794fc41a in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#5 0x00007f9f794fc8a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#6 0x00007f9f794ff0f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#7 0x00007f9f7946a347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#8 0x00007f9ed7df21b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#9 0x00007f9ed7de538d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#10 0x00007f9ed7de8846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#11 0x00007f9ed7da2263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#12 0x00007f9ed7da3ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#13 0x00007f9fdbbd095d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
#14 0x00007f9fdbbb7072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
Thread 14 (Thread 0x7f9f03bff700 (LWP 1659882) "cmsRun"):
#3 0x00007f9fd21f133b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f9e26dcff20 in ?? ()
#6 0x00007f9f763b6216 in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripClusterizerPlugins.so
#7 0x00007f9f794fc4bd in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8 0x00007f9f7950eb28 in TkStripMeasurementDet::recHits(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, std::vector<float, std::allocator<float> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#9 0x00007f9f7950ef0d in TkStripMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x00007f9f7946a347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#11 0x00007f9ed7df21b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007f9ed7de538d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#13 0x00007f9ed7de8846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#14 0x00007f9ed7da2263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#15 0x00007f9ed7da3ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00007f9fdbbd095d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
#17 0x00007f9fdbbb7072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
In https://github.com/cms-sw/cmssw/issues/41786#issuecomment-1586320435
- Full log from DAQ: f3mon_run368636.txt
only one thread was in StMeasurementDetSet::getDetSet(), making the stack trace different from the earlier ones. Under the "race condition somewhere in call chain" hypothesis the closest match would be
Thread 36 (Thread 0x7f3a813ff700 (LWP 2036162) "cmsRun"):
#2 0x00007f3bc7f2ded0 in sig_pause_for_stacktrace () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007f3b6f22418d in SiStripRecHit2D& std::vector<SiStripRecHit2D, std::allocator<SiStripRecHit2D> >::emplace_back<Point3DBase<float, LocalTag> const&, LocalError const&, GeomDet const&, edm::Ref<edmNew::DetSetVector<SiStripCluster>, SiStripCluster, edmNew::DetSetVector<SiStripCluster>::FindForDetSetVector> const&>(Point3DBase<float, LocalTag> const&, LocalError const&, GeomDet const&, edm::Ref<edmNew::DetSetVector<SiStripCluster>, SiStripCluster, edmNew::DetSetVector<SiStripCluster>::FindForDetSetVector> const&) [clone .isra.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#5 0x00007f3b6f224c74 in bool TkStripMeasurementDet::filteredRecHits<edm::Ref<edmNew::DetSetVector<SiStripCluster>, SiStripCluster, edmNew::DetSetVector<SiStripCluster>::FindForDetSetVector> >(edm::Ref<edmNew::DetSetVector<SiStripCluster>, SiStripCluster, edmNew::DetSetVector<SiStripCluster>::FindForDetSetVector> const&, StripCPE::AlgoParam const&, TrajectoryStateOnSurface const&, MeasurementEstimator const&, std::vector<bool, std::allocator<bool> > const&, std::vector<SiStripRecHit2D, std::allocator<SiStripRecHit2D> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#6 0x00007f3b6f22de80 in TkStripMeasurementDet::simpleRecHits(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, std::vector<SiStripRecHit2D, std::allocator<SiStripRecHit2D> >&) const [clone .isra.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#7 0x00007f3b6f21f15d in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8 0x00007f3b6f18a347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#9 0x00007f3acd30f1b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#10 0x00007f3acd30238d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#11 0x00007f3acd305846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007f3acd2bf263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#13 0x00007f3acd2c0ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#14 0x00007f3bd192d95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
#15 0x00007f3bd1914072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
Thread 16 (Thread 0x7f3af7dfc700 (LWP 2035972) "cmsRun"):
#3 0x00007f3bc7f3133b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f3b6c0610f1 in sistrip::FEDBuffer::findChannels() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libEventFilterSiStripRawToDigi.so
#6 0x00007f3b6c0d621e in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripClusterizerPlugins.so
#7 0x00007f3b6f21c4bd in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8 0x00007f3b6f21c8a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#9 0x00007f3b6f21f0f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x00007f3b6f18a347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libTrackingToolsMeasurementDet.so
#11 0x00007f3acd30f1b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007f3acd30238d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#13 0x00007f3acd305846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#14 0x00007f3acd2bf263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#15 0x00007f3acd2c0ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00007f3bd192d95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
#17 0x00007f3bd1914072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so
On the other hand, this observation supports my earlier hunch that the race condition in StMeasurementDetSet is not the full cause of the crash in sistrip::FEDBuffer::findChannels() (https://github.com/cms-sw/cmssw/issues/41786#issuecomment-1576615773).
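The per-thread check discussed above (which threads had StMeasurementDetSet::getDetSet() on their stack at the time of the crash) can be scripted when going through several logs. The following is only a minimal sketch: it assumes the log contains gdb-style output in the same shape as the excerpts quoted in this thread ("Thread N (Thread 0x... (LWP ...)" headers followed by "#k ..." frame lines), and the helper name threads_in_function is made up for illustration.

```python
import re

# Header of a per-thread block, as it appears in the excerpts quoted in this issue.
THREAD_RE = re.compile(r"^Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP \d+\)")


def threads_in_function(log_text, function_name):
    """Return the gdb thread numbers whose stack mentions function_name.

    Threads are returned in the order in which they appear in the log,
    each at most once.
    """
    found = []
    current = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        header = THREAD_RE.match(line)
        if header:
            # A new thread block starts here.
            current = int(header.group(1))
            continue
        # Frame lines start with '#<index>'; ignore anything before the first header.
        if current is not None and line.startswith("#") and function_name in line:
            if current not in found:
                found.append(current)
    return found
```

For example, calling threads_in_function(text, "StMeasurementDetSet::getDetSet") on the content of a log file returns the list of thread numbers to look at; a list with a single entry corresponds to the "only one thread" situation described above.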
@Dr15Jones pointed out that after https://github.com/cms-sw/cmssw/pull/41872 the StMeasurementDetSet::getSet() still has a race condition in the assignment
https://github.com/cms-sw/cmssw/blob/e18c96dcf1c5d76ebcb075dc7b3446e92796ad6c/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h#L240-L241
Fix proposed in https://github.com/cms-sw/cmssw/pull/41936 (to be backported to 13_0_X as well)
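As background for readers less familiar with this class of bug: the stack traces above show getDetSet() triggering the cluster filling on demand, so several threads may ask for the same detector's data at the same time. The sketch below illustrates, in Python and in a deliberately generic way, one common pattern for making such an on-demand fill safe (check, lock, check again, fill, publish). It is not the code of the PRs linked above, and the class and method names (LazySlots, get) are hypothetical; the real implementation is C++ and uses its own synchronisation.

```python
import threading


class LazySlots:
    """Fixed number of slots, each filled on first access by at most one thread."""

    def __init__(self, size, filler):
        self._values = [None] * size
        self._ready = [False] * size
        self._locks = [threading.Lock() for _ in range(size)]
        self._filler = filler  # callable(index) -> value, assumed to be expensive

    def get(self, index):
        # Fast path: the slot has already been published.
        if not self._ready[index]:
            with self._locks[index]:
                # Re-check under the lock: another thread may have filled
                # the slot while this one was waiting.
                if not self._ready[index]:
                    self._values[index] = self._filler(index)
                    # Publish only after the value is fully stored.
                    self._ready[index] = True
        return self._values[index]
```

The point of the pattern is that the "ready" flag is set only after the value is completely in place, and that the decision to fill is taken while holding the lock, so two threads never fill (or half-publish) the same slot.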
The fixes in https://github.com/cms-sw/cmssw/pull/41872 and https://github.com/cms-sw/cmssw/pull/41936 were integrated and backported, and CMSSW_13_0_9 includes both. (Thanks for that !)
After HLT deployed CMSSW_13_0_9 online, we saw a runtime crash which looks similar to the ones discussed in this issue. We can share the corresponding error-stream file once available, if that helps.
CMSSW_13_0_9
msgtime:2023-06-30 17:51:19
doc_type:cmsswlog
date:2023-06-30T15:51:19.990Z
run:369870
host:fu-c2b05-13-01
pid:3824147
doctype:stacktrace
severity:FATAL
severityVal:4
instance:global
lexicalId:549852445
message:A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
Fri Jun 30 17:50:56 CEST 2023
(..)
Thread 10 (Thread 0x7f5b987fe700 (LWP 3825123) "cmsRun"):
(..)
Current Modules: Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidatesCPUOnly (crashed) Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx Module: PFBlockProducer:hltParticleFlowBlock Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracksCPUOnly Module: PFClusterProducer:hltParticleFlowClusterHBHE Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks Module: GlobalEvFOutputModule:hltOutputPhysicsHLTPhysics2 Module: CorrectedECALPFClusterProducer:hltParticleFlowClusterECALUnseeded Module: ElectronNHitSeedProducer:hltEgammaElectronPixelSeedsUnseeded Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoubletsUnseeded Module: RecoTauProducer:hltHpsCombinatoricRecoTausDispl Module: MuonHLTSeedMVAClassifier:hltIter0IterL3FromL1MuonPixelSeedsFromPixelTracksFiltered Module: TrackProducer:hltIter0IterL3FromL1MuonCtfWithMaterialTracks Module: PFMultiDepthClusterProducer:hltParticleFlowClusterHCAL Module: none Module: PFClusterProducer:hltParticleFlowClusterHBHE Module: PFRecHitProducer:hltParticleFlowRecHitHF Module: GlobalEvFOutputModule:hltOutputParkingDoubleMuonLowMass3 Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoublets Module: PixelTrackProducerFromSoAPhase1:hltPixelTracksFromSoACPUOnly Module: HLTRegionalEcalResonanceFilter:hltAlCaPi0RecHitsFilterEBonlyRegional Module: PFBlockProducer:hltParticleFlowBlockForTaus Module: AlcaPCCEventProducer:hltAlcaPixelClusterCounts Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates Module: HitPairEDProducer:hltElePixelHitDoubletsForTripletsUnseeded Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks Module: GsfTrackProducer:hltEgammaGsfTracksUnseeded Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates Module: MuonHLTSeedMVAClassifier:hltIter0IterL3MuonPixelSeedsFromPixelTracksFiltered Module: 
CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidatesCPUOnly A fatal system signal has occurred: segmentation violation
type tracking
Thanks @missirol for reporting the new stack trace. I didn't see any obviously related activity in the other threads. I suppose further investigation should focus on the contents of the fill() function itself (as I also suspected earlier)
https://github.com/cms-sw/cmssw/blob/127b308a7399436720edaa0b06fa727c9cf1a5a9/RecoLocalTracker/SiStripClusterizer/plugins/ClustersFromRawProducer.cc#L328
@dan131riley , would it be useful to backport #42194 to 13_0_X (and 13_1_X) as part of debugging these online crashes ?
That PR is entirely about reducing false positives; it wouldn't help with the HLT crashes.
Naive question: are there circumstances where the FEDRawDataCollection could get released while the event is still in progress? Currently the on-demand getter holds a reference to the FEDRawDataCollection--should it be keeping a Handle to the FEDRawDataCollection instead?
@dan131riley it is possible to tell the framework to delete a data product early. See process.options.canDeleteEarly for the list of data products that a configuration has marked as allowed to be deleted early. I would not expect FEDRawDataCollection to be on that list since it has to remain in the event until the OutputModule.
If FEDRawDataCollection is marked for early deletion, one must also specify, in the configuration parameter process.options.holdsReferencesToDeleteEarly, any data products which reference the to-be-deleted data product (say, by holding pointers or even edm::Ref to it).
As far as I can see from a recent configuration (attached: hlt.py.gz), HLT does not perform any early deletion.
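A check of this kind can be scripted once the configuration has been loaded into a process object. The sketch below is only illustrative: it uses the two parameter names mentioned above (canDeleteEarly and holdsReferencesToDeleteEarly), assumes both can be iterated like lists, and treats a missing parameter as empty. The helper names and the stand-in object at the end are hypothetical; a real check would pass the process defined in hlt.py instead.

```python
from types import SimpleNamespace


def early_deletion_settings(process):
    """Collect the early-deletion related options of a configuration.

    Parameters that are absent from process.options are reported as empty lists.
    """
    options = getattr(process, "options", None)
    settings = {}
    for name in ("canDeleteEarly", "holdsReferencesToDeleteEarly"):
        value = getattr(options, name, None)
        settings[name] = list(value) if value is not None else []
    return settings


def performs_early_deletion(process):
    """True if at least one data product is marked as deletable early."""
    return bool(early_deletion_settings(process)["canDeleteEarly"])


# Stand-in for a loaded configuration (hypothetical; not the real HLT menu).
demo_process = SimpleNamespace(
    options=SimpleNamespace(canDeleteEarly=[], holdsReferencesToDeleteEarly=[])
)
print(performs_early_deletion(demo_process))
```

An empty canDeleteEarly list is what "does not perform any early deletion" means in practice.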
Thanks, that all makes sense. I'm having trouble constructing scenarios that could account for the crashes in sistrip::FEDBuffer::findChannels(), so there's some clutching at straws in effect trying to eliminate possibilities.
Adding a belated summary of recent online crashes which might be related to this issue. All the runs below are 2023 pp-collisions runs after run-369870. The CMSSW release used in these runs was CMSSW_13_0_N with N >= 9. So far, these crashes were not reproduced offline. A recipe to try and reproduce is in [*].
Legend: run number, [total number of online crashes] number of crashes possibly related to this issue (based on my naive reading of the attached stack traces).
[*] Recipe tested on lxplus-gpu:
https://gist.github.com/missirol/45e9626c967e415ca39d2e86c7d26a4b
# example to run on files from run-370560 with 32 threads and 24 streams
./rerun_hlt_on_error_stream.sh -t 32 -s 24 \
-i /eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream \
-r 370560 -o tmp
If all the crashes are there since CMSSW_13_0_9, maybe #42033 is related ?
I doubt it, since the first report is from May 28th (CMSSW_13_0_6): https://github.com/cms-sw/cmssw/issues/41786#issue-1729457647
Ah OK, thanks for pointing this out.
In run-367906 (pp collisions), DAQ reported 1 CMSSW crash at HLT (release: CMSSW_13_0_6) [link to HLT elog]. The stack trace is attached (f3mon_run367906.txt). A piece of stack trace which is possibly relevant is in [1].
The corresponding error-stream files are available, but first attempts to reproduce the crashes offline failed (tried on "Hilton" HLT node).
The recipe used for those failed attempts is adapted in [2] to be valid for lxplus and lxplus-gpu.
FYI: @cms-sw/hlt-l2 @silviodonato @fwyzard @mzarucki @trtomei
[1]
[2]