cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.09k stars 4.33k forks source link

PromptReco failure PromptReco_Run381379_ParkingSingleMuon4 #45162

Closed Dr15Jones closed 1 month ago

Dr15Jones commented 6 months ago

From https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381379-parkingsinglemuon4/42082

----- Begin Fatal Exception 06-Jun-2024 16:58:22 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 381379 lumi: 819 event: 1742750619 stream: 2
   [1] Running path 'write_AOD_step'
   [2] Prefetching for module PoolOutputModule/'write_AOD'
   [3] While reading from source GlobalObjectMapRecord hltGtStage2ObjectMap '' HLT
   [4] Rethrowing an exception that happened on a different read request.
   [5] Processing  Event run: 381379 lumi: 819 event: 1742683577 stream: 4
   [6] Running path 'dqmoffline_step'
   [7] Prefetching for module DQMMessageLogger/'DQMMessageLogger'
   [8] Prefetching for module LogErrorHarvester/'logErrorHarvester'
   [9] Prefetching for module CSCRecHitDProducer/'csc2DRecHits'
   [10] Prefetching for module CSCDCCUnpacker/'muonCSCDigis'
   [11] While reading from source FEDRawDataCollection rawDataCollector '' LHC
   [12] Reading branch FEDRawDataCollection_rawDataCollector__LHC.
Exception Message:
vector::_M_default_append
----- End Fatal Exception -------------------------------------------------

The tarball can be found here:

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/FileReadError/job/WMTaskSpace/cmsRun1 From the logs it seems to crash at event 1742503164. The error is reproducible locally.

cmsbuild commented 6 months ago

cms-bot internal usage

cmsbuild commented 6 months ago

A new Issue was created by @Dr15Jones.

@antoniovilela, @sextonkennedy, @smuzaffar, @makortel, @rappoccio, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 6 months ago

The job can be run by setting up a CMSSW_14_0_7 area, downloading the tarball (which is at /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/FileReadError/a406cf00-00a4-498e-b7e2-9ec39b964fac-216-3-logArchive.tar.gz )

Then after untarring go to directory job/WMTaskSpace/cmsRun1 and then do

cmsRun PSet.py
Dr15Jones commented 6 months ago

There appear to be lots of extraneous exceptions being thrown (and caught) in this job. The first one encountered is

%MSG-e SiStripMonitorTrack:  SiStripMonitorTrack:HLTSiStripMonitorTrack  06-Jun-2024 17:43:09 CEST Run: 381379 Event: 1741662696
ClusterCollection is not valid!!
%MSG
[Switching to Thread 0x7fffa05fe640 (LWP 3001818)]

Thread 7 "cmsRun" hit Catchpoint 1 (exception thrown), 0x00007ffff5ead0f1 in __cxxabiv1::__cxa_throw (obj=0x7ffde5f68b00, tinfo=0x7ffff79a0650 <typeinfo for edm::Exception>,
    dest=0x7ffff796a010 <edm::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
81      ../../../../libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory.
(gdb) where
#0  0x00007ffff5ead0f1 in __cxxabiv1::__cxa_throw (obj=0x7ffde5f68b00, tinfo=0x7ffff79a0650 <typeinfo for edm::Exception>, dest=0x7ffff796a010 <edm::Exception::~Exception()>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007ffff7b7e0b2 in throwInvalidRefFromNullOrInvalidRef(edm::TypeID const&) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libDataFormatsCommon.so
#2  0x00007ffff7b7ed6f in edm::RefCore::tryToGetProductPtr(std::type_info const&, edm::EDProductGetter const*) const [clone .cold] ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libDataFormatsCommon.so
#3  0x00007fffa557aa1a in reco::Track::recHitsBegin() const ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so
#4  0x00007fffa55bd779 in SingleLongTrackProducer::produce(edm::Event&, edm::EventSetup const&) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so
#5  0x00007ffff7e483c1 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libFWCoreFramework.so
#6  0x00007ffff7e2c04e in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libFWCoreFramework.so
#7  0x00007ffff7db9159 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libFWCoreFramework.so
#8  0x00007ffff7db96c4 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libFWCoreFramework.so
#9  0x00007ffff7f3af28 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libFWCoreConcurrency.so
#10 0x00007ffff6f1091b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7ffeafe74400, waiter=..., this=0x7ffff41c3b00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#11 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7ffff41c3b00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#12 tbb::detail::r1::arena::process (tls=..., this=<optimized out>)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/arena.cpp:137
#13 tbb::detail::r1::market::process (this=<optimized out>, j=...)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/market.cpp:599
#14 0x00007ffff6f12ace in tbb::detail::r1::rml::private_worker::run (this=0x7ffff2486f00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#15 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7ffff2486f00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#16 0x00007ffff5a89c02 in start_thread () from /lib64/libc.so.6
#17 0x00007ffff5b0ec40 in clone3 () from /lib64/libc.so.6

Which is caught here https://github.com/cms-sw/cmssw/blob/dbbd44f6792e61b79f46b7f9974eec7cf8e3024b/RecoTracker/FinalTrackSelectors/plugins/SingleLongTrackProducer.cc#L158-L173

which is problematic as the tracks are the generalTracks which are being made in this job and SHOULD have accessible hits!

Dr15Jones commented 6 months ago

assign tracking

Dr15Jones commented 6 months ago

The next group of exceptions come from

#0  0x00007ffff5b9d2f1 in __cxxabiv1::__cxa_throw (obj=0x7ffdca082400, tinfo=0x7ffff79a5628 <typeinfo for cms::Exception>, dest=0x7ffff796ee30 <cms::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007fffc37f8a8d in PerigeeConversions::ftsToPerigeeParameters(FreeTrajectoryState const&, Point3DBase<float, GlobalTag> const&, double&) [clone .cold] ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libTrackingToolsTrajectoryState.so
#2  0x00007fffc3806a5a in TrajectoryStateClosestToPoint::TrajectoryStateClosestToPoint(FreeTrajectoryState const&, Point3DBase<float, GlobalTag> const&) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libTrackingToolsTrajectoryState.so
#3  0x00007fffc38725a5 in TSCPBuilderNoMaterial::operator()(TrajectoryStateOnSurface const&, Point3DBase<float, GlobalTag> const&) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libTrackingToolsPatternTools.so
#4  0x00007fffbe679dd2 in PerigeeLinearizedTrackState::computeJacobians() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexVertexTools.so
#5  0x00007fffbe67a456 in PerigeeLinearizedTrackState::isValid() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexVertexTools.so
#6  0x00007fffbc5ac58f in KalmanVertexUpdator<5u>::positionUpdate(VertexState const&, ReferenceCountingPointer<LinearizedTrackState<5u> >, float, int) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#7  0x00007fffbc5ae20d in KalmanVertexUpdator<5u>::update(CachingVertex<5u> const&, ReferenceCountingPointer<VertexTrack<5u> >, float, int) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#8  0x00007fffbc5ae89a in KalmanVertexUpdator<5u>::add(CachingVertex<5u> const&, ReferenceCountingPointer<VertexTrack<5u> >) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#9  0x00007fffbc5ae90d in KalmanVertexTrackCompatibilityEstimator<5u>::estimateNFittedTrack(CachingVertex<5u> const&, ReferenceCountingPointer<VertexTrack<5u> >) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#10 0x00007fffbc5b023f in KalmanVertexTrackCompatibilityEstimator<5u>::estimate(CachingVertex<5u> const&, ReferenceCountingPointer<VertexTrack<5u> >, unsigned int) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#11 0x00007fffbc5aa80e in KalmanVertexTrackCompatibilityEstimator<5u>::estimate(CachingVertex<5u> const&, ReferenceCountingPointer<LinearizedTrackState<5u> >, unsigned int) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#12 0x00007fffbc5d101c in AdaptiveVertexFitter::reWeightTracks(std::vector<ReferenceCountingPointer<LinearizedTrackState<5u> >, std::allocator<ReferenceCountingPointer<LinearizedTrackState<5u> > > > const&, CachingVertex<5u> const&) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexAdaptiveVertexFit.so
#13 0x00007fffbc5d1e65 in AdaptiveVertexFitter::reWeightTracks(std::vector<ReferenceCountingPointer<VertexTrack<5u> >, std::allocator<ReferenceCountingPointer<VertexTrack<5u> > > > const&, CachingVertex<5u> const&) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexAdaptiveVertexFit.so
#14 0x00007fffbc5d32ed in AdaptiveVertexFitter::fit(std::vector<ReferenceCountingPointer<VertexTrack<5u> >, std::allocator<ReferenceCountingPointer<VertexTrack<5u> > > > const&, VertexState const&, bool) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexAdaptiveVertexFit.so
#15 0x00007fffbc5d46e1 in AdaptiveVertexFitter::vertex(std::vector<reco::TransientTrack, std::allocator<reco::TransientTrack> > const&, Point3DBase<float, GlobalTag> const&) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexAdaptiveVertexFit.so
#16 0x00007fff4035710a in TemplatedInclusiveVertexFinder<edm::View<reco::Candidate>, reco::VertexCompositePtrCandidate>::produce(edm::Event&, edm::EventSetup const&) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginRecoVertexAdaptiveVertexFinderPlugins.so
#17 0x00007ffff7ce1e91 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so

the exception originates here

https://github.com/cms-sw/cmssw/blob/dbbd44f6792e61b79f46b7f9974eec7cf8e3024b/TrackingTools/TrajectoryState/src/PerigeeConversions.cc#L15-L16

and is caught here

https://github.com/cms-sw/cmssw/blob/dbbd44f6792e61b79f46b7f9974eec7cf8e3024b/TrackingTools/TrajectoryState/src/TrajectoryStateClosestToPoint.cc#L8-L23

Dr15Jones commented 6 months ago

assign reconstruction

cmsbuild commented 6 months ago

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 6 months ago

By skipping the first events, I was able to get to the trackback for the exception which ultimately ended the job

#0  0x00007ffff5b9d2f1 in __cxxabiv1::__cxa_throw (obj=0x7ffe9579d1a0, tinfo=0x7ffff5d03190 <typeinfo for std::length_error>, dest=0x7ffff5bb2220 <std::length_error::~length_error()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007ffff5b942d9 in std::__throw_length_error(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib64/libstdc++.so.6
#2  0x00007fffc38c8346 in ROOT::Detail::TCollectionProxyInfo::Pushback<std::vector<unsigned char, std::allocator<unsigned char> > >::resize(void*, unsigned long) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libDataFormatsStdDictionaries.so
#3  0x00007ffff7193701 in void TGenCollectionStreamer::ReadBufferVectorPrimitives<unsigned char>(TBuffer&, void*, TClass const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#4  0x00007ffff7110e09 in TBufferFile::ReadFastArray(void*, TClass const*, int, TMemberStreamer*, TClass const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#5  0x00007ffff735e073 in int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#6  0x00007ffff7211e4c in TStreamerInfoActions::VectorLooper::GenericRead(TBuffer&, void*, void const*, TStreamerInfoActions::TLoopConfiguration const*, TStreamerInfoActions::TConfiguration const*) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#7  0x00007ffff710f5fc in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*, void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#8  0x00007ffff725f38f in int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#9  0x00007ffff7117eae in TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#10 0x00007ffff735cdcc in int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#11 0x00007ffff71de94d in TStreamerInfoActions::GenericReadAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#12 0x00007ffff710fbb5 in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#13 0x00007ffff7873b87 in TBranchElement::ReadLeavesMember(TBuffer&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libTree.so
#14 0x00007ffff786c429 in TBranch::GetEntry(long long, int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libTree.so
#15 0x00007ffff787ed44 in TBranchElement::GetEntry(long long, int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libTree.so
#16 0x00007ffff787ecfd in TBranchElement::GetEntry(long long, int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libTree.so
#17 0x00007fff9d66585c in edm::RootTree::getEntry(TBranch*, long long) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolInput.so
#18 0x00007fff9d64639c in edm::RootDelayedReader::getProduct_(edm::BranchID const&, edm::EDProductGetter const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolInput.so
#19 0x00007ffff7bc111f in edm::DelayedReader::getProduct(edm::BranchID const&, edm::EDProductGetter const*, edm::ModuleCallingContext const*) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x00007ffff7c6a35b in edm::DelayedReaderInputProductResolver::prefetchAsync_(edm::WaitingTaskHolder, edm::Principal const&, bool, edm::ServiceToken const&, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x00007ffff7c6b7cc in edm::DelayedReaderInputProductResolver::prefetchAsync_(edm::WaitingTaskHolder, edm::Principal const&, bool, edm::ServiceToken const&, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const::{lambda()#1}::operator()() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#22 0x00007ffff7c6b918 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::DelayedReaderInputProductResolver::prefetchAsync_(edm::WaitingTaskHolder, edm::Principal const&, bool, edm::ServiceToken const&, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::DelayedReaderInputProductResolver::prefetchAsync_(edm::WaitingTaskHolder, edm::Principal const&, bool, edm::ServiceToken const&, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#23 0x00007ffff7e031d0 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#24 0x00007ffff63fe95b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7fff08c3ec00, waiter=..., this=0x7ffff3963b00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#25 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7ffff3963b00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#26 tbb::detail::r1::arena::process (tls=..., this=<optimized out>)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:137
#27 tbb::detail::r1::market::process (this=<optimized out>, j=...)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/market.cpp:599
#28 0x00007ffff6400b0e in tbb::detail::r1::rml::private_worker::run (this=0x7ffff17e9100)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#29 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7ffff17e9100)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#30 0x00007ffff55341ca in start_thread () from /lib64/libpthread.so.0
#31 0x00007ffff518f8d3 in clone () from /lib64/libc.so.6
Dr15Jones commented 6 months ago

assign root

Dr15Jones commented 6 months ago

@pcanal how can we understand better what happened during the read?

Dr15Jones commented 6 months ago

type root

Dr15Jones commented 6 months ago

type tracking

slava77 commented 6 months ago

Which is caught here

https://github.com/cms-sw/cmssw/blob/dbbd44f6792e61b79f46b7f9974eec7cf8e3024b/RecoTracker/FinalTrackSelectors/plugins/SingleLongTrackProducer.cc#L158-L173

that's just looks like a poorly written code, where try/catch is used instead of checking for trackExtra to be present. Tracks are apparently not pure generalTracks, see https://github.com/cms-sw/cmssw/blob/dbbd44f6792e61b79f46b7f9974eec7cf8e3024b/RecoTracker/FinalTrackSelectors/plugins/SingleLongTrackProducer.cc#L133-L136

a proper copy is made conditionally, while the rest in selTracks is going to be default-constructed reco::Tracks

slava77 commented 6 months ago

@borzari please check https://github.com/cms-sw/cmssw/issues/45162#issuecomment-2153549462 to possibly remove the try/catch pattern related to just acces to track.extra in the track.recHitsBegin() call. It should be a combination of validity checks for extra() and then extra()->recHitsProduct(); by checking isNonnull() && isAvailable() for each, sequentially. This could even be packed into a new helper method ,e.g. bool reco::Track::recHitsOk()

Please clarify if you are available to check this. Thank you.

borzari commented 5 months ago

Hi @slava77

I applied what you suggested in this commit, used the opportunity to remove some duplicated code, and tested it with RelValZMM and RelValTTbar events by comparing the version with try/catch results with the version with the validity check results. Everything worked as intended and no changes to the output were observed, as expected.

Just to clarify two points:

slava77 commented 5 months ago

I couldn't check the validity of the recHitsProduct(). There doesn't seem to be something similar to isNonnull() or isAvailable() for it. However, just checking track.extra() seemed enough. Was it supposed to be like this? Am I missing something about the recHitsProduct()?

I misread the TrackExtraBase; edm::RefCore m_hitCollection; is the one that has isNonnull() and isAvailable(), but it is not publicly exposed.

So, I would add this bool recHitsOk() const {return m_hitCollection.isNonnull() && m_hitCollection.isAvailable();} in TrackExtraBase.h And then in Track.h bool recHitsOk() const {return extra_.isNonnull() && extra_.isAvailable() && extra_->recHitsOk();}

Even though in the current setup a track without an extra is enough, there can still be cases where SingleLongTrackProducer uses input tracks where hits got dropped.

mmusich commented 5 months ago

Tracks are apparently not pure generalTracks, see https://github.com/cms-sw/cmssw/blob/dbbd44f6792e61b79f46b7f9974eec7cf8e3024b/RecoTracker/FinalTrackSelectors/plugins/SingleLongTrackProducer.cc#L133-L136 a proper copy is made conditionally, while the rest in selTracks is going to be default-constructed reco::Tracks

Out of curiosity why is that? Can't the selTracks just contain the tracks we can actually refit?

borzari commented 5 months ago

Tracks are apparently not pure generalTracks, see https://github.com/cms-sw/cmssw/blob/dbbd44f6792e61b79f46b7f9974eec7cf8e3024b/RecoTracker/FinalTrackSelectors/plugins/SingleLongTrackProducer.cc#L133-L136

a proper copy is made conditionally, while the rest in selTracks is going to be default-constructed reco::Tracks

Out of curiosity why is that? Can't the selTracks just contain the tracks we can actually refit?

Hi @mmusich The selTracks collection will only have one track, the one with smallest chiNdof. I also want to check if the rechits and hits from the hitpattern are valid to say that it is a goodTrack that can be used for the shortened tracks pT resolution. Specially because of what @slava77 mentioned here:

Even though in the current setup a track without an extra is enough, there can still be cases where SingleLongTrackProducer uses input tracks where hits got dropped.

The hit checks are to make sure that this track won't have missing layers with measurement, which is not 100% effective as I already showed during the presentations about this topic, but also doesn't impact a lot on the final result because it doesn't happen so often. I wouldn't think changing that part of the code for selTracks to only have tracks that can be refitted to have a large impact on what is going on in the SingleLongTrackProducer or after it, unless it is an extra "safety check" that can be included.

Here I added the suggestions from @slava77. Again, I tested with RelValZMM and RelValTTbar events, and things are working as expected. If you don't have other suggestions, I can open a PR with it and we can continue the discussion there

mmusich commented 5 months ago

@borzari

also want to check if the rechits and hits from the hitpattern are valid to say that it is a goodTrack that can be used for the shortened tracks pT resolution.

Exactly, can't you do that before filling the vector? Default constructed tracks can't be used for refit.

borzari commented 5 months ago

@borzari

also want to check if the rechits and hits from the hitpattern are valid to say that it is a goodTrack that can be used for the shortened tracks pT resolution.

Exactly, can't you do that before filling the vector? Default constructed tracks can't be used for refit.

Alright, so instead of only getting the track with the smallest chiNdof, I also want it to have recHitsOk(), right?

mmusich commented 5 months ago

I also want it to have recHitsOk(), right?

Right, this is what I had in mind.

borzari commented 5 months ago

I also want it to have recHitsOk(), right?

Right, this is what I had in mind.

It didn't work. If I move the validity check from the rechits/hitpattern check to where I select tracks (I did if (chiNdof < fitProb && track.recHitsOk())), I get the message as if I was not checking the tracks:

----- Begin Fatal Exception 08-Jun-2024 19:16:37 CEST-----------------------
An exception of category 'InvalidReference' occurred while
   [0] Processing  Event run: 1 lumi: 76 event: 7503 stream: 6
   [1] Running path 'dqmoffline_step'
   [2] Calling method for module SingleLongTrackProducer/'SingleLongTrackProducer'
Exception Message:
BadRefCore RefCore: Request to resolve a null or invalid reference to a product of type 'std::vector<reco::TrackExtra>' has been detected.
Please modify the calling code to test validity before dereferencing.
----- End Fatal Exception -------------------------------------------------
mmusich commented 5 months ago

I get the message as if I was not checking the tracks:

Isn't track.recHitsOk() checking that the TrackExtra is valid?

borzari commented 5 months ago

Isn't track.recHitsOk() checking that the TrackExtra is valid?

Should be. I implemented it like Slava suggested here

Could it be that, although I am adding only tracks with valid TrackExtra to selTracks, the framework still needs me to check if I am looking at a valid track (that have TrackExtra) from it to check if it has valid hits/hitpattern? I am not sure how the "not valid TrackExtra" exception works, that is why I am asking

Dr15Jones commented 5 months ago

The check I used was

if (track.extra().isAvailable()) {
borzari commented 5 months ago

The check I used was

if (track.extra().isAvailable()) {

Alright @Dr15Jones, but does it happens every time I am using a reco::Track anywhere?

Well, in any case, I would suggest to open a PR with these changes. At least to remove the try/catch pattern.

mmusich commented 5 months ago

get the message as if I was not checking the tracks:

maybe I am missing something, but with https://github.com/CMSTrackingPOG/cmssw/commit/53185493eae82d7fe8e807e9b266491ea51d06f8 on top of https://github.com/borzari/cmssw/commit/95ecc4bb4aa7e811f1f65025c8f08a23a72cf272 I can run this test:

https://github.com/cms-sw/cmssw/blob/4639c105a21c6934798c543c9a7cae72955c9369/DQM/TrackingMonitorSource/test/BuildFile.xml#L2

(even using the whole input file) without crashes.

borzari commented 5 months ago

@mmusich most probably I was missing something. The main differences I see (besides the better organization of the code in the way you wrote), is that I included track.recHitsOk() here in the condition to select the best track, and instead of using isNonnull() here, I would use the bestTrack.recHitsOk(). Also, and maybe here was my mistake, I removed this condition, which you didn't. That is why I asked @Dr15Jones if the check for the availability for TrackExtra is done every time a reco::Track is being used

borzari commented 5 months ago

@mmusich I started from your branch and tested what I mentioned above:

May I start a PR to include your changes and the recHitsOk() method to CMSSW?

mmusich commented 5 months ago

May I start a PR to include your changes and the recHitsOk() method to CMSSW?

here it is: https://github.com/cms-sw/cmssw/pull/45213. I used the CMSTrackingPOG VO so you should be able to push more commits if necessary.

borzari commented 5 months ago

here it is: #45213. I used the CMSTrackingPOG VO so you should be able to push more commits if necessary.

Great! I don't think there are any other modifications that are needed. Just FYI, I also checked the output DQM histograms of that branch using RelValZMM events and they are the same as before the changes, as expected

slava77 commented 5 months ago

@Dr15Jones most of the recent discussion was about the try/catch: this is now fixed in #45213 . I don't expect though that this would address the underlying cause that lead to this github issue (the promptReco failure). If I understand correctly the fix in tracking is mostly a convenience for debugging using catch throw to not be distracted. Then the actual problem is likely more related to root handling the data.

Is my understanding correct?

Dr15Jones commented 5 months ago

@slava77 I'm on vacation until Thursday. The try/catch fixes were only there to make it easier to get to the underlying problem in the debugger. It does look like the underlying problem is in ROOT.

makortel commented 2 months ago

Coming back to problem itself, in https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381379-parkingsinglemuon4/42082/7 the likely cause was mentioned to be a corrupted file. I suppose there were no further similar failures? Under the file corruption hypothesis, maybe we could just close the issue?

jfernan2 commented 1 month ago

+1 Issue seems solved

cmsbuild commented 1 month ago

This issue is fully signed and ready to be closed.

makortel commented 1 month ago

@cmsbuild, please close