Open LinaresToine opened 6 hours ago
cms-bot internal usage
A new Issue was created by @LinaresToine.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction
New categories assigned: reconstruction
@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
@cms-sw/tracking-pog-l2
The report features a paused job with logs and tarballs of a couple of retries in
/eos/user/c/cmst0/public/PausedJobs/HIRun2024A/LongJob/job_2777101
In this case it looked like the event 136410066:93:388171
took very long time (more than 2 days) to process (see https://cms-talk.web.cern.ch/t/paused-promptreco-job-for-run-388171-stuck-for-40h-after-disconnection/66372/2 for more details). In both cases the job was killed while only CkfTrackCandidateMaker:muonSeededTrackCandidatesInOut
was running, with the following stack trace
Thread 9 (Thread 0x147c4d1ff700 (LWP 1098) "cmsRun"):
#0 0x0000147ca6790098 in nanosleep () from /lib64/libc.so.6
#1 0x0000147ca678ff9e in sleep () from /lib64/libc.so.6
#2 0x0000147ca2e201d0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x0000147c28cbec84 in TrajectorySegmentBuilder::addGroup(TempTrajectory const&, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#5 0x0000147c28cbea65 in TrajectorySegmentBuilder::addGroup(TempTrajectory const&, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#6 0x0000147c28cc01ca in TrajectorySegmentBuilder::segments(TrajectoryStateOnSurface) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#7 0x0000147c28cc0a8d in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#8 0x0000147c28cc181a in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#9 0x0000147c28cb5e44 in GroupedCkfTrajectoryBuilder::rebuildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#10 0x0000147c2ce499a8 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so
#11 0x0000147c2ce4a89b in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so
#12 0x0000147ca9458513 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x0000147ca943fa6c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#14 0x0000147ca93c42c9 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#15 0x0000147ca93c47c1 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#16 0x0000147ca9635238 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#17 0x0000147ca878bb3b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x147bf1411200, waiter=..., this=0x147ca51ce780) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#18 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x147ca51ce780) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#19 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/arena.cpp:137
#20 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/market.cpp:599
#21 0x0000147ca878dcee in tbb::detail::r1::rml::private_worker::run (this=0x147ca31d5f00) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#22 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x147ca31d5f00) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#23 0x0000147ca6a661ca in start_thread () from /lib64/libpthread.so.0
#24 0x0000147ca66c18d3 in clone () from /lib64/libc.so.6
A long job from
HIRun2024B
, with a failure in the same module, has also been retried twice. The tarball for such job is in/eos/home-c/cmst0/public/PausedJobs/HIRun2024B/LongJob/36388e0c-9895-41d8-b59f-a0d7393c5508-139-1-logArchive.tar.gz
Also here only CkfTrackCandidateMaker:muonSeededTrackCandidatesInOut
was still running at the time of the kill, with a stack trace of
hread 5 (Thread 0x149ff6fff700 (LWP 1167) "cmsRun"):
#0 0x000014a04fd3d098 in nanosleep () from /lib64/libc.so.6
#1 0x000014a04fd3cf9e in sleep () from /lib64/libc.so.6
#2 0x000014a04c0351d0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x0000149fcf2bac8e in TrajectorySegmentBuilder::addGroup(TempTrajectory const&, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#5 0x0000149fcf2baa65 in TrajectorySegmentBuilder::addGroup(TempTrajectory const&, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#6 0x0000149fcf2bc1ca in TrajectorySegmentBuilder::segments(TrajectoryStateOnSurface) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#7 0x0000149fcf2bca8d in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#8 0x0000149fcf2bd81a in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#9 0x0000149fcf2b1e44 in GroupedCkfTrajectoryBuilder::rebuildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#10 0x0000149fd33fb9a8 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so
#11 0x0000149fd33fc89b in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so
#12 0x000014a0527f7513 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x000014a0527dea6c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libFWCoreFramework.so
#14 0x000014a0527632c9 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libFWCoreFramework.so
#15 0x000014a0527637c1 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libFWCoreFramework.so
#16 0x000014a0524e2238 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#17 0x000014a050ed3b3b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x14a04e432a00, waiter=..., this=0x14a04e5ca700) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#18 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x14a04e5ca700) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#19 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/arena.cpp:137
#20 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/market.cpp:599
#21 0x000014a050ed5cee in tbb::detail::r1::rml::private_worker::run (this=0x14a04c3f2f80) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#22 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x14a04c3f2f80) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#23 0x000014a0500131ca in start_thread () from /lib64/libpthread.so.0
#24 0x000014a04fc6e8d3 in clone () from /lib64/libc.so.6
In this case the long-running event seems to be 171352672:140:388306
.
Hello all,
I am opening this issue for further investigation of the ongoing Long Jobs that have been seen at Tier 0. The original report is in cms-talk, where @makortel suggests an issue with module
CkfTrackCandidateMaker:muonSeededTrackCandidatesInOut
taking too long with some events.The report features a paused job with logs and tarballs of a couple of retries in
/eos/user/c/cmst0/public/PausedJobs/HIRun2024A/LongJob/job_2777101
A long job from
HIRun2024B
, with a failure in the same module, has also been retried twice. The tarball for such job is in/eos/home-c/cmst0/public/PausedJobs/HIRun2024B/LongJob/36388e0c-9895-41d8-b59f-a0d7393c5508-139-1-logArchive.tar.gz
Could an expert please take a look?