cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Segmentation violation in CSCMonitorModule during Express job at T0 #45797

Open germanfgv opened 2 weeks ago

germanfgv commented 2 weeks ago

A segmentation violation is affecting a single job of the workflow Express_Run384963_StreamExpress. The crash occurs in module CSCMonitorModule:dqmCSCClient. Here are the stack trace and error message:

Thread 8 (Thread 0x1484addfd700 (LWP 1097) "cmsRun"):
#0  0x00001485053baac1 in poll () from /lib64/libc.so.6
#1  0x0000148501b3443f in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x0000148501ae94bc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x0000148501ae9640 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00001484947afc62 in cscdqm::EventProcessor::processCSC(CSCEventData const&, int, CSCDCCExaminer const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#6  0x00001484947c1f14 in cscdqm::EventProcessor::processDDU(CSCDDUEventData const&, CSCDCCExaminer const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#7  0x00001484947c690a in cscdqm::EventProcessor::processEvent(edm::Event const&, edm::InputTag const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#8  0x00001484947ad629 in cscdqm::Dispatcher::processEvent(edm::Event const&, edm::InputTag const&, cscdqm::HWStandbyType&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#9  0x00001484947defbb in CSCMonitorModule::analyze(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#10 0x00001484947de3f0 in non-virtual thunk to DQMOneEDAnalyzer<>::accumulate(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#11 0x000014850803daae in edm::one::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#12 0x00001485080281fe in edm::WorkerT<edm::one::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x0000148507fba639 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#14 0x0000148507fbb70f in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#15 0x00001485082c5650 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#16 0x000014850738b95b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x1484205f9b00, waiter=..., this=0x148503dc9180) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#17 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x148503dc9180) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#18 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:137
#19 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/market.cpp:599
#20 0x000014850738db0e in tbb::detail::r1::rml::private_worker::run (this=0x148501e9df00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#21 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x148501e9df00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#22 0x00001485056661ca in start_thread () from /lib64/libpthread.so.0
#23 0x00001485052c18d3 in clone () from /lib64/libc.so.6

...
Current Modules:

Module: CSCMonitorModule:dqmCSCClient (crashed)
Module: MkFitProducer:pixelLessStepTrackCandidatesMkFit
Module: TrackingRecoMaterialAnalyser:materialDumperAnalyzer
Module: TrackProducer:pixelLessStepTracks
Module: TrackProducer:pixelPairStepTracks
Module: TrackTfClassifier:lowPtQuadStep
Module: MultiHitFromChi2EDProducer:pixelLessStepHitTriplets
Module: MkFitProducer:initialStepTrackCandidatesMkFitPreSplitting

A fatal system signal has occurred: segmentation violation
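
For triage, the faulting frame in a trace like the one above is the first frame listed after the `<signal handler called>` marker (frame #4 here, so frame #5, `cscdqm::EventProcessor::processCSC`). Below is a minimal sketch of extracting it automatically. It is not part of CMSSW: the helper name `crashing_frame` is hypothetical, and the sample lines are copied from the trace with library paths shortened for readability.

```python
import re

# Matches gdb-style frame lines, with or without an address, e.g.
#   "#5  0x00001484947afc62 in some::function() () from lib.so"
#   "#17 some::function (args) at file.h:458"
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.*)$")


def crashing_frame(trace):
    """Return (frame_number, frame_text) for the first frame listed after
    the '<signal handler called>' marker, or None if there is no marker."""
    seen_handler = False
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # skip thread headers, blank lines, etc.
        number, text = int(match.group(1)), match.group(2)
        if "<signal handler called>" in text:
            seen_handler = True
            continue
        if seen_handler:
            return number, text
    return None


# Sample lines taken from the trace above (library paths shortened).
sample = """\
Thread 8 (Thread 0x1484addfd700 (LWP 1097) "cmsRun"):
#3  0x0000148501ae9640 in sig_dostack_then_abort () from pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00001484947afc62 in cscdqm::EventProcessor::processCSC(CSCEventData const&, int, CSCDCCExaminer const&) () from pluginDQMCSCMonitorModulePlugins.so
#6  0x00001484947c1f14 in cscdqm::EventProcessor::processDDU(CSCDDUEventData const&, CSCDCCExaminer const&) () from pluginDQMCSCMonitorModulePlugins.so
"""

result = crashing_frame(sample)
if result is not None:
    print(f"faulting frame #{result[0]}: {result[1]}")
```

The same function can be run over the full log from the job tarball. Threads that were not interrupted by the signal have no marker, so passing only their frames returns `None`.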

The job ran 4 times, on both AMD and Intel machines, always failing the same way. You can find the logs and a tarball to reproduce the error here:

/eos/user/c/cmst0/public/PausedJobs/Run2024G/dqmCSCClient/job_2218360

Can experts take a look?

Best regards

cmsbuild commented 2 weeks ago

cms-bot internal usage

cmsbuild commented 2 weeks ago

A new Issue was created by @germanfgv.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mandrenguyen commented 2 weeks ago

assign dqm

cmsbuild commented 2 weeks ago

New categories assigned: dqm

@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 2 weeks ago

FYI @cms-sw/csc-dpg-l2

germanfgv commented 2 weeks ago

We have 4 more instances of this issue, now in PromptReco jobs. All of them affect the same run, 384963. You can find the tarball of one of these PromptReco jobs here:

/eos/user/c/cmst0/public/PausedJobs/Run2024G/dqmCSCClient/job_2652227/

mandrenguyen commented 2 weeks ago

@cms-sw/csc-dpg-l2 @cms-sw/dqm-l2 Can someone please have a look?

ptcox commented 2 weeks ago

There’s very likely a corrupt event. I’ve forwarded the mail to Victor Barashko, who’s both the CSC unpacker and DQM expert. I’m on vacation so won’t be looking at it. Tim

mmusich commented 5 days ago

@ptcox

There’s very likely a corrupt event. I’ve forwarded the mail to Victor Barashko, who’s both the CSC unpacker and DQM expert. I’m on vacation so won’t be looking at it.

Do you happen to have any news about this?

ptcox commented 5 days ago

Hi Marco, I'm back from vacation today and somewhat surprised to see no apparent progress. Maybe Victor was on vacation too. I'll remind him. Tim

mmusich commented 5 days ago

I'll remind him.

Thank you, Tim.