cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.3k forks source link

Error in THashTable::Rehash in HLTFiltersDQMonitor #40419

Open makortel opened 1 year ago

makortel commented 1 year ago

Workflow 139.003 step 3 failed in CMSSW_13_0_X_2023-01-03-1100 on el8_amd64_gcc11 with

----- Begin Fatal Exception 03-Jan-2023 14:09:46 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Processing  stream begin Run run: 346512 stream: 0
   [1] Calling method for module HLTFiltersDQMonitor/'hltFiltersDQM'
   Additional Info:
      [a] Fatal Root Error: @SUB=THashTable::Rehash
During the rehash of 0x2b908da53120 one or more element was added or removed. The initalize size was 748 and now it is 749

----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_0_X_2023-01-03-1100/pyRelValMatrixLogs/run/139.003_RunHLTPhy2021/step3_RunHLTPhy2021.log#/

(I think I saw this error also earlier, but couldn't find it now)

cmsbuild commented 1 year ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

assign dqm

FYI @cms-sw/hlt-l2

cmsbuild commented 1 year ago

New categories assigned: dqm

@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

As far as I can tell, the only ROOT operations HLTFiltersDQMonitor::bookHistograms() does are creation of TProfile and calling SetBinLabel() repeatedly for the x TAxis of the profile in three places https://github.com/cms-sw/cmssw/blob/9eb6ddb56d8ead1a0d6a07eac1deafe4d03cb3e6/DQMOffline/Trigger/plugins/HLTFiltersDQMonitor.cc#L148-L160 https://github.com/cms-sw/cmssw/blob/9eb6ddb56d8ead1a0d6a07eac1deafe4d03cb3e6/DQMOffline/Trigger/plugins/HLTFiltersDQMonitor.cc#L174-L188 https://github.com/cms-sw/cmssw/blob/9eb6ddb56d8ead1a0d6a07eac1deafe4d03cb3e6/DQMOffline/Trigger/plugins/HLTFiltersDQMonitor.cc#L223-L240

Poking around in ROOT code I see TAxis::SetBinLabel() can lead to call to THashTable::Rehash() if the label of the given bin was already set https://github.com/cms-sw/root/blob/cms/v6-24-00-patches/b5aa8fd/hist/hist/src/TAxis.cxx#L832-L843

If I recall/understood correctly the DQMStore's behavior on "local MonitorElements" that the module instance sees (described here), all "local MonitorElements" of the module's stream instances use the data of the connected "global MonitorElement" underneath, which means that in a multi-stream job the axis labels are re-set for all but first stream to call HLTFiltersDQMonitor::bookHistograms() (in the log quoted in the issue description the streams 1-3 had started processing events before the exception is printed for stream 0).

The locking in DQMStore::bookTransaction() guarantees that only one call to any bookHistograms() is active at any time, so there should be no concurrent calls to TAxis::SetBinLabel(). But given the log, something that could be happening are concurrent calls to HLTFiltersDQMonitor::bookHistograms() and HLTFiltersDQMonitor::analyze().

The relevant member functions in THashList (and THashTable() that is used underneath) that TAxis uses seem to have proper locking, but I'm not sure if e.g. TAxis::SetBinLabel() and TAxis::FindBin() are safe to be called concurrently.

(or I'm completely wrong of the call chain)

Adding @pcanal

missirol commented 1 year ago

Fwiw, I couldn't reproduce the crash locally yet, and I haven't spotted it in other recent IBs.

Trying to follow the analysis in https://github.com/cms-sw/cmssw/issues/40419#issuecomment-1369993100 ..

makortel commented 1 year ago

While testing locally (trying to insert an artificial delay to stream 0, although I didn't succeed in that yet), I was able to get a segfault

Thread 5 (Thread 0x7f6f847ff700 (LWP 4308) "cmsRun"):
#2  0x00007f6fc9414630 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign (this=0x7f6f1d400738, __str=...) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/slc7_amd64_gcc11/external/gcc/11.2.1-f9b9dfdd886f71cd63f5538223d8f161/gcc-11.2.1/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:254
#5  0x00007f6f886ff1b8 in l1t::L1TGlobalUtil::retrieveL1Event(edm::Event const&, edm::EventSetup const&, edm::EDGetToken) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/libL1TriggerL1TGlobal.so
#6  0x00007f6f79760f57 in HLTPrescaleProvider::prescaleSet(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/libHLTriggerHLTcore.so
#7  0x00007f6f578fd9ea in pat::PATTriggerProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginPhysicsToolsPatAlgos_plugins.so
#8  0x00007f6fd335d13d in edm::stream::EDProducerAdaptorBase::doEvent (this=0x7f6f3ff041e0, info=..., act=0x7f6fc4057810, mcc=mcc@entry=0x7f6f33c9e168) at /build/mkortela/debug/CMSSW_13_0_X_2023-01-03-1100/src/FWCore/Framework/src/stream/EDProducerAdaptorBase.cc:82

Thread 4 (Thread 0x7f6f85928700 (LWP 4307) "cmsRun"):
#2  0x00007f6fc9414630 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f6fd22c9bcd in TListIter::Next() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#5  0x00007f6fc86eb297 in TAxis::SetBinLabel(int, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libHist.so
#6  0x00007f6f70c83138 in HLTFiltersDQMonitor::bookHistograms (this=0x7f6f751e5900, iBooker=..., iRun=..., iSetup=...) at /build/mkortela/debug/CMSSW_13_0_X_2023-01-03-1100/src/DQMOffline/Trigger/plugins/HLTFiltersDQMonitor.cc:158
#7  0x00007f6f70bca008 in DQMEDAnalyzer::beginRun(edm::Run const&, edm::EventSetup const&)::{lambda(dqm::implementation::IBooker&)#1}::operator()(dqm::implementation::IBooker&) const (booker=..., __closure=<synthetic pointer>) at /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/src/DQMServices/Core/interface/DQMEDAnalyzer.h:87
#8  dqm::implementation::DQMStore::bookTransaction<DQMEDAnalyzer::beginRun(edm::Run const&, edm::EventSetup const&)::{lambda(dqm::implementation::IBooker&)#1}>(DQMEDAnalyzer::beginRun(edm::Run const&, edm::EventSetup const&)::{lambda(dqm::implementation::IBooker&)#1}, unsigned long, bool) (canSaveByLumi=<optimized out>, moduleId=144, f=..., this=0x7f6fc252c480) at /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/src/DQMServices/Core/interface/DQMStore.h:705
#9  DQMEDAnalyzer::beginRun (this=0x7f6f751e5900, run=..., setup=...) at /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/src/DQMServices/Core/interface/DQMEDAnalyzer.h:84
#10 0x00007f6fd335e7ad in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::doStreamBeginRun (this=0x7f6f85921b40, id=..., info=..., mcc=0x7f6f712528a8) at /build/mkortela/debug/CMSSW_13_0_X_2023-01-03-1100/src/FWCore/Framework/src/stream/ProducingModuleAdaptorBase.cc:224

Thread 3 (Thread 0x7f6f86329700 (LWP 4306) "cmsRun"):
#2  0x00007f6fc9414630 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f6fd0b3e54d in __lll_lock_wait () from /lib64/libpthread.so.0
#5  0x00007f6fd0b39eb6 in _L_lock_941 () from /lib64/libpthread.so.0
#6  0x00007f6fd0b39daf in pthread_mutex_lock () from /lib64/libpthread.so.0
#7  0x00007f6fca06ad8c in dqm::implementation::DQMStore::enterLumi(unsigned int, unsigned int, unsigned long) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/libDQMServicesCore.so
#8  0x00007f6f710490bd in non-virtual thunk to DQMOneEDAnalyzer<>::accumulate(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginDQMServicesComponentsPlugins.so
#9  0x00007f6fd3355a50 in edm::one::EDProducerBase::doEvent (this=0x7f6f71f32fa0, info=..., act=0x7f6fc4057810, mcc=mcc@entry=0x7f6f33668568) at /build/mkortela/debug/CMSSW_13_0_X_2023-01-03-1100/src/FWCore/Framework/src/one/EDProducerBase.cc:61

Thread 1 (Thread 0x7f6fce99f740 (LWP 4221) "cmsRun"):
#3  0x00007f6fc9417deb in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f6fd22ccafc in TList::Clear(char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#6  0x00007f6fd22c695e in THashTable::Clear(char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#7  0x00007f6fd22c7ddd in THashTable::Rehash(int, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#8  0x00007f6fd22c84bd in THashTable::Add(TObject*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#9  0x00007f6fd22c50bd in THashList::AddLast(TObject*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#10 0x00007f6fc86eb0ed in TAxis::FindBin(char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libHist.so
#11 0x00007f6f70c8165e in HLTFiltersDQMonitor::analyze (this=0x7f6f751e6800, iEvent=..., iSetup=...) at /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/external/gcc/11.2.1-f9b9dfdd886f71cd63f5538223d8f161/include/c++/11.2.1/bits/basic_string.h:194
#12 0x00007f6fd335d13d in edm::stream::EDProducerAdaptorBase::doEvent (this=0x7f6f71251680, info=..., act=0x7f6fc4057810, mcc=mcc@entry=0x7f6f32c460a8) at /build/mkortela/debug/CMSSW_13_0_X_2023-01-03-1100/src/FWCore/Framework/src/stream/EDProducerAdaptorBase.cc:82
#13 0x00007f6fd334167f in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo (this=0x7f6f32c46080, info=..., mcc=0x7f6f32c460a8) at /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/external/gcc/11.2.1-f9b9dfdd886f71cd63f5538223d8f161/include/c++/11.2.1/bits/shared_ptr_base.h:1295

Current Modules:
Module: HLTFiltersDQMonitor:hltFiltersDQM (crashed)
Module: HLTFiltersDQMonitor:hltFiltersDQM
Module: DQMEventInfo:dqmEnvHLT
Module: PATTriggerProducer:patTrigger

which demonstrates that somehow in this case TAxis::FindBin() (via analyze()) ends up in adding a new element, which leads to call to THashTable::Rehash() (via bookHistograms()), which then has a race condition with the iteration over labels in TAxis::SetBinLabel().

pcanal commented 1 year ago

but I'm not sure if e.g. TAxis::SetBinLabel() and TAxis::FindBin() are safe to be called concurrently.

I don't recall that we made it so. @lmoneta can you confirm whether this is the case or not?

makortel commented 1 year ago
  • one way to improve the behaviour of the client might be to (1) do something like
    if(meMenu_->getTProfile()->GetXaxis()->GetBinLabel(idx + 1) != triggerNames.at(idx).c_str());
     meMenu_->getTProfile()->GetXaxis()->SetBinLabel(idx + 1, triggerNames.at(idx).c_str());

    in bookHistograms

A better way would be to move the SetBinLabel() to be called within the booking, i.e. something along

meMenu_ = iBooker.bookProfile(efficPlotNamePrefix_ + hltMenuName,
                                "Path Efficiency",
                                triggerNames.size(),
                                0.,
                                triggerNames.size(),
                                -0.1,
                                1.1,
                                "",
                                [&triggerNames](TProfile *profile) {
                                  for (size_t idx = 0; idx < triggerNames.size(); ++idx) {
                                    profile->GetXaxis()->SetBinLabel(idx + 1, triggerNames.[idx].c_str());
                                  }
                                });

Then the DQMStore would call it only when a new TProfile object is created.

makortel commented 1 year ago

It seems to me that one general issue that has not been fully addressed yet is the DQM infrastructure allowing access to bare ROOT objects (which is the case here). I see some earlier discussion in https://github.com/cms-sw/cmssw/issues/25106 and in https://github.com/cms-sw/cmssw/issues/29096.

Maybe now would be a good time to finally move away from that?

dan131riley commented 1 year ago

(I think I saw this error also earlier, but couldn't find it now)

Seen around a month ago,

Begin processing the 2nd record. Run 346512, Event 272599352, LumiSection 280 on stream 0 at 28-Nov-2022 05:55:48.212 CET
----- Begin Fatal Exception 28-Nov-2022 05:55:48 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Processing  stream begin Run run: 346512 stream: 2
   [1] Calling method for module HLTFiltersDQMonitor/'hltFiltersDQM'
   Additional Info:
      [a] Fatal Root Error: @SUB=THashTable::Rehash
During the rehash of 0x4001a2ff9e40 one or more element was added or removed. The initalize size was 756 and now it is 757

----- End Fatal Exception -------------------------------------------------

Same run #, not sure if it is the same WF.

missirol commented 1 year ago

@makortel, thanks for the suggestions in https://github.com/cms-sw/cmssw/issues/40419#issuecomment-1370151052 and https://github.com/cms-sw/cmssw/issues/40419#issuecomment-1370156063. I tried to implement a fix in #40426.

missirol commented 1 year ago

As far as I know, #40426 fixed this problem.

I wonder if it should be backported, though (for example, to 12_6_X, which will be used online in the near future).

perrotta commented 1 year ago

As far as I know, #40426 fixed this problem.

I wonder if it should be backported, though (for example, to 12_6_X, which will be used online in the near future).

If it can help the few MWGR's in which 12_6_X will be used, why not?

missirol commented 1 year ago

Okay. Any reason to consider even lower cycles (12_4, 12_5) ?

perrotta commented 1 year ago

Okay. Any reason to consider even lower cycles (12_4, 12_5) ?

I don't think so, but I'd let it up to you

missirol commented 1 year ago

Okay. #40536 is the 12_6_X backport. If others think other backports are useful, it can be discussed here.

missirol commented 1 year ago

The proposed fix was integrated in 13_0_X (#40426) and 12_6_X (#40536). Unless there are doubts about the fix, or requests for other backports, this issue can be closed.

makortel commented 1 year ago

It seems to me that one general issue that has not been fully addressed yet is the DQM infrastructure allowing access to bare ROOT objects (which is the case here). I see some earlier discussion in #25106 and in #29096.

Maybe now would be a good time to finally move away from that?

I think we should follow up with the bare ROOT object access from MonitorElement, but probably in a separate issue. @cms-sw/dqm-l2 @cms-sw/orp-l2 what do you think?

missirol commented 1 year ago

It seems to me that one general issue that has not been fully addressed yet is the DQM infrastructure allowing access to bare ROOT objects (which is the case here). I see some earlier discussion in #25106 and in #29096. Maybe now would be a good time to finally move away from that?

I think we should follow up with the bare ROOT object access from MonitorElement, but probably in a separate issue. @cms-sw/dqm-l2 @cms-sw/orp-l2 what do you think?

@cms-sw/dqm-l2

Based on https://github.com/cms-sw/cmssw/issues/40676#issuecomment-1428768169 , it's probably time to open a dedicated issue about deprecating/forbidding access to bare ROOT objects via MonitorElement. And then this issue could also be closed (?).