Possible bug in DQMStore::leaveLumi()

tvami commented 2 years ago

There seems to be a bug in DQMStore::leaveLumi(), please see https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2289.html for the PromptReco_Run345687_HLTPhysics wf and then https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2289/2.html PromptReco_Run345595_HcalNZS

While the first one (345687) had this special DQM key in for a test, the second run seems very much like a vanilla run, which would mean that this is just a bug that comes up time to time.

The way to reproduce the first one

cmsrel CMSSW_12_0_2_patch1
cd CMSSW_12_0_2_patch1/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/NotCRAFT/job_47574/tarball/vocms0314.cern.ch-47574-3-log.tar.gz .
tar -xzvf vocms0314.cern.ch-47574-3-log.tar.gz
cd job/WMTaskSpace/cmsRun1/
cmsRun PSet.py

and then the second one

cmsrel CMSSW_12_0_2_patch1
cd CMSSW_12_0_2_patch1/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/NotCRAFT/job_24031/tarball/e711688c-1d88-43a6-9f84-bae2f2cd2036-0-6-logArchive.tar.gz .
tar -xzvf e711688c-1d88-43a6-9f84-bae2f2cd2036-0-6-logArchive.tar.gz
cd job/WMTaskSpace/cmsRun1/
cmsRun PSet.py

tvami commented 2 years ago

assign dqm

cmsbuild commented 2 years ago

New categories assigned: dqm

@jfernan2,@ahmad3213,@rvenditti,@emanueleusai,@pbo0,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 2 years ago

A new Issue was created by @tvami Tamas Vami.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

tvami commented 2 years ago

I guess I could also add that I see a lot of msg like

%MSG-w BeamFitter:  AlcaBeamMonitor:AlcaBeamMonitor@endLumi  19-Oct-2021 10:36:02 CEST Run: 345687 Lumi: 28
No event read! No Fitting!
%MSG

jfernan2 commented 2 years ago

According to the log on: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2289.html the module which is crashing (or am I confused by the log?) is: Module: RawTask:rawTask (crashed) which is defined in: DQM/HcalTasks/interface/RawTask.h

So, my first suspect is this commit: https://github.com/cms-sw/cmssw/commit/6e69bcf2e0450af08d75aacdd287c3c17ffc6e5c#diff-016dac97270a9f821146f1454911ea7cc5841dea61c3b3d582ce6df26aa47ddc I do not know why it may show up now... @lwang046 if you can have a look too, we may find the reason faster :-)

However, for the moment I did not reach the point of the crash with the instructions for the second crash above.

tvami commented 2 years ago

However, for the moment I did not reach the point of the crash with the instructions for the second crash above.

It supposed to be after LS 6, for the record I also had reproducibility issues, but doing it the second or third time I managed to reproduce the crash

jfernan2 commented 2 years ago

The first crash on the other side, is stopping on: Module: EcalDQMonitorTask:ecalMonitorTask (crashed) defined in: DQM/EcalMonitorTasks/interface/EcalDQMonitorTask.h

So, we may be facing two different modules crashing when processing concurrent LS, I would point to this commit: https://github.com/cms-sw/cmssw/commit/9c1687e5e266d3887a118ef2596b365da80b4dca#diff-d0d3ee9239e421e02c7b49576a0ffd63979d2709ad81372f3652f2d915db1227 @tanmaymudholkar any idea?

jfernan2 commented 2 years ago

A question: is the first time in a long time that we are processing concurrent lumisections centrally? Because the error is produced in this exact point related to concurrent LS: https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/src/DQMStore.cc#L504-L514

francescobrivio commented 2 years ago

A question: is the first time in a long time that we are processing concurrent lumisections centrally? Because the error is produced in this exact point related to concurrent LS: https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/src/DQMStore.cc#L504-L514

maybe @makortel knows?

makortel commented 2 years ago

Concurrent lumis have been used in RelVals since 11_3_0_pre1 when they were enabled by default in multithreaded jobs (in principle at least, then if all RelVal jobs would process only one lumi per job that would lead to the feature effectively not being tested).

Dr15Jones commented 2 years ago

From the log, the job failed because of an assert failure

cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_2-slc7_amd64_gcc900/build/CMSSW_12_0_2-build/tmp/BUILDROOT/2faca74e5639e00b55cc90eec0352f77/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_2/src/DQMServices/Core/src/DQMStore.cc:512: dqm::implementation::DQMStor
e::leaveLumi(edm::RunNumber_t, edm::LuminosityBlockNumber_t, uint64_t)::<lambda(MonitorElementData::Scope)>: Assertion `!"Impossible Scope."' failed.

jfernan2 commented 2 years ago

Concurrent lumis have been used in RelVals since 11_3_0_pre1 when they were enabled by default in multithreaded jobs (in principle at least, then if all RelVal jobs would process only one lumi per job that would lead to the feature effectively not being tested).

Yes, it may be the case, first run processed centrally in an effective concurrent LS way. I am trying to debug in those exact lines from DQMStore printing the right point where it crashes, it looks like several modules (in a random way) produce teh crash but always when:

run 345687 lumi 0 scope 2

makortel commented 2 years ago

In gdb I see an offending MonitorElement has

run_ = 0, luminosityBlock_ = 0}, scope_ = 0, kind_ = MonitorElementData::Kind::INVALID

with a path JetMET/MET/pfMETT1/Cleaned/METSig with crash coming from METAnalyzer:pfMetT1DQMAnalyzer.

Dr15Jones commented 2 years ago

The file being used has no events

080fd1ac-818f-4a2d-a10c-a55415610c20.root (1 runs, 28 lumis, 0 events, 110098 bytes)

Printing IndexIntoFile contents.  This includes a list of all Runs, LuminosityBlocks
and Events stored in the root file.

            Run           Lumi          Event    TTree Entry
         345687              0              0              0 (Run)
         345687              1              0              0 (Lumi)
         345687              2              0              1 (Lumi)
         345687              3              0              2 (Lumi)
         345687              4              0              3 (Lumi)
         345687              5              0              4 (Lumi)
         345687              6              0              5 (Lumi)
         345687              7              0              6 (Lumi)
         345687              8              0              7 (Lumi)
         345687              9              0              8 (Lumi)
         345687             10              0              9 (Lumi)
         345687             11              0             10 (Lumi)
         345687             12              0             11 (Lumi)
         345687             13              0             12 (Lumi)
         345687             14              0             13 (Lumi)
         345687             15              0             14 (Lumi)
         345687             16              0             15 (Lumi)
         345687             17              0             16 (Lumi)
         345687             18              0             17 (Lumi)
         345687             19              0             18 (Lumi)
         345687             20              0             19 (Lumi)
         345687             21              0             20 (Lumi)
         345687             22              0             21 (Lumi)
         345687             23              0             22 (Lumi)
         345687             24              0             23 (Lumi)
         345687             25              0             24 (Lumi)
         345687             26              0             25 (Lumi)
         345687             27              0             26 (Lumi)
         345687             28              0             27 (Lumi)

and therefore concurrent lumis are going to be happenning. When there are events in the input file the overlaps of lumis would probably be much less frequent.

makortel commented 2 years ago

In gdb I see an offending MonitorElement has
run_ = 0, luminosityBlock_ = 0}, scope_ = 0, kind_ = MonitorElementData::Kind::INVALID
with a path JetMET/MET/pfMETT1/Cleaned/METSig with crash coming from METAnalyzer:pfMetT1DQMAnalyzer.

The METSig is a lumi-scoped histogram https://github.com/cms-sw/cmssw/blob/596662e79c88655c56891219ecaf687fc136953d/DQMOffline/JetMET/src/METAnalyzer.cc#L343-L346

The assert is firing from DQMOneEDAnalyzer<>::endRunProduce().

jfernan2 commented 2 years ago

Thanks a lot @makortel and @Dr15Jones ! As expected, you were much faster than me finding the issue. However, do you get always the same crash from METAnalyzer? Because I have also seen crashes from DQM/EcalMonitorTasks/interface/EcalDQMonitorTask.h and DQM/HcalTasks/interface/RawTask.h as I pointed out

pmandrik commented 2 years ago

In gdb I see an offending MonitorElement has
run_ = 0, luminosityBlock_ = 0}, scope_ = 0, kind_ = MonitorElementData::Kind::INVALID
with a path JetMET/MET/pfMETT1/Cleaned/METSig with crash coming from METAnalyzer:pfMetT1DQMAnalyzer.

Over the several runs I see that crash for 28 Lumi happens also for MonitorElements with name & path: UnknownIds & Hcal/RecHitTask/ UnknownIds & Hcal/TPTask/ in addition to METSig & JetMET/MET/caloMet/Cleaned/ due to uncorrected value of MonitorElement->getScope() value like -1867361216

rappoccio commented 2 years ago

@makortel @Dr15Jones do you think we should implement a check to require >0 events at the merging level? Would that help?

jfernan2 commented 2 years ago

@makortel @Dr15Jones unless you have a smart idea, I don't see an easy way to debug all the offedning modules in a fast way. I wonder if there is any option in the FW to skip files with no events at processing level....which may give us some time

makortel commented 2 years ago

If a quick workaround is needed, that would likely be process.numberOfConcurrentLuminosityBlocks = 1.

jfernan2 commented 2 years ago

Yes, but wouldn't that be a "waste" of resources?

davidlange6 commented 2 years ago

That would give analysts Incorrect answers for any physics data...

On Oct 19, 2021, at 5:24 PM, rappoccio @.***> wrote:

@makortel @Dr15Jones do you think we should implement a check to require >0 events at the merging level? Would that help?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

jfernan2 commented 2 years ago

If a quick workaround is needed, that would likely be process.numberOfConcurrentLuminosityBlocks = 1.

@makortel should this be done as default option in: https://github.com/cms-sw/cmssw/blob/d68bfe91bc4faa34756931b98318d1f4a72d60e1/FWCore/ParameterSet/python/Config.py#L2053

or

https://github.com/cms-sw/cmssw/blob/d68bfe91bc4faa34756931b98318d1f4a72d60e1/FWCore/ParameterSet/python/Config.py#L223

Or perhaps just pàssing the parameter to the current Jobs being done by Operation managers? Thanks

makortel commented 2 years ago

Since we're talking about a "hot workaround", I think the highest level that is convenient would be best. E.g. along

if cmsDriver is used, use --nConcurrentLumis=1 argument
if it is easy to tweak the job configurations, add process.numberOfConcurrentLuminosityBlocks = 1 there
if cmsDriver is not used, but ConfigBuilder is, set the default there https://github.com/cms-sw/cmssw/blob/0f0430f4415229a0695a84a5c9331636d3525772/Configuration/Applications/python/ConfigBuilder.py#L90
only if nothing else works, we could consider changing the framework default

davidlange6 commented 2 years ago

You need to change things in the release. - one of the last two choices will work.

On Oct 20, 2021, at 2:43 PM, Matti Kortelainen @.***> wrote:

Since we're talking about a "hot workaround", I think the highest level that is convenient would be best. E.g. along

• if cmsDriver is used, use --nConcurrentLumis=1 argument • if it is easy to tweak the job configurations, add process.numberOfConcurrentLuminosityBlocks = 1 there • if cmsDriver is not used, but ConfigBuilder is, set the default there https://github.com/cms-sw/cmssw/blob/0f0430f4415229a0695a84a5c9331636d3525772/Configuration/Applications/python/ConfigBuilder.py#L90 • only if nothing else works, we could consider changing the framework default — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

Dr15Jones commented 2 years ago

So I have found the order of operations that cause the problem. (Note ME is shorthand for MonitorElement)

book histogram injects ME into DQMStore::globalMEs with run=lumi=0 and the same one into DQMStore::localMEs with run=lumi=0
PreGlobalBeginLumi Service Call for lumi 1
- DQMStore::initLumi called and run=lumi=0 is removed from DQMStore::globalMEs and a new ME using the old metadata as lumi=1 added to DQMStore::globalMEs
PreGlobalBeginLumi Service Call for lumi 2
- DQMStore::initLumi called and new ME created with new metadata and lumi=2 added to DQMStore::globalMEs_
PostGlobalWriteLumi Service Call
- DQMStore::cleanupLumi for lumi 1
  - removes lumi 1 histograms from DQMStore::globalMEs_
  - finds lumi 2 histograms and decided to delete the lumi 1 ME (which is also being used by DQMStore::localMEs_)
PostGlobalWriteLumi Service Call
- DQMStore::cleanupLumi for lumi 2
  - removes lumi 1 histograms from DQMStore::globalMEs_
  - finds no other histograms so moves MEs into DQMStore::globalMEs_ with run=lumi=0
called from DQMOneEDAnalyzer::endRunProduce
- DQMStore::leaveLumi for lumi = 0
  - looks at MEs in DQMStore::localMEs_ where some are deleted so have bad data

jfernan2 commented 2 years ago

Thanks a lot @Dr15Jones

Since it looks that this is only happening for files with no events (as you found out yesterday), given the fact that other processed runs (with files which DO contain events) have not crashed so far, do you see feasible to add some protection/check for this kind of files in PreGlobalBeginLumi or endRunProduce?

Or do you think that any other solution should be found? Thank you!

Dr15Jones commented 2 years ago

See https://github.com/cms-sw/cmssw/pull/35751 and https://github.com/cms-sw/cmssw/pull/35752 for fix.

jfernan2 commented 2 years ago

Thank you very much!

tvami commented 2 years ago

Thanks @Dr15Jones !

Dr15Jones commented 2 years ago

So have what I believe is a better solution in #35756 and #35757.

jfernan2 commented 2 years ago

I have tried both solutions on the setups declared at the issue description, and the last one reported on https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2289/2/1/1/1.html some minutes ago and I get no crashes.

I trust @Dr15Jones opinion about #35756 and #35757 being better solutions

jfernan2 commented 2 years ago

I have signed the second solution, as discussed. Thank you!!

jfernan2 commented 2 years ago

+1 For the records

cmsbuild commented 2 years ago

This issue is fully signed and ready to be closed.

jfernan2 commented 2 years ago

@Dr15Jones @makortel

Despite the fix you nicely implemented here, it seems we are facing a similar problem: https://cms-talk.web.cern.ch/t/paused-jobs-for-promptreco-ephemeralhltphysics-0-19/13134/3 in this file (and other similar ones with almost no events and LSs...)

The file producing the crash has 28LS but 20 only events: /eos/cms/tier0/store/data/Run2022C/EphemeralHLTPhysics1/RAW/v1/000/355/831/00000/0a258b68-9081-4f92-9a10-9d44d3d6792e.root

Any idea of what can be happening now? Thank you

makortel commented 2 years ago

@jfernan2 Could you open a new issue? On a first look Assertion !assertLegacySafe_' failed. looks like a different problem than the one reported in this issue.

jfernan2 commented 2 years ago

Done @makortel https://github.com/cms-sw/cmssw/issues/38820 Thanks

cms-sw / cmssw

Possible bug in DQMStore::leaveLumi() #35726