Closed: tvami closed this issue 2 years ago.
assign dqm
New categories assigned: dqm
@jfernan2,@ahmad3213,@rvenditti,@emanueleusai,@pbo0,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @tvami Tamas Vami.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
I guess I could also add that I see a lot of msg like
```
%MSG-w BeamFitter: AlcaBeamMonitor:AlcaBeamMonitor@endLumi 19-Oct-2021 10:36:02 CEST Run: 345687 Lumi: 28
No event read! No Fitting!
%MSG
```
According to the log on: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2289.html the module which is crashing (or am I confused by the log?) is: Module: RawTask:rawTask (crashed) which is defined in: DQM/HcalTasks/interface/RawTask.h
So, my first suspect is this commit: https://github.com/cms-sw/cmssw/commit/6e69bcf2e0450af08d75aacdd287c3c17ffc6e5c#diff-016dac97270a9f821146f1454911ea7cc5841dea61c3b3d582ce6df26aa47ddc I do not know why it may show up now... @lwang046 if you can have a look too, we may find the reason faster :-)
However, for the moment I did not reach the point of the crash with the instructions for the second crash above.
It is supposed to happen after LS 6. For the record, I also had reproducibility issues, but on the second or third attempt I managed to reproduce the crash.
The first crash, on the other hand, stops in: Module: EcalDQMonitorTask:ecalMonitorTask (crashed), defined in: DQM/EcalMonitorTasks/interface/EcalDQMonitorTask.h
So, we may be facing two different modules crashing when processing concurrent LS, I would point to this commit: https://github.com/cms-sw/cmssw/commit/9c1687e5e266d3887a118ef2596b365da80b4dca#diff-d0d3ee9239e421e02c7b49576a0ffd63979d2709ad81372f3652f2d915db1227 @tanmaymudholkar any idea?
A question: is this the first time in a long time that we are processing concurrent lumisections centrally? Because the error is produced at this exact point related to concurrent LS: https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/src/DQMStore.cc#L504-L514
maybe @makortel knows?
Concurrent lumis have been used in RelVals since 11_3_0_pre1 when they were enabled by default in multithreaded jobs (in principle at least, then if all RelVal jobs would process only one lumi per job that would lead to the feature effectively not being tested).
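The parenthetical above can be sketched as a toy calculation (our own simplification, not framework code): however the framework is configured, a job can never have more lumis in flight than it processes in total, so one-lumi RelVal jobs leave the feature untested.

```python
# Toy illustration of why one-lumi RelVal jobs never exercise concurrent
# lumis even when the feature is on. "effective_concurrent_lumis" is our
# own name, not a CMSSW function.
def effective_concurrent_lumis(configured_concurrent_lumis: int,
                               lumis_in_job: int) -> int:
    # A job cannot have more lumis in flight than it processes in total.
    return min(configured_concurrent_lumis, lumis_in_job)

print(effective_concurrent_lumis(2, 1))   # -> 1: one-lumi job, feature idle
print(effective_concurrent_lumis(2, 28))  # -> 2: the 28-lumi file can overlap
```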
From the log, the job failed because of an assert failure
```
cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_2-slc7_amd64_gcc900/build/CMSSW_12_0_2-build/tmp/BUILDROOT/2faca74e5639e00b55cc90eec0352f77/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_2/src/DQMServices/Core/src/DQMStore.cc:512: dqm::implementation::DQMStore::leaveLumi(edm::RunNumber_t, edm::LuminosityBlockNumber_t, uint64_t)::<lambda(MonitorElementData::Scope)>: Assertion `!"Impossible Scope."' failed.
```
> Concurrent lumis have been used in RelVals since 11_3_0_pre1 when they were enabled by default in multithreaded jobs (in principle at least, then if all RelVal jobs would process only one lumi per job that would lead to the feature effectively not being tested).
Yes, it may be the case: this would be the first run processed centrally in an effectively concurrent-LS way. I am trying to debug in those exact lines from DQMStore, printing the exact point where it crashes. It looks like several modules (in a random way) produce the crash, but always when `run 345687 lumi 0 scope 2`.
In gdb I see an offending MonitorElement has `run_ = 0, luminosityBlock_ = 0}, scope_ = 0, kind_ = MonitorElementData::Kind::INVALID` with path `JetMET/MET/pfMETT1/Cleaned/METSig`, with the crash coming from `METAnalyzer:pfMetT1DQMAnalyzer`.
The file being used has no events
```
080fd1ac-818f-4a2d-a10c-a55415610c20.root (1 runs, 28 lumis, 0 events, 110098 bytes)
Printing IndexIntoFile contents. This includes a list of all Runs, LuminosityBlocks
and Events stored in the root file.

Run     Lumi  Event  TTree Entry
345687  0     0      0   (Run)
345687  1     0      0   (Lumi)
345687  2     0      1   (Lumi)
345687  3     0      2   (Lumi)
345687  4     0      3   (Lumi)
345687  5     0      4   (Lumi)
345687  6     0      5   (Lumi)
345687  7     0      6   (Lumi)
345687  8     0      7   (Lumi)
345687  9     0      8   (Lumi)
345687  10    0      9   (Lumi)
345687  11    0      10  (Lumi)
345687  12    0      11  (Lumi)
345687  13    0      12  (Lumi)
345687  14    0      13  (Lumi)
345687  15    0      14  (Lumi)
345687  16    0      15  (Lumi)
345687  17    0      16  (Lumi)
345687  18    0      17  (Lumi)
345687  19    0      18  (Lumi)
345687  20    0      19  (Lumi)
345687  21    0      20  (Lumi)
345687  22    0      21  (Lumi)
345687  23    0      22  (Lumi)
345687  24    0      23  (Lumi)
345687  25    0      24  (Lumi)
345687  26    0      25  (Lumi)
345687  27    0      26  (Lumi)
345687  28    0      27  (Lumi)
```
and therefore concurrent lumis are going to be happening. When there are events in the input file, the overlaps of lumis would probably be much less frequent.
The `METSig` is a lumi-scoped histogram:
https://github.com/cms-sw/cmssw/blob/596662e79c88655c56891219ecaf687fc136953d/DQMOffline/JetMET/src/METAnalyzer.cc#L343-L346
The assert is firing from `DQMOneEDAnalyzer<>::endRunProduce()`.
Thanks a lot @makortel and @Dr15Jones! As expected, you were much faster than me at finding the issue. However, do you always get the same crash from METAnalyzer? Because I have also seen crashes from DQM/EcalMonitorTasks/interface/EcalDQMonitorTask.h and DQM/HcalTasks/interface/RawTask.h, as I pointed out.
> In gdb I see an offending MonitorElement has `run_ = 0, luminosityBlock_ = 0}, scope_ = 0, kind_ = MonitorElementData::Kind::INVALID` with path `JetMET/MET/pfMETT1/Cleaned/METSig`, with the crash coming from `METAnalyzer:pfMetT1DQMAnalyzer`.
Over the several runs I see that the crash for the 28 lumis also happens for MonitorElements with name & path `UnknownIds` & `Hcal/RecHitTask/` and `UnknownIds` & `Hcal/TPTask/`, in addition to `METSig` & `JetMET/MET/caloMet/Cleaned/`, due to a bogus `MonitorElement->getScope()` value like -1867361216.
@makortel @Dr15Jones do you think we should implement a check to require >0 events at the merging level? Would that help?
@makortel @Dr15Jones unless you have a smart idea, I don't see an easy way to debug all the offending modules quickly. I wonder if there is any option in the framework to skip files with no events at the processing level, which may give us some time.
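One hypothetical way to buy time would be to pre-filter the input list using the summary line that `edmFileUtil` prints for each file (the same format as the listing earlier in this thread). The helper below and its name are our own sketch, not an existing framework option.

```python
import re

def has_events(edm_file_util_line: str) -> bool:
    """Return True if an edmFileUtil summary line like
    'file.root (1 runs, 28 lumis, 0 events, 110098 bytes)'
    reports at least one event."""
    m = re.search(r"(\d+)\s+events", edm_file_util_line)
    return m is not None and int(m.group(1)) > 0

# The empty file from this issue would be skipped:
line = "080fd1ac-818f-4a2d-a10c-a55415610c20.root (1 runs, 28 lumis, 0 events, 110098 bytes)"
print(has_events(line))  # -> False
```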
If a quick workaround is needed, that would likely be `process.numberOfConcurrentLuminosityBlocks = 1`.
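For reference, the workaround would sit in the job configuration roughly like this (a minimal sketch; the process name and the elided contents are placeholders, only the last line comes from this thread):

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")  # placeholder process name
# ... source, paths, and DQM sequences as in the production config ...

# Hot workaround from this thread: disable concurrent lumisections.
process.numberOfConcurrentLuminosityBlocks = 1
```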
Yes, but wouldn't that be a "waste" of resources?
That would give analysts incorrect answers for any physics data...
> On Oct 19, 2021, at 5:24 PM, rappoccio @.***> wrote:
>
> @makortel @Dr15Jones do you think we should implement a check to require >0 events at the merging level? Would that help?
> If a quick workaround is needed, that would likely be `process.numberOfConcurrentLuminosityBlocks = 1`.
@makortel should this be done as the default option in: https://github.com/cms-sw/cmssw/blob/d68bfe91bc4faa34756931b98318d1f4a72d60e1/FWCore/ParameterSet/python/Config.py#L2053
Or perhaps just by passing the parameter to the current jobs being run by the operations managers? Thanks
Since we're talking about a "hot workaround", I think the highest level that is convenient would be best. E.g. along:

- if cmsDriver is used, use the `--nConcurrentLumis=1` argument
- if it is easy to tweak the job configurations, add `process.numberOfConcurrentLuminosityBlocks = 1` there
- if cmsDriver is not used, but ConfigBuilder is, set the default there: https://github.com/cms-sw/cmssw/blob/0f0430f4415229a0695a84a5c9331636d3525772/Configuration/Applications/python/ConfigBuilder.py#L90
- only if nothing else works, we could consider changing the framework default

You need to change things in the release - one of the last two choices will work.
> On Oct 20, 2021, at 2:43 PM, Matti Kortelainen @.***> wrote:
>
> Since we're talking about a "hot workaround", I think the highest level that is convenient would be best. E.g. along:
>
> - if cmsDriver is used, use the `--nConcurrentLumis=1` argument
> - if it is easy to tweak the job configurations, add `process.numberOfConcurrentLuminosityBlocks = 1` there
> - if cmsDriver is not used, but ConfigBuilder is, set the default there: https://github.com/cms-sw/cmssw/blob/0f0430f4415229a0695a84a5c9331636d3525772/Configuration/Applications/python/ConfigBuilder.py#L90
> - only if nothing else works, we could consider changing the framework default
So I have found the order of operations that causes the problem. (Note: ME is shorthand for MonitorElement.)

1. book histogram injects the ME into DQMStore::globalMEs with run=lumi=0, and the same one into DQMStore::localMEs with run=lumi=0
2. PreGlobalBeginLumi service call for lumi 1
3. PreGlobalBeginLumi service call for lumi 2
4. PostGlobalWriteLumi service call
5. PostGlobalWriteLumi service call, called from DQMOneEDAnalyzer::endRunProduce
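The failure mode described above can be mimicked with a small self-contained toy (not CMSSW code; names and types are simplified): a lumi-scoped ME prototype booked with run=lumi=0 whose scope was never set reaches the scope switch in leaveLumi() and trips the "Impossible Scope" assert.

```python
# Toy model (not CMSSW code) of the sequence above: a MonitorElement
# prototype is booked with run=lumi=0; with two lumis in flight, the
# object handed to leaveLumi() can be the prototype itself, whose scope
# field still holds the INVALID default.
from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    INVALID = 0
    JOB = 1
    RUN = 2
    LUMI = 3

@dataclass
class MonitorElement:
    path: str
    run: int = 0
    lumi: int = 0
    scope: Scope = Scope.INVALID  # prototype starts out unset

def leave_lumi(me: MonitorElement) -> None:
    # Mirrors the switch in DQMStore::leaveLumi(): any scope outside the
    # known enum values trips the "Impossible Scope" assert.
    if me.scope not in (Scope.JOB, Scope.RUN, Scope.LUMI):
        raise AssertionError('!"Impossible Scope."')

proto = MonitorElement("JetMET/MET/pfMETT1/Cleaned/METSig")
try:
    leave_lumi(proto)  # prototype leaked into the per-lumi bookkeeping
except AssertionError as e:
    print("assert fired:", e)
```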
Thanks a lot @Dr15Jones
Since it looks like this only happens for files with no events (as you found out yesterday), and given that other processed runs (with files which DO contain events) have not crashed so far, do you think it is feasible to add some protection/check for this kind of file in PreGlobalBeginLumi or endRunProduce?
Or do you think that any other solution should be found? Thank you!
Thank you very much!
Thanks @Dr15Jones !
So I have what I believe is a better solution in #35756 and #35757.
I have tried both solutions on the setups described in the issue description, and on the last one reported at https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2289/2/1/1/1.html some minutes ago, and I get no crashes.
I trust @Dr15Jones opinion about #35756 and #35757 being better solutions
I have signed the second solution, as discussed. Thank you!!
+1 For the record
This issue is fully signed and ready to be closed.
@Dr15Jones @makortel
Despite the fix you nicely implemented here, it seems we are facing a similar problem: https://cms-talk.web.cern.ch/t/paused-jobs-for-promptreco-ephemeralhltphysics-0-19/13134/3 in this file (and other similar ones with almost no events and LSs...)
The file producing the crash has 28 LS but only 20 events: /eos/cms/tier0/store/data/Run2022C/EphemeralHLTPhysics1/RAW/v1/000/355/831/00000/0a258b68-9081-4f92-9a10-9d44d3d6792e.root
Any idea of what can be happening now? Thank you
@jfernan2 Could you open a new issue? On a first look, ``Assertion `!assertLegacySafe_' failed`` looks like a different problem than the one reported in this issue.
Done @makortel https://github.com/cms-sw/cmssw/issues/38820 Thanks
There seems to be a bug in DQMStore::leaveLumi(), please see https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2289.html for the PromptReco_Run345687_HLTPhysics wf and then https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2289/2.html PromptReco_Run345595_HcalNZS
While the first one (345687) had this special DQM key in for a test, the second run looks very much like a vanilla run, which would mean that this is just a bug that comes up from time to time.
The way to reproduce the first one
and then the second one