Open makortel opened 2 years ago
assign core,dqm
New categories assigned: core,dqm
@jfernan2,@ahmad3213,@rvenditti,@Dr15Jones,@smuzaffar,@emanueleusai,@makortel,@pbo0,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @makortel Matti Kortelainen.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
In DQM/BeamMonitor/plugins/AlcaBeamMonitor.h
the problem is that per-LuminosityBlock data is stored in class member data instead of the LuminosityBlockCache
(which actually is empty!). For example, the globalBeginLuminosityBlock()
https://github.com/cms-sw/cmssw/blob/cbca23d3e7ae19960c90d6046413c5887c4a3791/DQM/BeamMonitor/plugins/AlcaBeamMonitor.cc#L227-L232
clears an std::vector<reco::VertexCollection> vertices_
and BeamSpotContainer beamSpotsMap_
(which is std::map<std::string, reco::BeamSpot>
).
In short, all lumi-dependent data should be moved into the LuminosityBlockCache
.
The vertices_
is filled in analyze()
https://github.com/cms-sw/cmssw/blob/cbca23d3e7ae19960c90d6046413c5887c4a3791/DQM/BeamMonitor/plugins/AlcaBeamMonitor.cc#L297-L300
and used in globalEndLuminosityBlock()
https://github.com/cms-sw/cmssw/blob/cbca23d3e7ae19960c90d6046413c5887c4a3791/DQM/BeamMonitor/plugins/AlcaBeamMonitor.cc#L363
The problem with this code is perhaps easiest demonstrated with e.g. a following sequence
beginLumi 1
Event 1:1
Event 1:2
beginLumi 2 # resets the `vertices_`
Event 2:1
Event 2:2
Event 1:3
endLumi 1 # vertices_ contains now data from Lumis 1 and 2
beginLumi 3
...
The beamspotsMap_
is filled in globalBeginLuminosityBlock()
, analyze()
, and in globalEndLuminosityBlock()
, and used in analyze()
and globalEndLuminosityBlock()
such that the value filled in globalBeginLuminosityBlock()
is used in analyze()
. The following sequence would lead to incorrect results
beginLumi 1
Event 1:1
Event 1:2
beginLumi 2, IOV of BeamSpotObjectsRcd changes
Event 1:3 # beamspotsMap_["DB"] has BeamSpot from Lumi 2
Event 2:1
endLumi 1 # beamspotsMap_["SC"] contains BeamSpot from Lumi 2; theBeamFitter and thePVFitter contain data from both Lumis(?)
Not related, but what is the aim of this fragment?
https://github.com/cms-sw/cmssw/blob/cbca23d3e7ae19960c90d6046413c5887c4a3791/DQM/BeamMonitor/plugins/AlcaBeamMonitor.cc#L304-L310
Event::getByToken()
does not throw any exception that would make sense for user code to catch. If the aim is to check if the input product exists, that could be done e.g. along
if (Handle<BeamSpot> recoBeamSpotHandle = iEvent.getHandle(scalerLabel_)) {
// product exists
} else {
// product doesn't exist
}
In DQM/BeamMonitor/plugins/OnlineBeamMonitor.h
the issue is only with beamSpotsMap_
, and is similar to the issue with the similarly named member variable in https://github.com/cms-sw/cmssw/issues/37163#issuecomment-1061301312.
In addition, about the other mutable
member variables
https://github.com/cms-sw/cmssw/blob/2105cce19672405d0e1c904d9c2b7c07aa4abde8/DQM/BeamMonitor/plugins/OnlineBeamMonitor.h#L60-L61
the mutable
could be removed from processedLumis_
by filling it in globalEndLuminosityBlock()
, and numberOfProcessedLumis_
appears to be unused.
Dear @makortel @francescobrivio @mmusich , I am working now on this issue (sorry to be taken so long). I have followed the suggestion by @makortel about the lumiCache and I have prepared a new branch with the updated code: https://github.com/MilanoBicocca-pix/cmssw/tree/lumiCacheForBeamSpot I hope to test it asap. I will then also fix the other things discussed in the issue
I should probably resume reviewing the modules listed in the issue description...
The DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc
stores some counters in member data in PotPlots
https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L153
and ChannelPlots
https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L206
nested structs. These counters are incremented in the analyze()
https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L822-L839
https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L1210-L1223
and are then used in globalEndLuminosityBlock()
to fill some histograms
https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L1295-L1301
https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L1305-L1313
In the presence of concurrent lumis these counters may have counts from other lumis than the one the globalEndLuminosityBlock()
is called for. Assuming this is important, the counters should be moved to the LuminosityBlockCache
. Digging in history it looks like these counters were missed in https://github.com/cms-sw/cmssw/pull/25188.
DQM/DTMonitorModule/interface/DTDataIntegrityTask.h
had an issue that I felt was easier to just fix, see https://github.com/cms-sw/cmssw/pull/41874
The DQM/HcalTasks/interface/RecHitTask.h
class has
https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/interface/RecHitTask.h#L120
which is set to false
in bookHistograms()
https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/plugins/RecHitTask.cc#L411
and
https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/plugins/RecHitTask.cc#L414-L417
(AFAICT both are effectively at beginRun)
The variable is set to true
in _process()
(i.e. in event processing) in
https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/plugins/RecHitTask.cc#L596-L600
and
https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/plugins/RecHitTask.cc#L685-L689
It is then used in globalEndLuminosityBlock()
https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/plugins/RecHitTask.cc#L842-L845
With concurrent lumis it can happen that for a lumi 1 the condition to set _unknownIdsPresent=true
does not happen, but in the following lumi 2 it happens, and the _unknownIdsPresent
is set to true
before the globalEndLuminosityBlock()
is called for lumi 1, leading to flag::fBAD
being reported for lumi 1.
This issue follows up my comment in https://github.com/cms-sw/cmssw/issues/25090#issuecomment-1061260133. Below is a list of
DQMOneEDAnalyzer
modules that make use ofedm::LuminosityCache
DQM/BeamMonitor/plugins/AlcaBeamMonitor.h
fixed in https://github.com/cms-sw/cmssw/pull/41700DQM/BeamMonitor/plugins/OnlineBeamMonitor.h
fixed in https://github.com/cms-sw/cmssw/pull/41700DQM/CTPPS/plugins/CTPPSCommonDQMSource.cc
DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc
DQM/CTPPS/plugins/DiamondSampicDQMSource.cc
DQM/CTPPS/plugins/TotemTimingDQMSource.cc
DQM/DTMonitorModule/interface/DTDataIntegrityTask.h
fixed in https://github.com/cms-sw/cmssw/pull/41874DQM/EcalMonitorTasks/interface/EcalDQMonitorTask.h
,DQM/EcalMonitorTasks/plugins/EcalDQMonitorTask.cc
DQM/EcalPreshowerMonitorModule/interface/ESIntegrityTask.h
DQM/HcalCommon/interface/DQTask.h
is a base class, deriving classes areDQM/HcalTasks/interface/HFRaddamTask.h
DQM/HcalTasks/interface/LEDTask.h
DQM/HcalTasks/interface/LaserTask.h
DQM/HcalTasks/interface/NoCQTask.h
DQM/HcalTasks/interface/PedestalTask.h
DQM/HcalTasks/interface/QIE10Task.h
DQM/HcalTasks/interface/QIE11Task.h
DQM/HcalTasks/interface/RawTask.h
DQM/HcalTasks/interface/RecHitTask.h
DQM/HcalTasks/interface/TPTask.h
DQM/HcalTasks/interface/UMNioTask.h
DQM/HcalTasks/plugins/HcalGPUComparisonTask.cc
DQM/L1TMonitor/interface/L1TdeRCT.h
DQM/Physics/src/QcdLowPtDQM.h
DQM/SiPixelMonitorDigi/interface/SiPixelDigiSource.h
DQM/SiStripMonitorDigi/interface/SiStripMonitorDigi.h
DQM/SiStripMonitorHardware/src/SiStripFEDMonitor.cc
DQM/TrigXMonitor/interface/L1Scalers.h
DQMOffline/L1Trigger/interface/L1TSync_Offline.h
On quick look some of them do not seem to be safe wrt concurrent lumis or IOVs. In the comments I'm going to write some notes about them.