cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars 4.28k forks source link

Issues in concurrent Lumi/IOV safety in DQMOneEDAnalyzer<LuminosityBlockCache<...>> #37163

Open makortel opened 2 years ago

makortel commented 2 years ago

This issue follows up my comment in https://github.com/cms-sw/cmssw/issues/25090#issuecomment-1061260133. Below is a list of DQMOneEDAnalyzer modules that make use of edm::LuminosityCache

On quick look some of them do not seem to be safe wrt concurrent lumis or IOVs. In the comments I'm going to write some notes about them.

makortel commented 2 years ago

assign core,dqm

cmsbuild commented 2 years ago

New categories assigned: core,dqm

@jfernan2,@ahmad3213,@rvenditti,@Dr15Jones,@smuzaffar,@emanueleusai,@makortel,@pbo0,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 2 years ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 2 years ago

In DQM/BeamMonitor/plugins/AlcaBeamMonitor.h the problem is that per-LuminosityBlock data is stored in class member data instead of the LuminosityBlockCache (which actually is empty!). For example, the globalBeginLuminosityBlock() https://github.com/cms-sw/cmssw/blob/cbca23d3e7ae19960c90d6046413c5887c4a3791/DQM/BeamMonitor/plugins/AlcaBeamMonitor.cc#L227-L232 clears an std::vector<reco::VertexCollection> vertices_ and BeamSpotContainer beamSpotsMap_ (which is std::map<std::string, reco::BeamSpot> ).

In short, all lumi-dependent data should be moved into the LuminosityBlockCache.


The vertices_ is filled in analyze() https://github.com/cms-sw/cmssw/blob/cbca23d3e7ae19960c90d6046413c5887c4a3791/DQM/BeamMonitor/plugins/AlcaBeamMonitor.cc#L297-L300 and used in globalEndLuminosityBlock() https://github.com/cms-sw/cmssw/blob/cbca23d3e7ae19960c90d6046413c5887c4a3791/DQM/BeamMonitor/plugins/AlcaBeamMonitor.cc#L363

The problem with this code is perhaps easiest demonstrated with e.g. a following sequence

beginLumi 1
Event 1:1
Event 1:2
beginLumi 2 # resets the `vertices_`
Event 2:1 
Event 2:2
Event 1:3
endLumi 1  # vertices_ contains now data from Lumis 1 and 2
beginLumi 3
...

The beamspotsMap_ is filled in globalBeginLuminosityBlock(), analyze(), and in globalEndLuminosityBlock(), and used in analyze() and globalEndLuminosityBlock() such that the value filled in globalBeginLuminosityBlock() is used in analyze(). The following sequence would lead to incorrect results

beginLumi 1
Event 1:1
Event 1:2
beginLumi 2, IOV of BeamSpotObjectsRcd changes
Event 1:3 # beamspotsMap_["DB"] has BeamSpot from Lumi 2
Event 2:1
endLumi 1 # beamspotsMap_["SC"] contains BeamSpot from Lumi 2; theBeamFitter and thePVFitter contain data from both Lumis(?)

Not related, but what is the aim of this fragment? https://github.com/cms-sw/cmssw/blob/cbca23d3e7ae19960c90d6046413c5887c4a3791/DQM/BeamMonitor/plugins/AlcaBeamMonitor.cc#L304-L310 Event::getByToken() does not throw any exception that would make sense for user code to catch. If the aim is to check if the input product exists, that could be done e.g. along

if (Handle<BeamSpot> recoBeamSpotHandle = iEvent.getHandle(scalerLabel_)) {
  // product exists
} else {
  // product doesn't exist
}
makortel commented 2 years ago

In DQM/BeamMonitor/plugins/OnlineBeamMonitor.h the issue is only with beamSpotsMap_, and is similar to the issue with the similarly named member variable in https://github.com/cms-sw/cmssw/issues/37163#issuecomment-1061301312.

In addition, about the other mutable member variables https://github.com/cms-sw/cmssw/blob/2105cce19672405d0e1c904d9c2b7c07aa4abde8/DQM/BeamMonitor/plugins/OnlineBeamMonitor.h#L60-L61 the mutable could be removed from processedLumis_ by filling it in globalEndLuminosityBlock(), and numberOfProcessedLumis_ appears to be unused.

gennai commented 1 year ago

Dear @makortel @francescobrivio @mmusich , I am working now on this issue (sorry to be taken so long). I have followed the suggestion by @makortel about the lumiCache and I have prepared a new branch with the updated code: https://github.com/MilanoBicocca-pix/cmssw/tree/lumiCacheForBeamSpot I hope to test it asap. I will then also fix the other things discussed in the issue

makortel commented 1 year ago

I should probably resume reviewing the modules listed in the issue description...

makortel commented 1 year ago

The DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc stores some counters in member data in PotPlots https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L153 and ChannelPlots https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L206 nested structs. These counters are incremented in the analyze() https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L822-L839 https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L1210-L1223 and are then used in globalEndLuminosityBlock() to fill some histograms https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L1295-L1301 https://github.com/cms-sw/cmssw/blob/b2104010566f44c0bb808f4b0159df0239e4885a/DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc#L1305-L1313

In the presence of concurrent lumis these counters may have counts from other lumis than the one the globalEndLuminosityBlock() is called for. Assuming this is important, the counters should be moved to the LuminosityBlockCache. Digging in history it looks like these counters were missed in https://github.com/cms-sw/cmssw/pull/25188.

makortel commented 1 year ago

DQM/DTMonitorModule/interface/DTDataIntegrityTask.h had an issue that I felt was easier to just fix, see https://github.com/cms-sw/cmssw/pull/41874

makortel commented 1 year ago

The DQM/HcalTasks/interface/RecHitTask.h class has https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/interface/RecHitTask.h#L120 which is set to false in bookHistograms() https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/plugins/RecHitTask.cc#L411 and https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/plugins/RecHitTask.cc#L414-L417 (AFAICT both are effectively at beginRun)

The variable is set to true in _process() (i.e. in event processing) in https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/plugins/RecHitTask.cc#L596-L600 and https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/plugins/RecHitTask.cc#L685-L689

It is then used in globalEndLuminosityBlock() https://github.com/cms-sw/cmssw/blob/57ceb236a55f7133e00d9f38a7c81fd57f3cfd78/DQM/HcalTasks/plugins/RecHitTask.cc#L842-L845

With concurrent lumis it can happen that for a lumi 1 the condition to set _unknownIdsPresent=true does not happen, but in the following lumi 2 it happens, and the _unknownIdsPresent is set to true before the globalEndLuminosityBlock() is called for lumi 1, leading to flag::fBAD being reported for lumi 1.