cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars 4.28k forks source link

DQM Modules and LuminosityBlock MonitorElements #25106

Open Dr15Jones opened 5 years ago

Dr15Jones commented 5 years ago

The DQMEDAnalyzers were converted to edm::one modules which use the options edm::one::WatchRuns and edm::one::WatchLuminosityBlocks . These options cause the framework to then synchronize on those transition boundaries which severely affects the threading efficiency of the jobs in which they are run.

To make DQMEDAnalyzers thread efficient, I propose first dealing with the LuminosityBlock transitions and then move to dealing with teh Run transitions. This could be done by

  1. convert DQMEDAnalyzer to just be one::DQMEDAnalyzer<>. This would make this class only work with Run based MonitorElements.
  2. make all modules that need LuminosityBlock based MonitorElements explicitly inherit from one::DQMEDAnalyzer<one::DQMLuminosityBlockElements>.
  3. require that all LuminosityBlock based MonitorElements be booked at begin LuminosityBlock
    1. IBooker would change to have the option of setting the Lumi number when creating MonitorElements
    2. at end LuminosityBlock of a module, the MonitorElements of a given module would be moved from begin assigned to the module to global access. This would avoid the need to make a copy.
    3. one::DQMEDAnalyzer<one::DQMLuminosityBlockElements> would be changed to use edm::LuminosityCache to hold onto the MonitorElements booked at begin LuminosityBlock time. This will probably entail changing one::DQMLuminosityBlockElements to take a template argument.
    4. MonitorElement::setLumiFlag would not be accessible to user code as that would be handled directly by the IBooker.

This design change is very similar to what was done with DQMGlobalEDAnalyzer.

The Core group would be willing to assist in doing this code conversion.

cmsbuild commented 5 years ago

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 5 years ago

assign dqm, core

cmsbuild commented 5 years ago

New categories assigned: dqm,core

@Dr15Jones,@smuzaffar,@jfernan2,@andrius-k,@schneiml,@kmaeshima you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 5 years ago

ESIntegrityTask in DQM/EcalPreshowerMonitorModule calls DQMStore::softReset and DMQStore::disableSoftReset based on if it wants some of the MonitorElements to be Lumi based or not (which is chosen by the configuration).

In the regime of concurrently LuminosityBlocks, what behavior should this module have?

Dr15Jones commented 5 years ago

DQTask in DQM/HcalCommon is a base class for many modules in DQM/HcalTasks. Some of these may be creating histograms, while others do not. This may require DQTask to become a template to handle any caching or lumi based histograms.

Dr15Jones commented 5 years ago

As an experiment, I changed DQMEDAnalyzer to just be one::DQMEDAnalyzer<> and this broke 71 modules.

Dr15Jones commented 5 years ago

See #25114 and #25115

Dr15Jones commented 5 years ago

See #25124 , #25125 and #25128

Dr15Jones commented 5 years ago

See #25144 #25145 and #25146

Dr15Jones commented 5 years ago

See #25161

schneiml commented 5 years ago

Hi Chris,

I am not that happy with your proposal since I had a bit different solution in mind. Though I only have an idea in mind and no working code for now, so I can't promise to much.

First, I'd like to switch DQMEDAnalyzer back to edm::stream. Since this migration was done in the past, it should be reasonably easy. But we have to do it soon, before to many modules rely on being edm::one.

Then, use the fact that we have MonitorElement as a wrapper around histograms, so we can add some smartness there. Even if we still allow access to the bare TH1 to some extent, we can e.g. decide to hand out different TH1 depending on the LS (ideally pointing into a LuminosityBlockCache or similar, since I'd like the storage decentralized as well).

The reason why I am not a fan of the re-booking solution is that it requires fairly invasive changes to the subsystem code. I prefer to be able to turn a different granularity of saving by configuration, maybe even globally.

Finally, we need some awareness of the EventSetup and Conditions in DQM, for multi-run harvesting and friends. The booking must be able to depend on the EventSetup so we can handle geometry/topology dependent booking (upgrades etc.), but also we should make sure the booking can not arbitrarily change every LS (so that we can merge over intervals longer than the run, if possible.

Of course, some technical cleanup in the modules does not hurt. But there I'd like to see a test coverage report to have an idea which modules are tested at all, used in Tier0/other workflows, running in online, etc.

Dr15Jones commented 5 years ago

@schneiml Thanks for your for your detailed response. A few things are unclear to me, however.

I assume you still want to avoid the copy of each histogram per stream which was done previously when using edm::stream modules. So if you are planning on sharing the same underlying histograms between different instances of the same edm::stream module?

When processing concurrent LuminosityBlocks, how do you see handling the case where you have a Lumi based histogram, but one edm::stream instance is still working on Lumi # 1 but another one has moved on to Lumi # 2? Clearly, they can not be filling the same underlying ROOT histogram.

schneiml commented 5 years ago

@Dr15Jones yes, I do plan to have only a single backing copy and locks (or a thread-safe TH1, if that ever happens). For the cross-lumi case, I'd like the ME (which will be duplicated for each of the streams) to point to different backing objects, in the spirit of LuminosityBlockCache. Of course, that can be avoided for per-run histograms, but then the same thing applies for run boundaries.

So, it seems to me that fixing this once properly gives us a clean solution for a number of problems, and should still be possible with a trivial migration for "well-behaving" modules. Of course, we'll need to remove a lot of legacy features, but I hope they are not used in too many places.

But then I have not looked at the code for a while, maybe I got overly optimistic...

Dr15Jones commented 5 years ago

See #25188

This was already to go this morning, so I decided to still push it to allow the system to do a more thorough testing.

Dr15Jones commented 5 years ago

For the cross-lumi case, I'd like the ME (which will be duplicated for each of the streams) to point to different backing objects, in the spirit of LuminosityBlockCache. Of course, that can be avoided for per-run histograms, but then the same thing applies for run boundaries.

So you are thinking about the MEs (which are member data of a given DQM module) holding a 'pointer to a pointer to a pointer to a THthing'? The ME's 'pointer to ...' would then point to a memory location held by the edm::stream instance (which is inheriting from stream::DQMEDAnalyzer). That memory location would then point to a piece of memory which holds the pointer of the THthing and the std::mutex used to guard it. Then on ever streamBeginLuminosityBlock the given stream::DQMEDAnalyzer instance would change the pointers it holds to point to new 'pointer to THthing + std::mutex groups specific to that LuminosityBlock, where presumeably it has gotten from the global DQMStore so all the different edm::stream instances will get the same THthing for the same LuminosityBlock.

As for your statement of run boundaries, you already have that covered since the bookHistograms interface is already being called new for each Run which means for edm::stream you would already have the proper behavior without any changes.

schneiml commented 5 years ago

holding a 'pointer to a pointer to a pointer to a THthing'?

As far as I see, the modules usually hold pointers (bare pointers, sadly...) to the MEs, so we already have the required double indirection.

Otherwise, your proposal sounds roughly like what I was thinking about. Where to actually store the THthings does not matter much, can be in the edm::stream module or the DQMStore, as long as we track ownership correctly. What does matter is that some central entity (DQMStore) has pointers to all of the histograms, so we can still do "global" operations. Though, actually I don't currently see a need for any -- edm-based saving could be handled by Products, and the old DQMStore::save (that we will have to provide for various special cases) should only work on the MEs of the current module. And snapshots each n lumis (which I'd like to see for various reasons) need to be handled locally as well, to avoid globally synhronizing on lumi boundaries.

For the synchronization for threaded access, I have not thought about that much for now. I guess what works for the ConcurrentME should also work for this solution, and getting some form of fillAtomic as a new API in ROOT seemed possible as well. The interesting question is what to do with the modules that do use the bare TH1 pointer: as long it does not get stored, we could maybe hand out a smart pointer that keeps a lock until it goes out of scope? Though then we get closer to edm::one semantics if this feature is used. Disallowing TH1 access also seems possible, if we keep a "legacy" API around for harvesting.

Dr15Jones commented 5 years ago

You do not want to use a lock with a long scope. Way to easy to get into deadlock situations. Instead you could use the same mechanism as DQMStore::bookTransaction uses to get access to the IBooker. So you would get rid of all the getTH* functions and replace them something like

template<typename F> void accessTH1F( F&& iFunc);

Where iFunc is required to take as argument a TH1F* . Then inside accessTH1F you take the lock and when the return from it you release the lock.

schneiml commented 5 years ago

Yep, that should work, though it is a bit more intrusive. Let's hope there is not many cases that need to use this...

jfernan2 commented 3 years ago

Is this issue still alive? My impression is it is NOT, given the changes made by Marcel on DQMStore in 11_1_X

makortel commented 3 years ago

Good question. I think too that most of the issue has been addressed.

There is still code using getTH1(), which are also discussed in #29096.

There are still some DQMOneLumiEDAnalyzer modules, which are also discussed in https://github.com/cms-sw/cmssw/issues/25090.

@Dr15Jones What do you think, are there aspects in this issue that are not being followed up (also) elsewhere?