cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars · 4.28k forks

Is there a better way to track time dependent information in DQM? #19270

Open davidlange6 opened 7 years ago

davidlange6 commented 7 years ago

We are getting requests for a lot of new histograms with ~2500 bins on the x axis (sometimes profiles, sometimes 2D) whose purpose is to monitor things vs lumi section (2500 corresponds to a 16-hour run). E.g., I think one PR has 500 (y bins) × 2500 (lumis) × 10 (instances) [then ×8 cores = 100 MB].

While perhaps one job will process an entire run online, I would think no more than a handful (or two) of lumi sections are processed in a prompt reco job (so >25× more memory is allocated than needed).

So, do we not have a way to move this memory-expensive construct to where the statistics for a run are aggregated? If not, are there thoughts on how to make one?

davidlange6 commented 7 years ago

assign dqm

cmsbuild commented 7 years ago

New categories assigned: dqm

@vanbesien,@vazzolini,@kmaeshima,@dmitrijus you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 7 years ago

A new Issue was created by @davidlange6 David Lange.

@davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

davidlange6 commented 7 years ago

assign core

cmsbuild commented 7 years ago

New categories assigned: core

@Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 7 years ago

@davidlange6 I would think that the aggregation across lumis could be done in the Harvesting step.

davidlange6 commented 7 years ago

can you think of an example to point people towards?


Dr15Jones commented 7 years ago

I have no idea if anyone has done such a thing; I was only saying that I could imagine one could do it. The Harvesting step allows someone to read MonitorElements from the DQMStore and then create new MonitorElements based on those previous ones. Off the top of my head, I can't think of why the read MonitorElements couldn't come from the LuminosityBlock transition, with the final MonitorElement finished at end-job time (which is where the DQMEDHarvester does its final call for MonitorElements).

VinInn commented 7 years ago

I think we should ask ROOT to implement histograms that do not allocate (intervals of) bins if they are not filled. It can be done under the hood (if I book [0, 2500] and fill only bin 11, it allocates only the one bin corresponding to 11, and so on); when histograms are summed, the whole interval (say 9-250) is properly filled. Pretty trivial.
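The idea can be sketched outside ROOT. Below is a minimal, hypothetical illustration (plain C++, not ROOT code; the type and member names are made up): a fixed nominal axis where only touched bins occupy memory, and merging visits only the union of filled bins.

```cpp
#include <cassert>
#include <map>

// Hypothetical sketch of a 1D "histogram" over a fixed axis [0, nbins)
// that stores only the bins that have actually been filled.
struct SparseBins {
    int nbins;                      // nominal axis size, e.g. 2500 lumi sections
    std::map<int, double> content;  // bin index -> accumulated weight

    void fill(int bin, double w = 1.0) {
        if (bin >= 0 && bin < nbins)
            content[bin] += w;      // allocates one map node, not 2500 bins
    }
    // Merging two sparse histograms touches only the bins either one filled.
    void add(const SparseBins& other) {
        for (const auto& [bin, w] : other.content)
            content[bin] += w;
    }
    double at(int bin) const {
        auto it = content.find(bin);
        return it == content.end() ? 0.0 : it->second;
    }
};
```

A job that fills only a handful of lumi sections then pays for a handful of map nodes, while the sum over many jobs still covers the whole nominal range.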

davidlange6 commented 7 years ago

A good thought. I realize it's already done for the example of the lumi section number: if we know the lumi section number at booking time, make a one-bin histogram and call SetCanExtend(). But can the lumi section be known when the histograms are booked? I would think not.


davidlange6 commented 7 years ago

or has anyone used THnSparse?


davidlange6 commented 7 years ago

Ignore SetCanExtend - it doesn't change the # of bins when extending the axis...


VinInn commented 7 years ago

Otherwise, do it yourself: a map (with a single entry most of the time) of single-bin histograms, and fill the proper histogram in harvesting. A waste of intelligence, OK, but it is ROOT developers' intelligence...

venturia commented 7 years ago

@davidlange6 "Ignore SetCanExtend" if you want to be sure of having one bin for each lumisection. On the other hand, if you want trend plots of average values where one bin covers one or more (but not too many) lumisections, SetCanExtend is the right thing to do. This is what we are doing in the Tracker, and we are pretty happy with it. SetCanExtend does everything for us: it changes the binning when the number of lumisections passes a threshold value, and it properly handles the merging of the histograms (TProfile) even if they end up having different numbers of bins.
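For readers unfamiliar with the extendable-axis behavior being discussed: a rough sketch of the mechanism (an assumption about how it works, not ROOT's actual implementation) is that the bin count stays fixed while the range doubles, merging adjacent bins pairwise, so later bins cover more lumi sections but no content is lost.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rough sketch (not ROOT code) of an extendable trend-plot axis: the number
// of bins stays fixed; when a fill lands past xmax, the range is doubled and
// adjacent bins are merged pairwise, preserving the total content.
struct ExtendableHist {
    double xmin, xmax;
    std::vector<double> bins;

    ExtendableHist(int n, double lo, double hi) : xmin(lo), xmax(hi), bins(n, 0.0) {}

    void fill(double x, double w = 1.0) {
        while (x >= xmax)
            extend();
        int i = static_cast<int>((x - xmin) / (xmax - xmin) * bins.size());
        bins[i] += w;
    }

    double integral() const {
        double total = 0.0;
        for (double b : bins)
            total += b;
        return total;
    }

private:
    void extend() {  // double the range; merge bin pairs
        std::vector<double> merged(bins.size(), 0.0);
        for (std::size_t i = 0; i < bins.size(); ++i)
            merged[i / 2] += bins[i];
        bins = std::move(merged);
        xmax = xmin + 2 * (xmax - xmin);
    }
};
```

This matches the observation later in the thread that extending does not change the number of bins, only the range each bin covers.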

davidlange6 commented 7 years ago

Interesting. Perhaps it works differently for TProfiles than for TH1Fs (in fact, it probably has to, given what a TProfile does); I will try to confirm.

It still doesn't suppress the 0 on the x axis automatically, however, so if you are processing lumi section 25000, you have 25000 bins.


venturia commented 7 years ago

OK, but we do not book with 25000 bins... One has to be much much less demanding.

davidlange6 commented 7 years ago

Indeed, I was exaggerating by a factor of 10, but 2500 is routine.


venturia commented 7 years ago

2500 is one bin per LS for runs as long as ~16 hours, which looks conservative to me. I see no problem going a factor of 4 lower for plots that do not strictly require "one LS in one bin".

fwyzard commented 7 years ago

The "waste" (whether we have 2500 bins or a factor of 4 fewer) is that the Prompt and Express jobs usually work on a few lumisections at a time. So a sparse histogram storing (and merging) only the non-empty bins would save at least a factor of 100.

dmitrijus commented 7 years ago

'SetCanExtend' is extremely problematic for DQM: it makes merging histograms very slow, and TH1::Add simply rejects such histograms.

This is the 'merging' which is done at several levels: between two threads, then opening files for harvesting and for merging histograms coming out of Online HLT. If you set 'kCanExtend', you somewhat destroy the performance of all of those.

The same exact worry applies to THnSparse. It supports THnSparse::Add, but the implementation seems somewhat slow; I cannot say by how much.

Also, THnSparse uses more memory and is much slower at pretty much every operation. If you fill 1/100 of it, it might be faster, but once it is fully filled it will be much slower and take at least twice as much memory as a regular TH1.

An even bigger issue is that it makes memory usage no longer fixed. TH1 allocates everything once the booking is done, so if cmssw decides to crash due to maxRSS, it will do so within the first event. If we go for THnSparse, the memory grows over time and the crash no longer happens at the convenient first event.

davidlange6 commented 7 years ago

> Also, THnSparse uses more memory and is much slower at pretty much every operation. If you fill 1/100 of it, it might be faster, but once it is fully filled - it will be much slower and take at least twice as much memory as the regular TH1.

Right, so don't fill it fully: merge into a TH1F/TH2F. It shall be interesting to benchmark the two. Can DQM support a THnSparse?

> Even bigger issue is that it makes memory usage no longer fixed. TH1 allocates everything once the booking is done, so if cmssw decides to crash due to maxRSS, it will do so within first event. If we go for THnSparse, the memory grows over time and the crash no longer happens at the convenient first event.

Crashes in real-world CMS don't happen on the "first event"; they happen far into the job as memory creeps up (or on busy events).

fwyzard commented 7 years ago

An alternative would be to roll our own sparse histogram class (and possibly contribute it back to ROOT). For example, reading the Eigen documentation and that of THnSparse, Eigen sparse matrices look to be much better optimised memory-wise, and they could be converted to non-sparse matrices once they are filled above a given threshold.

But we do need somebody to work on it, and benchmark it all..

P.S. Loosely related: I've started to play with "histograms" filled from multiple threads, using atomics vs mutexes.

Dr15Jones commented 7 years ago

Jim Pivarski did a study and showed atomics are a good choice. He had a talk in Core.

fwyzard commented 7 years ago

Do you have a link ?

Dr15Jones commented 7 years ago

I was wrong; Jim only sent that work to me. @jpivarski, would you be willing to share your concurrent histogram filling results here?

jpivarski commented 7 years ago

I did a study of dense, concurrent histograms https://indico.cern.ch/event/607830/contributions/2613395/ to entice someone to implement them as a part of ROOT.

What I found is that copy-and-swap and std::atomic algorithms scale extremely well (the horizontal axis goes to dozens of threads), and that the failure to scale is dominated by something other than contention (two threads wanting to increment the same bin at the same time). We know this because the scaling of a naive implementation (hist[i]++) is not much better. At these GHz rates, it's probably a hardware limitation: serializing on the memory bus or copying from CPU cache to main memory.

Sparse histograms would be a different story, because a sparse histogram is more like a hash map than an array. But given the success of std::atomic&lt;long&gt;[NUM_BINS], it would be worth trying std::map&lt;long, std::atomic&lt;long&gt;&gt;. Or the copy-and-swap method, because there you can see what's going on and measure the collision rate. (std::atomic mysteriously does better than the naive implementation in some cases, and not understanding that makes me wary.)

Oh, and I've put all the code for these studies on https://github.com/diana-hep/parallel-histogram-metrics . Feel free to fork it.
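The atomic approach described above can be sketched in a few lines of standard C++ (this is an illustrative toy along the lines of the linked study, not code from that repository; names and the toy random generator are made up): each thread increments bins of a shared dense histogram with relaxed atomic adds, and no fills are lost.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Toy lock-free dense histogram: many threads fill shared bins concurrently
// via std::atomic increments.
constexpr int NUM_BINS = 100;

void fill(std::vector<std::atomic<long>>& hist, int nfills, unsigned seed) {
    for (int i = 0; i < nfills; ++i) {
        seed = seed * 1664525u + 1013904223u;  // tiny LCG as a stand-in for data
        int bin = static_cast<int>(seed % NUM_BINS);
        hist[bin].fetch_add(1, std::memory_order_relaxed);  // contention-safe
    }
}

long fill_concurrently(int nthreads, int fills_per_thread) {
    std::vector<std::atomic<long>> hist(NUM_BINS);  // zero-initialized bins
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back(fill, std::ref(hist), fills_per_thread, t + 1u);
    for (auto& w : workers)
        w.join();
    long total = 0;
    for (auto& b : hist)
        total += b.load();
    return total;  // equals nthreads * fills_per_thread: no lost increments
}
```

With a plain `long` instead of the atomic, concurrent increments would race and the total would typically come up short; the relaxed atomic add is what makes the sum exact.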

davidlange6 commented 7 years ago

I had a look at using THnSparse vs TH2F.

I made a giant 2D histogram, 2500 bins in x (e.g., lumi sections) and 10k in y (a random variable to plot). Then I filled the plots one lumi section at a time (100k fills per "lumi section"). I did this for a number of lumi sections (either 4 or 1200). Finally, I added the histograms one by one to get the total sum (to simulate merging results across jobs).

When using 4 "lumi sections"

When using 1200 "lumi sections"

So, bugs aside, assuming that not too many lumis are processed in one job and a not-too-memory-hungry merge, this works pretty well (faster than what we currently do). I agree that other solutions could be better, including Eigen, or some data structure that knows it is sparse along one axis and not the others...
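The memory side of this test can be mimicked without ROOT. The toy below (plain C++, not the actual TH2F/THnSparse benchmark; names are illustrative) counts how many bins a dense 2500 × 10000 2D histogram always allocates versus how many a sparse one actually stores when, as in a Prompt/Express job, only a few lumi sections are filled:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// Toy dense-vs-sparse comparison for a 2D lumi-trend histogram.
constexpr int NLUMI = 2500, NY = 10000;

// A dense histogram allocates every bin at booking time, regardless of use.
std::size_t dense_bins() { return static_cast<std::size_t>(NLUMI) * NY; }

// A sparse histogram stores only the (lumi, ybin) cells that were filled.
std::size_t sparse_bins_filled(int nlumi_processed, int fills_per_lumi) {
    std::map<std::pair<int, int>, double> h;  // (lumi, ybin) -> weight
    unsigned seed = 1;
    for (int ls = 0; ls < nlumi_processed; ++ls)
        for (int i = 0; i < fills_per_lumi; ++i) {
            seed = seed * 1664525u + 1013904223u;  // toy random y values
            h[{ls, static_cast<int>(seed % NY)}] += 1.0;
        }
    return h.size();
}
```

For 4 processed lumi sections the sparse store holds at most 4 × 10000 cells against 25 million dense bins, which is the factor-of-hundreds saving discussed above (at the cost of per-node map overhead, per the THnSparse caveats).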

dmitrijus commented 7 years ago

Can DQM support a THnSparse?

We can, and it would most definitely solve the Online HLT case, where only one bin per lumi is filled (in a lumi-binned histogram). My biggest worry is simply misuse: immediately after we enable it, people will start using it for everything, and in ways we have not yet imagined.

In fact, this entire case we are discussing is a misinterpretation of what a histogram is: we are no longer interested in the distribution of some variable, but in a variable's evolution over time (i.e., a time series).

I would love to take a look at THnSparse, but unfortunately I am too busy this week. I can't even review the pull requests properly...

Atomics are great to see, but they have to be supported by and integrated into ROOT's TH1. All DQM design decisions were made around TH1 and its features. Rolling our own implementation might do the trick, but there is simply too much code that relies on TH1 logic. A lot of histograms are not 'standard' histograms either; they are populated with setBinContent() rather than Fill().

fwyzard commented 7 years ago

I opened #19395 for the discussion about concurrency.

dmitrijus commented 7 years ago

@davidlange6 Thanks for the test.

There is an ongoing project to add TH2Poly into the set of DQM supported types, I am adding THnSparseD/F to that list.

jfernan2 commented 2 years ago

+1 TH2Poly is now supported in DQM as far as I can tell

smuzaffar commented 10 months ago

@davidlange6 , can we close this issue?

davidlange6 commented 10 months ago

I’m not aware of a fix/conclusion here