nsmith- opened this issue 2 years ago
Hi, interesting observation! Yes, the commit of mine that you linked probably makes the calculatePartialBinVolume part negligible.
There is still the manual heap allocation of a new RooArgSet here, which you can also see in your plot:
https://github.com/root-project/root/blob/v6-22-00-patches/roofit/roofitcore/src/RooDataHist.cxx#L1478
That one was also replaced with a RooArgSet on the stack in 6.26:
https://github.com/root-project/root/commit/87ffe8affa57e2daa0ab02f177c20be03960162f
So with 6.26, all of that should go down to almost zero.
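The pattern behind that 6.26 change can be illustrated with a minimal standalone sketch. This is not the actual ROOT code, just the allocation pattern: a per-call heap allocation in a hot path versus none at all.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch only, not the actual ROOT code.

// Before: a scratch object is heap-allocated on every call, so a hot loop
// pays one allocation (and deallocation) per event.
double sumWithHeapScratch(const std::vector<double>& bins) {
    auto* scratch = new std::vector<double>(bins); // per-call heap allocation
    double total = 0.0;
    for (double v : *scratch) total += v;
    delete scratch;
    return total;
}

// After: the scratch state lives on the stack (or is eliminated entirely),
// so the hot loop does no heap traffic at all.
double sumWithStackScratch(const std::vector<double>& bins) {
    double total = 0.0;
    for (double v : bins) total += v;
    return total;
}
```

Both functions compute the same result; only the second keeps the per-event cost off the allocator, which is what the flame graph above was dominated by.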
I could backport these improvements to 6.22 and 6.24, but we probably won't do a patch release just for that, since it isn't even a bugfix but a performance improvement. I'd have a hard time arguing for it: ROOT doesn't usually do patch releases for older releases after the next major release is out. It's also not clear to me whether it would end up in the CMSSW release.
I wonder if any subset of that might be usable in 6.22.
For sure these improvements could be cherry-picked on top of 6.22. But how are you thinking of making this happen? With a patched ROOT version in CMSSW? In any case, I'd be happy to help with whatever is necessary to speed up your workflows.
With #776 and LCG102, we indeed spend a total of 30 minutes building the Asimov for the full Run 2 combination, versus previously at least 6 hours (I gave up on the perf run for that one). The left-heavy flame graph now looks like this:
Most of the time in RooDataHist::sum is now self-time, which is great.
So let's call it one more thing to look forward to with ROOT 6.26 validation. I don't think it would be worth it at this point to try to backport these changes, since it seems we are close to being able to use 6.26 directly.
It turns out this performance issue is also showing up during fitting:
Do you know if there is any way to somehow disable this caching? Also, is the existing version by chance a source of memory leaks/bloat?
I have a vague recollection of finding this too (I think for HZZ, but not sure now), and concluding that we couldn't do much, short of writing a more efficient pdf ourselves. There is caching for a basic integral over the full histogram, but IIRC there's a norm over slices here and that doesn't have caching support.
Interesting stack trace; there are some things I can understand from it.
So you have this loop where you iterate over observable values and evaluate your PDF: https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/102x/src/CachingNLL.cc#L342
It seems to me that calculatePartialBinVolume is then called for each event; otherwise it would not be so expensive, right?
But actually this should not happen. calculatePartialBinVolume is called in RooRealIntegral, where the integral in question is the integral of a RooHistPdf. But a RooHistPdf has no parameters, so the integral doesn't depend on any variables. RooFit should recognize this and not recompute these integrals.
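The caching behavior RooFit is expected to provide can be sketched with a toy dirty-flag node; the names here are illustrative, not RooFit's actual API. A cached node recomputes only after a server has signalled a change, so an integral that depends on no parameters would be evaluated once and then served from cache.

```cpp
#include <cassert>
#include <functional>

// Toy sketch of dirty-state value caching; names are illustrative, not
// RooFit's API. The node recomputes only when a server has flagged a change.
struct CachedNodeSketch {
    std::function<double()> evaluate; // stands in for the expensive integral
    bool dirty = true;                // set when a server value changes
    double cachedValue = 0.0;
    int nEvaluations = 0;             // instrumentation for the example

    double getVal() {
        if (dirty) {
            cachedValue = evaluate();
            ++nEvaluations;
            dirty = false;
        }
        return cachedValue;
    }
    void serverValueChanged() { dirty = true; }
};
```

Repeated getVal() calls without an intervening serverValueChanged() perform the expensive evaluation only once; the symptom in the stack trace is consistent with this mechanism not kicking in.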
Digging further, these integrals are created here and put in a RooCheapProduct:
https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/102x/src/VerticalInterpPdf.cc#L283
What I would do is, before adding these integrals to prodIntComps, print them with Print("v"). That way you can see their client-server structure. Maybe that gives us a hint as to where the client-server structure is messed up such that RooFit doesn't realize that recomputing these integrals is unnecessary.
Or do you somewhere disable RooFit's internal caching based on object state, via RooAbsArg::setOperMode() for example?
Also, this stack trace seems to be from ROOT 6.24. Wasn't calculatePartialBinVolume significantly sped up in ROOT 6.26?
I was unclear earlier, but yes, the last stack trace is still from ROOT 6.22. Initially I hoped the slowdown only affected creation of the pseudo-Asimov for HZZ, but now I see it also affects the fit (presumably also from HZZ, but I am not certain). I just today tried a fit in ROOT 6.26 (from LCG102) and things look much better:
compared to previously (now starting the flame graph at CachingPdf::eval):
Even though it is faster in 6.26, we may still want to understand whether this re-evaluation is necessary. It looks like the code in question was introduced in #680. The RooCheapProduct we see taking so much time here is being constructed for the branch taken on L233 in
https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/7454abd509fe5524d4c3607ffee8f230c781245a/src/VerticalInterpPdf.cc#L229-L235
where func is a SimpleProdPdf, which then leads to
https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/7454abd509fe5524d4c3607ffee8f230c781245a/src/SimpleProdPdf.cc#L111
where one of the terms is a functional form describing $m_{ZZ}$ and the other is a RooHistPdf of the simulation distribution in $(m_{ZZ}, D_{kin})$. The latter is a conditional pdf, so an integral over $D_{kin}$ is constructed. An example RooHistPdf:
--- RooAbsArg ---
Value State: FORCED DIRTY
Shape State: DIRTY
Attributes: [SnapShot_ExtRefClone]
Address: 0x29992410
Clients:
(0x2998f140,V-) SimpleProdPdf::shapeBkg_zjets_ch3_ch60 ""
(0x29a76e00,--) RooRealIntegral::histpdfzx_4mu2018_ch3_ch60_Int[dbkg_kin] "Integral of histpdfzx_4mu"
Servers:
(0x19a6de20,V-) RooRealVar::ZZMass ""
(0x19a6e540,V-) RooRealVar::dbkg_kin ""
Proxies:
pdfObs ->
1) ZZMass
2) dbkg_kin
--- RooAbsReal ---
Plot label is "histpdfzx_4mu2018_ch3_ch60"
Batch data access not initialised.
--- RooAbsPdf ---
Cached value = 1.1014
and its integral:
--- RooAbsArg ---
Value State: DIRTY
Shape State: DIRTY
Attributes:
Address: 0x29a76e00
Clients:
Servers:
(0x29992410,--) RooHistPdf::histpdfzx_4mu2018_ch3_ch60 "histpdfzx_4mu"
(0x19a6de20,V-) RooRealVar::ZZMass ""
(0x19a6e540,-S) RooRealVar::dbkg_kin ""
Proxies:
!sumList ->
!intList ->
!anaList ->
1) dbkg_kin
!jacList ->
!facList ->
!func -> histpdfzx_4mu2018_ch3_ch60
!sumCat ->
--- RooAbsReal ---
Plot label is "histpdfzx_4mu2018_ch3_ch60_Int[dbkg_kin]"
Batch data access not initialised.
--- RooRealIntegral ---
Integrates histpdfzx_4mu2018_ch3_ch60[ pdfObs=(ZZMass,dbkg_kin) ]
operating mode is Analytic
Summed discrete args are ()
Numerically integrated args are ()
Analytically integrated args using mode 5 are (dbkg_kin)
Arguments included in Jacobian are ()
Factorized arguments are ()
Function normalization set <none>
As best I can tell, the integrals then get re-evaluated as the observable ZZMass is updated, hence the constant re-evaluation of calculatePartialBinVolume. It should at least be possible to cache the values for the 20 ZZMass bins and return those quickly, though. I am not sure where that falls through.
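The per-bin caching suggested here could look roughly like the following sketch; all names are hypothetical, not actual RooFit code. The expensive per-slice quantity depends only on the bin index of the sliced observable, so it can be memoized the first time each bin is seen.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical sketch (not actual RooFit code) of per-bin memoization: with
// a small, fixed number of bins, only the first lookup per bin pays the
// calculatePartialBinVolume-like cost.
struct PartialVolumeCacheSketch {
    std::unordered_map<int, double> cache; // bin index -> cached result
    int nComputed = 0;                     // instrumentation for the example

    // 'computeVolume' stands in for the expensive per-slice computation.
    double get(int binIndex, double (*computeVolume)(int)) {
        auto it = cache.find(binIndex);
        if (it != cache.end()) return it->second; // cache hit: no recomputation
        ++nComputed;
        return cache[binIndex] = computeVolume(binIndex);
    }
};
```

With 20 bins, at most 20 expensive computations would ever be done per cache, regardless of how many events the fit iterates over.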
Okay, then I have an idea. Even though you have these conditional integrals, they still don't depend on any parameters in the RooHistPdf case, only on observables. So I think we could replace these compPdf->createIntegral(*selObs) calls in VerticalInterpPdf::makeConditionalProdPdfIntegral in the case where compPdf is a RooHistPdf.
Instead of returning this over-complicated integral object, we could return a new RooHistPdf where the underlying RooDataHist is derived from the RooDataHist that compPdf is based on, simply by summing over the selObs dimensions.
Would this make sense to you? I can also help to implement this, as I'm also interested in having such an optimization for RooHistPdf::createIntegral upstream. From your 6.26 benchmarks I can see that the situation is better, but I think you can make that part of the flame graph disappear entirely with this optimization.
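The proposed replacement boils down to marginalizing the histogram once, up front, instead of integrating per event. A minimal plain-array sketch (not RooFit API) of summing a 2D histogram over the conditional dimension:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain-array sketch of the proposed optimization, not RooFit API: sum the
// 2D bin contents over the conditional ("selObs") dimension, yielding a 1D
// histogram that could back a new RooHistPdf-like object built once.
std::vector<double> marginalizeOverConditional(const std::vector<std::vector<double>>& hist2d) {
    std::vector<double> hist1d(hist2d.size(), 0.0);
    for (std::size_t ix = 0; ix < hist2d.size(); ++ix)
        for (double content : hist2d[ix])
            hist1d[ix] += content; // sum the conditional dimension away
    return hist1d;
}
```

The marginalization is done a single time at construction; every subsequent "integral" evaluation is then just a bin lookup in the 1D result.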
The other thing I am puzzled about is why this was a non-issue in ROOT 6.12 (CMSSW 10_2_X). The same allocation that is taking all this time is, according to valgrind, apparently never freed:
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
77 83,987,520,475 675,197,200 642,625,036 32,572,164 0
78 84,836,306,029 675,197,200 642,625,036 32,572,164 0
79 85,540,946,539 917,140,352 884,565,508 32,574,844 0
96.45% (884,565,508B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->48.05% (440,695,640B) 0x6FFA629: TStorage::ObjectAlloc(unsigned long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libCore.so)
| ->25.51% (234,000,000B) 0x5805BEC: RooArgSet::operator new(unsigned long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x583DCAD: RooDataHist::sum(RooArgSet const&, RooArgSet const&, bool, bool) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x58B8AD9: RooHistPdf::analyticalIntegral(int, char const*) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x57AB771: RooAbsPdf::analyticalIntegralWN(int, RooArgSet const*, char const*) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x593783D: RooRealIntegral::evaluate() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x57C56D2: RooAbsReal::traceEval(RooArgSet const*) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x5935F6B: RooRealIntegral::getValV(RooArgSet const*) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x4DF9199: getVal (RooAbsReal.h:96)
| | ->25.51% (234,000,000B) 0x4DF9199: RooCheapProduct::evaluate() const (RooCheapProduct.cc:44)
and is a source of a huge memory leak in long-running fits. I'm somewhat puzzled, because it looks to be deleted properly in https://github.com/root-project/root/blob/v6-22-08/roofit/roofitcore/src/RooDataHist.cxx#L1478-L1483
Instead of returning this over-complicated integral object, we could return a new RooHistPdf where the underlying RooDataHist is derived from the RooDataHist that compPdf is based on, simply by summing over the selObs dimensions. Would this make sense to you? I can also help to implement this, as I'm also interested in having such an optimization for RooHistPdf::createIntegral upstream. From your 6.26 benchmarks I can see that the situation is better, but I think you can make that part of the flame graph disappear entirely with this optimization.
Absolutely! I think this would also be an effective workaround for the memory leak as well, since the resulting RooHistPdf does not need to be summed at all.
It's quite likely that the memory leak is related to the memory pool for RooArgSets. It would not be the first time that pool has been the cause of a mysterious memory leak.
The thing is: RooFit caches normalization integrals using the memory address of the RooArgSet with the normalized-over observables as the key. As memory might be randomly reused, there were rare cases where the cache returned integrals over a completely unrelated norm set, just because it happened to live at the same address earlier.
The "solution" originally implemented in RooFit was to overload operator new and operator delete for RooArgSet, such that addresses were obtained from a memory pool that more or less tried never to hand out the same address twice (here is the original implementation before ROOT 6.14).
With ROOT 6.14, the memory pool was refactored and made more rigorous about remembering which addresses had already been used (the original implementation only remembered the last N RooArgSet addresses for some large N, on the assumption that memory reuse at the same location would usually happen close together in time).
That means that from ROOT 6.14 on, memory will increasingly "leak" if you create RooArgSets on the heap, because the pool remembers the ranges of all addresses used so far. The same goes for RooDataSet.
With ROOT 6.28, this will be a non-issue again, because this time I applied the right fix: RooFit was redesigned such that things are no longer cached by pointer address. See also the 6.28 docs of RooArgSet ("Uniquely identifying RooArgSet objects"). The custom memory pool was therefore disabled.
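The 6.28-style fix can be sketched in isolation; this is illustrative, not the actual RooFit implementation. Instead of keying caches on a pointer address, which the allocator may later hand out again for an unrelated object, every set is stamped with a monotonically increasing unique ID at construction, and caches key on that.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of identifying objects by a unique ID rather than by
// their (reusable) pointer address; not the actual RooFit code.
struct ArgSetSketch {
    static inline std::uint64_t nextId = 0; // process-wide counter (C++17 inline variable)
    const std::uint64_t uniqueId;
    ArgSetSketch() : uniqueId(nextId++) {}  // each instance gets a fresh ID
};
```

Two sets are always distinguishable by uniqueId, even if the second happens to be allocated at the address the first used to occupy, which is exactly the collision the old memory pool tried (expensively) to avoid.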
Okay! I will work on this. Feel free to assign the issue to me so it doesn't get lost over the Christmas break :+1:
Unfortunately I haven't made progress on this yet, but just for the record, there is also a ROOT Jira ticket from 2020 about this: https://sft.its.cern.ch/jira/browse/ROOT-10519
Maybe the insights in the discussion there will be helpful when I fix the problem.
The generation of the pseudo-Asimov for unbinned channels (toymcoptutils::SinglePdfGenInfo::generatePseudoAsimov) can take a very long time (several hours) in initial fits. From profiling, most of the time is in what appear to be memory allocation calls inside RooFit code, in particular RooDataHist::calculatePartialBinVolume. This is in the 112x-comb2022 branch with ROOT 6.22.
I see that @guitargeek recently made some improvements to this section of the code in https://github.com/root-project/root/commit/6120b9fe3679f4c2080b405b586482edb65302fd
I wonder if any subset of that might be usable in 6.22. Meanwhile I will try with 6.26 soon to see if the performance is better there.
NB: the flamegraphs were made using `perf` and then imported into https://www.speedscope.app/