nsmith- opened this issue 2 years ago
Hi, interesting observation! Yes, the commit of mine that you linked probably makes the calculatePartialBinVolume part negligible.
There is still the manual heap allocation of a new RooArgSet here, which you can also see in your plot:
https://github.com/root-project/root/blob/v6-22-00-patches/roofit/roofitcore/src/RooDataHist.cxx#L1478
That one was also replaced with a RooArgSet on the stack in 6.26:
https://github.com/root-project/root/commit/87ffe8affa57e2daa0ab02f177c20be03960162f
So with 6.26, all of that should go down to almost zero.
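The pattern behind that 6.26 change can be illustrated with a minimal standalone sketch. This is not the actual ROOT code, just the allocation pattern: a per-call heap allocation in a hot path versus none at all.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch only, not the actual ROOT code.

// Before: a scratch object is heap-allocated on every call, so a hot loop
// pays one allocation (and deallocation) per event.
double sumWithHeapScratch(const std::vector<double>& bins) {
    auto* scratch = new std::vector<double>(bins); // per-call heap allocation
    double total = 0.0;
    for (double v : *scratch) total += v;
    delete scratch;
    return total;
}

// After: the scratch state lives on the stack (or is eliminated entirely),
// so the hot loop does no heap traffic at all.
double sumWithStackScratch(const std::vector<double>& bins) {
    double total = 0.0;
    for (double v : bins) total += v;
    return total;
}
```

Both functions compute the same result; only the second keeps the per-event cost off the allocator, which is what the flame graph above was dominated by.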
I could backport these improvements to 6.22 and 6.24, but we probably won't do a patch release just for that, since it isn't even a bugfix but a performance improvement. I'd have a hard time arguing for it: ROOT doesn't usually do patch releases for older releases after the next major release is out. It's also not clear to me whether it would end up in the CMSSW release.
I wonder if any subset of that might be usable in 6.22.
For sure these improvements could be cherry-picked on top of 6.22. But how are you thinking of making this happen? With a patched ROOT version in CMSSW? In any case, I'd be happy to help with whatever is necessary to speed up your workflows.
With #776 and LCG102, we indeed spend a total of 30 minutes building the Asimov for the full Run 2 combination, versus previously at least 6 hours (I gave up on the perf run for that one). The left-heavy flame graph now looks like this:
Most of the time in RooDataHist::sum is now self-time, which is great.
So let's call it one more thing to look forward to with ROOT 6.26 validation. I don't think it would be worth it at this point to try to backport these changes, since it seems we are close to being able to use 6.26 directly.
It turns out this performance issue is also showing up during fitting:
Do you know if there is any way to somehow disable this caching? Also, is the existing version by chance a source of memory leaks/bloat?
I have a vague recollection of finding this too (I think for HZZ, but not sure now), and concluding that we couldn't do much, short of writing a more efficient pdf ourselves. There is caching for a basic integral over the full histogram, but IIRC there's a norm over slices here and that doesn't have caching support.
Interesting stack trace; there are some things I can understand from it.
So you have this loop where you iterate over observable values and evaluate your PDF: https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/102x/src/CachingNLL.cc#L342
It seems to me that calculatePartialBinVolume is then called for each event; otherwise it would not be so expensive, right?
But actually this should not happen. calculatePartialBinVolume is called in RooRealIntegral, where the integral in question is the integral of a RooHistPdf. But a RooHistPdf has no parameters, so the integral doesn't depend on any variables. RooFit should recognize this and not recompute these integrals.
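The caching behavior RooFit is expected to provide can be sketched with a toy dirty-flag node; the names here are illustrative, not RooFit's actual API. A cached node recomputes only after a server has signalled a change, so an integral that depends on no parameters would be evaluated once and then served from cache.

```cpp
#include <cassert>
#include <functional>

// Toy sketch of dirty-state value caching; names are illustrative, not
// RooFit's API. The node recomputes only when a server has flagged a change.
struct CachedNodeSketch {
    std::function<double()> evaluate; // stands in for the expensive integral
    bool dirty = true;                // set when a server value changes
    double cachedValue = 0.0;
    int nEvaluations = 0;             // instrumentation for the example

    double getVal() {
        if (dirty) {
            cachedValue = evaluate();
            ++nEvaluations;
            dirty = false;
        }
        return cachedValue;
    }
    void serverValueChanged() { dirty = true; }
};
```

Repeated getVal() calls without an intervening serverValueChanged() perform the expensive evaluation only once; the symptom in the stack trace is consistent with this mechanism not kicking in.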
Digging further, these integrals are created here and put in a RooCheapProduct:
https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/102x/src/VerticalInterpPdf.cc#L283
What I would do is, before adding these integrals to prodIntComps, print them with Print("v"). That way you can see their client-server structure. Maybe that gives us a hint as to where the client-server structure is messed up such that RooFit doesn't realize that recomputing these integrals is unnecessary.
Or do you somewhere disable RooFit's internal caching based on object state, via RooAbsArg::setOperMode() for example?
Also, this stack trace seems to be from ROOT 6.24. Wasn't calculatePartialBinVolume significantly sped up in ROOT 6.26?
I was unclear earlier, but yes, the last stack trace is still from ROOT 6.22. Initially I hoped the slowdown only affected creation of the pseudo-Asimov for HZZ, but now I see it also affects the fit (presumably also from HZZ, but I am not certain). I just today tried a fit in ROOT 6.26 (from LCG102) and things look much better:
compared to previously (now starting the flame graph at CachingPdf::eval):
Even though it is faster in 6.26, we may still want to understand whether this re-evaluation is necessary. It looks like the code in question was introduced in #680. The RooCheapProduct we see taking so much time here is being constructed for the branch taken on L233 in
https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/7454abd509fe5524d4c3607ffee8f230c781245a/src/VerticalInterpPdf.cc#L229-L235
where func is a SimpleProdPdf, which then leads to
https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/7454abd509fe5524d4c3607ffee8f230c781245a/src/SimpleProdPdf.cc#L111
where one of the terms is a functional form describing $m_{ZZ}$ and the other is a RooHistPdf of the simulation distribution in $(m_{ZZ}, D_{kin})$. The latter is a conditional pdf, so an integral over $D_{kin}$ is constructed. An example RooHistPdf:
--- RooAbsArg ---
Value State: FORCED DIRTY
Shape State: DIRTY
Attributes: [SnapShot_ExtRefClone]
Address: 0x29992410
Clients:
(0x2998f140,V-) SimpleProdPdf::shapeBkg_zjets_ch3_ch60 ""
(0x29a76e00,--) RooRealIntegral::histpdfzx_4mu2018_ch3_ch60_Int[dbkg_kin] "Integral of histpdfzx_4mu"
Servers:
(0x19a6de20,V-) RooRealVar::ZZMass ""
(0x19a6e540,V-) RooRealVar::dbkg_kin ""
Proxies:
pdfObs ->
1) ZZMass
2) dbkg_kin
--- RooAbsReal ---
Plot label is "histpdfzx_4mu2018_ch3_ch60"
Batch data access not initialised.
--- RooAbsPdf ---
Cached value = 1.1014
and its integral:
--- RooAbsArg ---
Value State: DIRTY
Shape State: DIRTY
Attributes:
Address: 0x29a76e00
Clients:
Servers:
(0x29992410,--) RooHistPdf::histpdfzx_4mu2018_ch3_ch60 "histpdfzx_4mu"
(0x19a6de20,V-) RooRealVar::ZZMass ""
(0x19a6e540,-S) RooRealVar::dbkg_kin ""
Proxies:
!sumList ->
!intList ->
!anaList ->
1) dbkg_kin
!jacList ->
!facList ->
!func -> histpdfzx_4mu2018_ch3_ch60
!sumCat ->
--- RooAbsReal ---
Plot label is "histpdfzx_4mu2018_ch3_ch60_Int[dbkg_kin]"
Batch data access not initialised.
--- RooRealIntegral ---
Integrates histpdfzx_4mu2018_ch3_ch60[ pdfObs=(ZZMass,dbkg_kin) ]
operating mode is Analytic
Summed discrete args are ()
Numerically integrated args are ()
Analytically integrated args using mode 5 are (dbkg_kin)
Arguments included in Jacobian are ()
Factorized arguments are ()
Function normalization set <none>
As best I can tell, the integrals then get re-evaluated as the observable ZZMass is updated, hence the constant re-evaluation of calculatePartialBinVolume. It should at least be possible to cache the values for the 20 ZZMass bins and return those quickly, though. I am not sure where that falls through.
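The per-bin caching suggested here could look roughly like the following sketch; all names are hypothetical, not actual RooFit code. The expensive per-slice quantity depends only on the bin index of the sliced observable, so it can be memoized the first time each bin is seen.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical sketch (not actual RooFit code) of per-bin memoization: with
// a small, fixed number of bins, only the first lookup per bin pays the
// calculatePartialBinVolume-like cost.
struct PartialVolumeCacheSketch {
    std::unordered_map<int, double> cache; // bin index -> cached result
    int nComputed = 0;                     // instrumentation for the example

    // 'computeVolume' stands in for the expensive per-slice computation.
    double get(int binIndex, double (*computeVolume)(int)) {
        auto it = cache.find(binIndex);
        if (it != cache.end()) return it->second; // cache hit: no recomputation
        ++nComputed;
        return cache[binIndex] = computeVolume(binIndex);
    }
};
```

With 20 bins, at most 20 expensive computations would ever be done per cache, regardless of how many events the fit iterates over.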
Okay, then I have an idea. Even though you have these conditional integrals, they still don't depend on any parameters in the RooHistPdf case, only on observables. So I think we could replace these compPdf->createIntegral(*selObs) calls in VerticalInterpPdf::makeConditionalProdPdfIntegral in the case where compPdf is a RooHistPdf.
Instead of returning this over-complicated integral object, we could return a new RooHistPdf where the underlying RooDataHist is derived from the RooDataHist that compPdf is based on, simply by summing over the selObs dimensions.
Would this make sense to you? I can also help to implement this, as I'm also interested in having such an optimization for RooHistPdf::createIntegral upstream. From your 6.26 benchmarks I can see that the situation is better, but I think you can make that part of the flame graph disappear entirely with this optimization.
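The proposed replacement boils down to marginalizing the histogram once, up front, instead of integrating per event. A minimal plain-array sketch (not RooFit API) of summing a 2D histogram over the conditional dimension:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain-array sketch of the proposed optimization, not RooFit API: sum the
// 2D bin contents over the conditional ("selObs") dimension, yielding a 1D
// histogram that could back a new RooHistPdf-like object built once.
std::vector<double> marginalizeOverConditional(const std::vector<std::vector<double>>& hist2d) {
    std::vector<double> hist1d(hist2d.size(), 0.0);
    for (std::size_t ix = 0; ix < hist2d.size(); ++ix)
        for (double content : hist2d[ix])
            hist1d[ix] += content; // sum the conditional dimension away
    return hist1d;
}
```

The marginalization is done a single time at construction; every subsequent "integral" evaluation is then just a bin lookup in the 1D result.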
The other thing I am puzzled about is why this was a non-issue in ROOT 6.12 (CMSSW 10_2_X). The same allocation that is taking all this time is, according to valgrind, apparently never freed:
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
77 83,987,520,475 675,197,200 642,625,036 32,572,164 0
78 84,836,306,029 675,197,200 642,625,036 32,572,164 0
79 85,540,946,539 917,140,352 884,565,508 32,574,844 0
96.45% (884,565,508B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->48.05% (440,695,640B) 0x6FFA629: TStorage::ObjectAlloc(unsigned long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libCore.so)
| ->25.51% (234,000,000B) 0x5805BEC: RooArgSet::operator new(unsigned long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x583DCAD: RooDataHist::sum(RooArgSet const&, RooArgSet const&, bool, bool) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x58B8AD9: RooHistPdf::analyticalIntegral(int, char const*) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x57AB771: RooAbsPdf::analyticalIntegralWN(int, RooArgSet const*, char const*) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x593783D: RooRealIntegral::evaluate() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x57C56D2: RooAbsReal::traceEval(RooArgSet const*) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x5935F6B: RooRealIntegral::getValV(RooArgSet const*) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc900/lcg/root/6.22.08-ljfedo/lib/libRooFitCore.so)
| | ->25.51% (234,000,000B) 0x4DF9199: getVal (RooAbsReal.h:96)
| | ->25.51% (234,000,000B) 0x4DF9199: RooCheapProduct::evaluate() const (RooCheapProduct.cc:44)
and is a source of a huge memory leak in long-running fits. I'm somewhat puzzled, because it looks to be deleted properly in https://github.com/root-project/root/blob/v6-22-08/roofit/roofitcore/src/RooDataHist.cxx#L1478-L1483
Instead of returning this over-complicated integral object, we could return a new RooHistPdf where the underlying RooDataHist is derived from the RooDataHist that compPdf is based on, simply by summing over the selObs dimensions. Would this make sense to you? I can also help to implement this, as I'm also interested in having such an optimization for RooHistPdf::createIntegral upstream. From your 6.26 benchmarks I can see that the situation is better, but I think you can make that part of the flame graph disappear entirely with this optimization.
Absolutely! I think this would also be an effective workaround for the memory leak as well, since the resulting RooHistPdf does not need to be summed at all.
It's quite likely that the memory leak is related to the memory pool for RooArgSets. It would not be the first time that pool has been the cause of a mysterious memory leak.
The thing is: RooFit caches normalization integrals using the memory address of the RooArgSet with the normalized-over observables as the key. As memory might be randomly reused, there were rare cases where the cache returned integrals over a completely unrelated norm set, just because it happened to live at the same address earlier.
The "solution" originally implemented in RooFit was to overload operator new and operator delete for RooArgSet, such that addresses were obtained from a memory pool that more or less tried never to hand out the same address twice (here is the original implementation before ROOT 6.14).
With ROOT 6.14, the memory pool was refactored and made more rigorous about remembering which addresses had already been used (the original implementation only remembered the last N RooArgSet addresses for some large N, on the assumption that memory reuse at the same location would usually happen close together in time).
That means that from ROOT 6.14 on, memory will increasingly "leak" if you create RooArgSets on the heap, because the pool remembers the ranges of all addresses used so far. The same goes for RooDataSet.
With ROOT 6.28, this will be a non-issue again, because this time I applied the right fix: RooFit was redesigned such that things are no longer cached by pointer address. See also the 6.28 docs of RooArgSet ("Uniquely identifying RooArgSet objects"). The custom memory pool was therefore disabled.
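The 6.28-style fix can be sketched in isolation; this is illustrative, not the actual RooFit implementation. Instead of keying caches on a pointer address, which the allocator may later hand out again for an unrelated object, every set is stamped with a monotonically increasing unique ID at construction, and caches key on that.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of identifying objects by a unique ID rather than by
// their (reusable) pointer address; not the actual RooFit code.
struct ArgSetSketch {
    static inline std::uint64_t nextId = 0; // process-wide counter (C++17 inline variable)
    const std::uint64_t uniqueId;
    ArgSetSketch() : uniqueId(nextId++) {}  // each instance gets a fresh ID
};
```

Two sets are always distinguishable by uniqueId, even if the second happens to be allocated at the address the first used to occupy, which is exactly the collision the old memory pool tried (expensively) to avoid.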
Okay! I will work on this. Feel free to assign the issue to me so it doesn't get lost over the Christmas break :+1:
Unfortunately I haven't made progress on this yet, but just for the record, there is also a ROOT Jira ticket from 2020 about this: https://sft.its.cern.ch/jira/browse/ROOT-10519
Maybe the insights in the discussion there will be helpful when I fix the problem.
The generation of the pseudo-Asimov for unbinned channels (toymcoptutils::SinglePdfGenInfo::generatePseudoAsimov) can take a very long time (several hours) in initial fits. From profiling, most of the time is in what appear to be memory allocation calls inside RooFit code, in particular RooDataHist::calculatePartialBinVolume. This is in the 112x-comb2022 branch with ROOT 6.22.
I see that @guitargeek recently made some improvements to this section of the code in https://github.com/root-project/root/commit/6120b9fe3679f4c2080b405b586482edb65302fd
I wonder if any subset of that might be usable in 6.22. Meanwhile I will try with 6.26 soon to see if the performance is better there.
NB: the flamegraphs were made using `perf` and then imported into https://www.speedscope.app/