Closed IzaakWN closed 5 months ago
- `gethist2D`, using the exact same `Sample.getrdframe` class method as for 1D histograms (`TH1D`). This can be done by passing a list of `Variable` pairs, i.e. a list of 2-tuples of the form `[(xvar,yvar), ... ]`. A unit test / example is given in `Plotter/test/testRDataFrame_Sample.py`. If a y variable is passed, the histogram will be booked with `RDataFrame.Histo2D(hmodel,xvar,yvar,weight)` (as a `RDF.RResultPtr<TH2D>`) instead of `RDataFrame.Histo1D(hmodel,xvar,weight)`.
- `Sample.getsumw` class method, which computes the sum-of-weights for one or more selections using RDataFrame via the same `Sample.getrdframe` method used for histograms. This can be done in parallel with either histograms or means if the keyword argument `sumw=True` is passed to `Sample.getrdframe`. The sum-of-weights is added to the `ResultDict` as a `RDF.RResultPtr<double>` under the string key `'sumw'` for a given selection.
- `Sample.getmean` class method, which computes the mean of one or more variables (and for one or more selections) using RDataFrame via `Sample.getrdframe` as well. If the keyword argument `mean=True` is passed to `Sample.getrdframe`, the mean of each given variable will be booked with `RDataFrame.Mean(xvar,weight)` (as a `RDF.RResultPtr<double>`) as opposed to `RDataFrame.Histo1D(hmodel,xvar,weight)`.
- `MeanResult` class in `Plotter/python/sample/ResultDict.py`, which is basically a pair of `RDF.RResultPtr<double>` results: one for the mean of a variable, and one for the sum-of-weights. This can be used by `MergedResult` to correctly combine the means of variables of merged samples, taking into account the sum-of-weights of each subsample (i.e. a weighted sum of means).

The `(Merged)Sample.gethist(2D)` and `SampleSet.gethists` class methods based on python multithreading and MultiDraw are now replaced with new RDataFrame methods of the same name (e.g. `gethist_rdf` → `gethist`). For debugging and comparison purposes, a branch with the latest version of MultiDraw is stored here: https://github.com/IzaakWN/TauFW/tree/python3_RDF_MultiDraw

I think this PR is mostly done. What's left is testing with a "real" example, comparing the plots (by `plot*.py`) and datacard inputs (ROOT files from `createcards*.py`) between this branch and the current master. If the results are consistent, and there are no bugs, we can merge this PR and create a new release version of the TauFW.
Note that the new `(Merged)Sample.gethist(2D)`, `SampleSet.gethists`, and `SampleSet.getstack` methods using RDataFrame should be implemented such that the user does not notice a difference in the output (even though multiple selections are now allowed). This means that user scripts do not need to be updated, unless users want to parallelize over multiple selections.
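One way to keep old user scripts working while allowing multiple selections is to normalize the selection argument at the entry point. The snippet below is only a sketch of that idea in plain Python: `gethist_compat` and `fake_book` are hypothetical names, not TauFW functions, and the actual methods may handle this differently.

```python
def gethist_compat(variables, selections, book):
    """Hypothetical wrapper: accept one selection string or a list of them.

    A single string returns a plain list of histograms (old behaviour);
    a list of strings returns a dictionary keyed by selection.
    """
    single = isinstance(selections, str)
    sel_list = [selections] if single else list(selections)
    hists = {sel: [book(var, sel) for var in variables] for sel in sel_list}
    return hists[selections] if single else hists


def fake_book(var, sel):
    """Stand-in for booking a histogram; returns a label instead of a TH1."""
    return f"hist({var}; {sel})"


old_style = gethist_compat(["pt", "eta"], "pt>20", fake_book)
new_style = gethist_compat(["pt"], ["pt>20", "pt>50"], fake_book)
```

With a single selection string the caller gets the same kind of list as before; only callers that pass a list see the per-selection dictionary.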
Changed the target branch to `hackathon`, which will be the development branch during the CAT Hackathon (2/2024): https://gitlab.cern.ch/groups/cms-tau-pog/-/epics/1
Implement ROOT's RDataFrame to address https://github.com/cms-tau-pog/TauFW/issues/51
Motivation
The old routines relied on `MultiDraw` to create multiple histograms per event loop for a single selection string, and on python multithreading to run `(Merged)Sample.gethist` routines in parallel. RDataFrame should now handle a lot of the optimization itself. Additionally, it allows running the event loop once per ROOT file for multiple selections (i.e. "filters"), filling multiple histograms. Note that RDataFrame parallelizes over clusters in given trees, so the more clusters a tree has, the more it can be parallelized. There should also be a gain if multiple RDataFrames are run concurrently with `RDF.RunGraphs`.
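To illustrate why a single event loop over several selections helps, here is a minimal pure-Python sketch of the idea. It does not use ROOT or the RDataFrame API; `fill_histograms`, the event dictionaries, and the selection lambdas are all hypothetical stand-ins.

```python
def fill_histograms(events, selections, variables, nbins, xmin, xmax):
    """Fill one histogram per (selection, variable) in a single pass over events."""
    width = (xmax - xmin) / nbins
    hists = {s: {v: [0.0] * nbins for v in variables} for s in selections}
    for event in events:  # the only loop over events
        weight = event.get("weight", 1.0)
        for name, passes in selections.items():
            if not passes(event):
                continue
            for var in variables:
                x = event[var]
                if xmin <= x < xmax:  # ignore under- and overflow
                    hists[name][var][int((x - xmin) / width)] += weight
    return hists


events = [
    {"pt": 25.0, "eta": 0.5, "weight": 1.0},
    {"pt": 45.0, "eta": -1.2, "weight": 2.0},
    {"pt": 75.0, "eta": 2.0, "weight": 0.5},
]
selections = {
    "all": lambda e: True,
    "central": lambda e: abs(e["eta"]) < 1.5,
}
hists = fill_histograms(events, selections, ["pt"], nbins=4, xmin=0.0, xmax=100.0)
```

Each event is read once and tested against every selection, instead of re-reading the tree once per selection string.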

Implementation
New container classes for output
- `ResultDict` class, which is basically a set of nested dictionaries pointing to `RDF.RResultPtr<TH1D>` objects created by `RDataFrame.Filter(selection).Histo1D(variable)` for a given selection key, variable key, and sample key.
- `MergedResult` class, which is basically a list of booked `RDF.RResultPtr<T>` objects, so their resulting values can be linearly added after the event loop (e.g. summing histograms of a `MergedSample` with `TH1D::Add`).
- `HistSet` to simplify:
- `HistDict`, which is similar to `ResultDict`, but is a set of nested dictionaries to collect `TH1` histograms via `HistSet`.
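The container idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `LazyResult` is a hypothetical stand-in for `RDF.RResultPtr<T>`, histograms are represented as lists of bin contents, and `merge_means` shows the sum-of-weights weighting described for `MeanResult`. None of this is the TauFW implementation.

```python
class LazyResult:
    """Hypothetical stand-in for RDF.RResultPtr<T>: computed on first GetValue()."""

    def __init__(self, compute):
        self._compute = compute
        self._cache = None

    def GetValue(self):
        if self._cache is None:
            self._cache = self._compute()
        return self._cache


class MergedResult(list):
    """List of booked results whose values are added bin by bin after the loop."""

    def GetValue(self):
        values = [result.GetValue() for result in self]
        total = list(values[0])
        for value in values[1:]:
            total = [a + b for a, b in zip(total, value)]
        return total


def merge_means(pairs):
    """Combine (mean, sumw) pairs into one mean weighted by the sum-of-weights."""
    pairs = list(pairs)
    sumw = sum(w for _, w in pairs)
    return sum(m * w for m, w in pairs) / sumw if sumw else 0.0


# Nested dictionaries: results[selection][variable][sample]
results = {
    "baseline": {
        "m_vis": {
            "DY": MergedResult([
                LazyResult(lambda: [1.0, 2.0, 3.0]),  # subsample 1
                LazyResult(lambda: [0.5, 0.5, 0.5]),  # subsample 2
            ]),
        },
    },
}
merged_hist = results["baseline"]["m_vis"]["DY"].GetValue()
merged_mean = merge_means([(90.0, 100.0), (60.0, 300.0)])
```

Nothing is computed when the dictionary is built; values are only evaluated on `GetValue()`, which mirrors the book-first, run-later pattern.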

New routines
- `Sample.getrdframe`, which creates a single RDataFrame object with a chain of filters for a given list of selections, and books a set of histograms for a given list of variables. Basically:
  - Selections are chained with `RDataFrame.Filter`.
  - Booked results are stored in a `ResultDict` to prepare the event loop before it is run.
- `MergedSample.getrdframe`.
  - The `ResultDict` objects from the subsamples are combined, where the `RDF.RResultPtr<T>` objects for the subsamples are merged into a single `MergedResult` object per selection/variable. This is done so the resulting histograms can be summed correctly into one single histogram after the event loop is run.
  - The result is a `ResultDict` of `MergedResult` objects.
- `SampleSet.gethists_rdf` to replace `SampleSet.gethists`:
  - Results are collected in a single `ResultDict` object using `ResultDict.update`.
  - The event loops are run by `ResultDict.run`, which uses `RDF.RunGraphs` to concurrently run a list of all `RDF.RResultPtr<T>` in the `ResultDict` object for optimal performance.
  - Results are converted to `HistSet` objects by `ResultDict.gethistset`.
  - `HistDict.results`.
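The book, update, run, convert flow described above can be sketched in plain Python. This is a toy model only: `Booked` and `ToyResultDict` are hypothetical classes, and a `ThreadPoolExecutor` stands in for `RDF.RunGraphs`, which is an assumption made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class Booked:
    """Hypothetical stand-in for a lazily evaluated RDF.RResultPtr<T>."""

    def __init__(self, compute):
        self.compute = compute
        self.value = None

    def run(self):
        self.value = self.compute()


class ToyResultDict:
    """Toy container: data[selection][variable][sample] -> Booked."""

    def __init__(self):
        self.data = {}

    def book(self, selection, variable, sample, compute):
        samples = self.data.setdefault(selection, {}).setdefault(variable, {})
        samples[sample] = Booked(compute)

    def update(self, other):
        """Merge the booked results of another container into this one."""
        for selection, variables in other.data.items():
            for variable, samples in variables.items():
                self.data.setdefault(selection, {}).setdefault(variable, {}).update(samples)

    def flat(self):
        """Return all booked results as one flat list."""
        return [b for vs in self.data.values() for ss in vs.values() for b in ss.values()]

    def run(self, nthreads=2):
        """Evaluate all booked results concurrently (stand-in for RunGraphs)."""
        with ThreadPoolExecutor(max_workers=nthreads) as pool:
            list(pool.map(lambda booked: booked.run(), self.flat()))

    def values(self):
        """Convert to nested dictionaries of evaluated values."""
        return {
            sel: {var: {s: b.value for s, b in ss.items()} for var, ss in vs.items()}
            for sel, vs in self.data.items()
        }


dy = ToyResultDict()
dy.book("baseline", "m_vis", "DY", lambda: [1, 2])
tt = ToyResultDict()
tt.book("baseline", "m_vis", "TT", lambda: [3, 4])

combined = ToyResultDict()
for per_sample in (dy, tt):
    combined.update(per_sample)
combined.run()
hists = combined.values()
```

The point of the pattern is that every sample books its results first, and all of them are evaluated together in one concurrent step.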

Changes to existing code
- `HistSet` (see above) used to be a dictionary of variables pointing to (lists of) histograms.
- The `QCD_OSSS` method now takes a list of variables and a list of selections (instead of just a single selection).

Tools
- Helper functions in `TauFW/common/python/tools/RDataFrame.py`, like:
  - `SetNumberOfThreads`, calling `ROOT.EnableImplicitMT(nthreads)`,
  - `AddRDFColumn`, which can be used to define a new RDataFrame column for a mathematical expression, while ensuring a unique name,
  - `printRDFReport`, for printing RDataFrame reports (basically a cutflow table for a bunch of filters/selections).

Validation
- Tested with `TauFW/Plotter/test/testRDataFrame_Sample.py`, and compared to the old `MultiDraw` & python multithreading. The results look identical for pseudo data and MC samples, and RDataFrame seems to scale better for a larger number of threads (>4) and selections.

Plans
- Support 2D histograms via `Sample.getrdframe`, using `RDataFrame.Histo2D`.
- Remove the old `(Merged)Sample.gethist` and `SampleSet.gethists` routines that relied on `MultiDraw` and python multithreading, and replace them with class methods of the same name that use RDataFrame.