Implement ROOT's RDataFrame to address https://github.com/cms-tau-pog/TauFW/issues/51

Motivation

The current use of multithreading via Python is bad practice because it leads to issues in managing the memory and ownership of ROOT objects. In particular, deleting ROOT histograms created with multithreading in python3 leads to segmentation faults. RDataFrame has been native to ROOT since v6.14, so it should be used going forward.
Optimization: The old TauFW used MultiDraw to create multiple histograms per event loop for a single selection string, and Python multithreading to run the (Merged)Sample.gethist routines in parallel. RDataFrame should now handle a lot of the optimization itself. Additionally, it allows running the event loop once per ROOT file for multiple selections (i.e. "filters"), filling multiple histograms. Note that RDataFrame parallelizes over clusters in the given trees, so the more clusters a tree has, the more it can be parallelized. There should also be a gain if multiple RDataFrames are run concurrently with RDF.RunGraphs.
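To illustrate the "book first, loop once" idea, here is a toy model in plain Python. It is not RDataFrame and not TauFW code: the class ToyFrame, its methods, and the event contents are all hypothetical, and a Counter stands in for a histogram. It only shows that several selections and variables can be filled in a single pass over the events.

```python
# Toy model of lazy booking (NOT RDataFrame): every name here is hypothetical.
from collections import Counter

class ToyFrame:
    """Books selections lazily; one loop over the events fills all of them."""
    def __init__(self, events):
        self.events = events
        self.booked = []  # list of (cuts, key, counter)
        self.nloops = 0   # number of event loops actually run

    def book(self, cuts, key):
        """Book a 'histogram' (a Counter of event[key]) for events passing all cuts."""
        counter = Counter()
        self.booked.append((cuts, key, counter))
        return counter  # stays empty until run() is called

    def run(self):
        """Single pass over the events, filling every booked counter."""
        self.nloops += 1
        for event in self.events:
            for cuts, key, counter in self.booked:
                if all(cut(event) for cut in cuts):
                    counter[event[key]] += 1

events = [{"pt": pt, "q": q} for pt, q in [(10, 1), (25, -1), (40, 1), (40, -1)]]
frame = ToyFrame(events)
h_all = frame.book([], "pt")                         # no selection
h_pos = frame.book([lambda e: e["q"] > 0], "pt")     # positive charge only
h_high = frame.book([lambda e: e["pt"] > 20], "q")   # pt above 20 only
frame.run()  # one event loop fills all three
```

In the real code the same role is played by booking RDataFrame.Filter and Histo1D calls before triggering the event loop.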

Implementation

New container classes for output

Added the ResultDict class, which is basically a set of nested dictionaries pointing to RDF.RResultPtr<TH1D> objects created by RDataFrame.Filter(selection).Histo1D(variable) for a given selection key, variable key, and sample key:
```
results_dict = {
  selection1: {
    variable1: {
      sample1: RDF.RResultPtr<T>, # where T = TH1D, etc.
      sample2: RDF.RResultPtr<T>,
      ...
    }
    ...
  }
  ...
}
```
Added the MergedResult class, which is basically a list of booked RDF.RResultPtr<T> objects, so that their resulting values can be added linearly after the event loop (e.g. summing the histograms of a MergedSample with TH1D::Add).
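A minimal sketch of how such containers could fit together, in plain Python. This is an illustration under assumptions, not the TauFW implementation: ToyResult and ToyMergedResult are hypothetical stand-ins (a list of bin contents replaces a TH1D), and the keys "baseline", "m_vis" and "Top" are made up. The only borrowed convention is that a lazy result exposes its value through GetValue(), as RResultPtr does.

```python
from collections import defaultdict

class ToyResult:
    """Stand-in for a lazy result such as RDF.RResultPtr<TH1D>; holds bin contents."""
    def __init__(self, bins):
        self._bins = list(bins)

    def GetValue(self):
        return self._bins

class ToyMergedResult(list):
    """List of booked results whose values are added bin by bin after the event loop."""
    def GetValue(self):
        values = [result.GetValue() for result in self]
        return [sum(column) for column in zip(*values)]

def nested():
    """Nested dictionaries: selection -> variable -> {sample: result}."""
    return defaultdict(lambda: defaultdict(dict))

results = nested()
results["baseline"]["m_vis"]["ZTT"] = ToyResult([1.0, 4.0, 2.0])
results["baseline"]["m_vis"]["Top"] = ToyMergedResult([
    ToyResult([0.5, 1.0, 0.0]),  # first subsample
    ToyResult([0.5, 0.0, 1.0]),  # second subsample
])
merged = results["baseline"]["m_vis"]["Top"].GetValue()  # bin-by-bin sum
```

Because the merged entry offers the same GetValue() interface as a single result, code that reads the nested dictionary does not need to know whether a sample was merged.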

Updated HistSet to simplify:

HistSet.data = hist        # single TH1 histogram of real observed data
HistSet.exp = [hist, ... ] # list of TH1 histograms of SM background expectation
HistSet.sig = [hist, ... ] # list of TH1 histograms of BSM signals

Added HistDict which is similar to ResultDict, but a set of nested dictionaries to collect TH1 histograms via HistSet:
```
hist_dict = {
  selection1: {
    variable1: HistSet,
    variable2: HistSet,
    ...
  }
  ...
}
```
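The regrouping from per-sample results into one HistSet per selection and variable could look roughly like the sketch below. It is an assumption-laden illustration, not the actual HistSet or HistDict code: ToyHistSet, to_hist_dict, the kinds mapping, and all sample names except ZTT are hypothetical, and strings stand in for TH1 histograms.

```python
from dataclasses import dataclass, field

@dataclass
class ToyHistSet:
    data: object = None                       # single observed-data histogram
    exp: list = field(default_factory=list)   # SM background expectation
    sig: list = field(default_factory=list)   # BSM signals

def to_hist_dict(results, kinds):
    """Regroup selection -> variable -> sample -> hist
    into selection -> variable -> ToyHistSet."""
    hist_dict = {}
    for selection, vardict in results.items():
        for variable, sampledict in vardict.items():
            hset = ToyHistSet()
            for sample, hist in sampledict.items():
                kind = kinds[sample]
                if kind == "data":
                    hset.data = hist
                elif kind == "sig":
                    hset.sig.append(hist)
                else:
                    hset.exp.append(hist)
            hist_dict.setdefault(selection, {})[variable] = hset
    return hist_dict

results = {"baseline": {"m_vis": {
    "Observed": "h_obs", "ZTT": "h_ztt", "QCD": "h_qcd", "LQ": "h_lq",
}}}
kinds = {"Observed": "data", "ZTT": "exp", "QCD": "exp", "LQ": "sig"}
hist_dict = to_hist_dict(results, kinds)
hset = hist_dict["baseline"]["m_vis"]
```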

New routines

Added Sample.getrdframe, which creates a single RDataFrame object with a chain of filters for a given list of selections, and books a set of histograms for a given list of variables. Basically:
1. It creates (or reuses) an RDataFrame.
2. It loops over all selections to apply filters with RDataFrame.Filter.
3. Per selection, it loops over all variables and books a histogram per variable.
4. It returns a ResultDict, so that the event loop is fully prepared before it is run.
```
rdframe = RDataFrame(treename,filename)
for selection in selections:
  rdf_sel = rdframe.Filter(selection)
  for variable in variables:
    rdf_var = rdf_sel.Filter(variable.cut)
    result  = rdf_var.Histo1D(hmodel,variable.name,weight) # RDF.RResultPtr<TH1D>
    res_dict[selection][variable][self] = result
```
Added MergedSample.getrdframe:
1. It creates a single RDataFrame object per subsample. If the sample is split into subcomponents (e.g. DY split by genmatch into ZTT, ZL, ZJ), it can reuse an RDataFrame that was previously created for the same file and common event selection.
2. It merges ResultDict objects from the subsamples together, where the RDF.RResultPtr<T> objects for the subsamples are merged into a single MergedResult object per selection/variable. This is done so the resulting histograms can be summed correctly into one single histogram after the event loop is run.
3. Returns this ResultDict of MergedResult objects.
Added SampleSet.gethists_rdf to replace SampleSet.gethists:
1. It creates a single RDataFrame object per file for all samples, and collects the booked histograms in one, single, big ResultDict object using ResultDict.update.
2. Once all RDataFrames and histograms are prepared for all samples, the event loop is triggered with ResultDict.run, which uses RDF.RunGraphs to concurrently run a list of all RDF.RResultPtr<T> in the ResultDict object for optimal performance.
3. The results are converted to histograms and collected into HistSet objects by ResultDict.gethistset.
4. Returns a set of nested dictionaries created by HistDict.results.

Changes to existing code

HistSet (see above) used to be a dictionary of variable pointing to (lists of) histograms.
The QCD_OSSS method now takes a list of variables and a list of selections (instead of just a single selection).

Tools

Some common RDataFrame tools are added to TauFW/common/python/tools/RDataFrame.py, like:
- custom progressbar,
- SetNumberOfThreads calling ROOT.EnableImplicitMT(nthreads),
- AddRDFColumn which can be used to define a new RDataFrame column for a mathematical expression, while ensuring a unique name.
- printRDFReport for printing RDataFrame reports (basically a cutflow table for a bunch of filters/selections)
- ...
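As an illustration of what "ensuring a unique name" can mean for a column defined from an expression, here is one possible scheme in plain Python. This is an assumption, not how AddRDFColumn is implemented: the function unique_column_name, the prefix "_rdf_col", and the numbering rule are all hypothetical.

```python
import re

def unique_column_name(expression, existing, prefix="_rdf_col"):
    """Return a valid column name for `expression` that is not yet in `existing`."""
    # Replace every run of non-word characters by a single underscore.
    stem = re.sub(r"\W+", "_", expression).strip("_") or "expr"
    name = f"{prefix}_{stem}"
    index = 0
    while name in existing:  # append a counter until the name is free
        index += 1
        name = f"{prefix}_{stem}_{index}"
    return name

columns = {"pt_1", "pt_2"}
first = unique_column_name("pt_1+pt_2", columns)
columns.add(first)
second = unique_column_name("pt_1 + pt_2", columns)  # same stem, so it gets a suffix
```

A name generated this way could then be passed to RDataFrame.Define together with the expression.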

Validation

Unit tested and debugged with the test script in TauFW/Plotter/test/testRDataFrame_Sample.py, and compared to the old MultiDraw & Python multithreading. The results look identical for pseudo data and MC samples, and RDataFrame seems to scale better for a larger number of threads (>4) and selections.
I asked @oponcet to help validate with a working example (plotting real data and creating real datacards) if she has time.

Plans

Will leave this PR as an open draft until it's completely validated.
Implement possibility for 2D histograms via Sample.getrdframe and using RDataFrame.Histo2D.
Will completely remove the old (Merged)Sample.gethist and SampleSet.gethists routines that relied on MultiDraw and Python multithreading, and replace them with class methods of the same name that use RDataFrame.
Further clean the TauFW plotting code.
Discuss in the TauPOG here: https://indico.cern.ch/event/1358491/#3-plans-status-of-taufw

Updates

New features:
- Allow the creation of 2D histograms (TH2D) with RDataFrame via gethist2D, using the exact same Sample.getrdframe class method as for 1D histograms (TH1D). This can be done by passing a list of Variable pairs, i.e. a list of 2-tuples of the form [(xvar,yvar), ... ]. A unit test / example is given in Plotter/test/testRDataFrame_Sample.py. If a y variable is passed, the histogram will be booked with RDataFrame.Histo2D(hmodel,xvar,yvar,weight) (as an RDF.RResultPtr<TH2D>) instead of RDataFrame.Histo1D(hmodel,xvar,weight).
- Allow for blinding data in particular ranges of variables (as was implemented before).
- Added the Sample.getsumw class method, which computes the sum-of-weights for one or more selections using RDataFrame via the same Sample.getrdframe method used for histograms. This can be done in parallel with either histograms or means if the keyword argument sumw=True is passed to Sample.getrdframe. The sum-of-weights is added to the ResultDict as an RDF.RResultPtr<double> under the string key 'sumw' for a given selection.
- Added the Sample.getmean class method, which computes the mean of one or more variables (and for one or more selections) using RDataFrame via Sample.getrdframe as well. If the keyword argument mean=True is passed to Sample.getrdframe, the mean of each given variable will be booked with RDataFrame.Mean(xvar,weight) (as an RDF.RResultPtr<double>) as opposed to RDataFrame.Histo1D(hmodel,xvar,weight).
- Added the MeanResult class to Plotter/python/sample/ResultDict.py, which is basically a pair of RDF.RResultPtr<double> results: one for the mean of a variable, and one for the sum-of-weights. This can be used by MergedResult to correctly combine the means of variables of merged samples, taking into account the sum-of-weights of each subsample (i.e. a weighted sum of means).
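The weighted combination of means for merged samples can be written out as a short sketch. The function merge_means and the numbers are hypothetical; plain floats stand in for the two RDF.RResultPtr<double> values of each subsample. The merged mean is the sum of mean times sum-of-weights over the subsamples, divided by the total sum-of-weights.

```python
def merge_means(parts):
    """Combine (mean, sumw) pairs of subsamples into the mean of the merged sample."""
    total_sumw = sum(sumw for _, sumw in parts)
    if total_sumw == 0:
        return 0.0, 0.0  # no weight at all: the mean is undefined, return zeros
    mean = sum(mean * sumw for mean, sumw in parts) / total_sumw
    return mean, total_sumw

# Two subsamples: mean 50 with sum-of-weights 100, and mean 80 with sum-of-weights 300.
mean, sumw = merge_means([(50.0, 100.0), (80.0, 300.0)])
```

Simply averaging the two means would give 65, while the weighted combination gives 72.5, which is why the sum-of-weights has to be carried along with each mean.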
Cleaning:
- Removed the old, deprecated (Merged)Sample.gethist(2D) and SampleSet.gethists class methods based on Python multithreading and MultiDraw. They are now replaced with new RDataFrame methods of the same name (e.g. gethist_rdf → gethist). For debugging and comparison purposes, a branch with the latest version of MultiDraw is stored here: https://github.com/IzaakWN/TauFW/tree/python3_RDF_MultiDraw

Plans

I think this PR is mostly done. What's left is testing with a "real" example, comparing the plots (by plot*.py) and datacard inputs (ROOT files from createcards*.py) between this branch and the current master. If the results are consistent, and there are no bugs, we can merge this PR and create a new release version of the TauFW.

Note that the new (Merged)Sample.gethist(2D), SampleSet.gethists, and SampleSet.getstack methods using RDataFrame should be implemented such that the user does not notice a difference in the output (even though multiple selections are now allowed). This means that user scripts do not need to be updated, unless they want to parallelize over multiple selections.

Implement RDataFrame in Sample and SampleSet for plotting #56
