cms-tau-pog / TauFW

Analysis framework for tau analysis at CMS using NanoAOD

Parallel processing in python 3: `RDataFrame` ? #51

Open IzaakWN opened 8 months ago

IzaakWN commented 8 months ago

Unfortunately, the current parallel processing functionality for creating histograms from trees in SampleSet.gethist broke when switching to Python 3. The segmentation faults seem to be caused by a conflict between how Python and ROOT manage their objects in memory. (The parallel processing is done by (ab)using Python's multithreading.)

As a consequence, I am starting to look into completely redesigning the SampleSet.gethist/MergedSample.gethist/Sample.gethist routines using RDataFrame, which is native to ROOT since v6.14. I will probably make this the default routine, replacing the old one, which is based on Python's multithreading via Plotter/python/plot/MultiThread.py and on MultiDraw.py/MultiDraw.cxx. The latter also shows some unexpected behavior for array branches of variable length.

Besides solving the memory issues, this should be more performant because we can string together multiple instances of RDataFrame (see this section of the class reference):

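from ROOT import RDataFrame, RDF, EnableImplicitMT
EnableImplicitMT() # without implicit multithreading, RunGraphs runs the graphs sequentially
df1 = RDataFrame("tree", "DY.root")
df2 = RDataFrame("tree", "TT.root")
sel = "q_1*q_2<0 && pt_1>50 && pt_2>50"
# Histo1D expects the weight as a column name, so define the product as a new column first
df1_sel = df1.Filter(sel).Define("weight", "genweight*idisoweight")
df2_sel = df2.Filter(sel).Define("weight", "genweight*idisoweight")
# histogram model: (name, title, nbins, xmin, xmax)
res1 = df1_sel.Histo1D(("pt_1_DY", "pt_1", 50, 0, 250), "pt_1", "weight")
res2 = df2_sel.Histo1D(("pt_1_TT", "pt_1", 50, 0, 250), "pt_1", "weight")
RDF.RunGraphs([res1, res2]) # runs the event loops of df1 and df2 concurrently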

and let RDataFrame optimize the parallel processing of many histograms (multiple samples x variables x selections) by itself.

Furthermore, we could even think of processing multiple variables and selections in one go. (The previous setup would only process multiple variables and samples in parallel, but sequentially for selections.)
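As a rough sketch of what "in one go" could look like: every (selection, variable) histogram is booked lazily on the same RDataFrame before any result is accessed, so a single event loop can fill all of them. The helper name `book_histograms`, its arguments, the histogram naming scheme, and the same-sign selection in `main` are hypothetical and not part of TauFW; only `Filter`, `Histo1D` and `GetValue` are RDataFrame calls. Weights are left out for brevity.

```python
def book_histograms(df, selections, variables):
    """Book one histogram per (selection, variable) pair without running the event loop.

    df:         an RDataFrame-like object providing Filter() and Histo1D()
    selections: dict mapping a selection name to a cut string
    variables:  dict mapping a column name to its binning (nbins, xmin, xmax)
    Returns a dict {(selection name, variable name): lazy histogram result}.
    NOTE: hypothetical helper for illustration, not part of TauFW.
    """
    results = {}
    for selname, cut in selections.items():
        df_sel = df.Filter(cut, selname)  # one named filter node per selection
        for var, (nbins, xmin, xmax) in variables.items():
            # histogram model: (name, title, nbins, xmin, xmax)
            model = (f"{var}_{selname}", var, nbins, xmin, xmax)
            results[(selname, var)] = df_sel.Histo1D(model, var)
    return results


def main():
    """Example usage; requires a ROOT installation and the input file."""
    from ROOT import RDataFrame
    df = RDataFrame("tree", "DY.root")  # tree and file names as in the snippet above
    selections = {
        "os": "q_1*q_2<0 && pt_1>50 && pt_2>50",
        "ss": "q_1*q_2>0 && pt_1>50 && pt_2>50",  # assumed second selection
    }
    variables = {"pt_1": (50, 0, 250), "pt_2": (50, 0, 250)}
    results = book_histograms(df, selections, variables)
    # Accessing the first result triggers one event loop that fills all booked histograms.
    return {key: res.GetValue() for key, res in results.items()}
```

Because the helper only relies on duck typing, the booking logic can be checked with a stand-in object before it is wired into `SampleSet`; calling `main()` needs ROOT and the input file.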

IzaakWN commented 8 months ago

The segmentation fault in parallel processing should be solved by PR https://github.com/cms-tau-pog/TauFW/pull/52.

In the coming weeks, I will try to implement RDataFrame in TauFW's SampleSet, similar to this standalone example: https://github.com/cms-tau-pog/TauFW/blob/master/Plotter/test/testRDataFrame.py

IzaakWN commented 7 months ago

Draft PR: https://github.com/cms-tau-pog/TauFW/pull/56

The issue with multithreading and the plans for the RDataFrame implementation were presented and discussed in the TauPOG meeting (18/12/2023) here: https://indico.cern.ch/event/1358491/#3-plans-status-of-taufw