Open IzaakWN opened 1 year ago
The segmentation fault in parallel processing should be solved by PR https://github.com/cms-tau-pog/TauFW/pull/52.
In the next weeks, I will try to implement `RDataFrame` into TauFW's `SampleSet`, similar to this standalone example:
https://github.com/cms-tau-pog/TauFW/blob/master/Plotter/test/testRDataFrame.py
Draft PR: https://github.com/cms-tau-pog/TauFW/pull/56
The issue with multithreading and the plans for the RDataFrame implementation were presented and discussed in the TauPOG meeting (18/12/2023) here: https://indico.cern.ch/event/1358491/#3-plans-status-of-taufw
Unfortunately, the current parallel processing functionality for creating histograms from trees in `SampleSet.gethist` broke when switching to Python 3. The segmentation faults seem to be caused by a conflict between how Python and ROOT handle their objects in memory. (The parallel processing is done by (ab)using Python's multithreading.)

As a consequence, I am starting to look into completely redesigning the `SampleSet.gethist`/`MergedSample.gethist`/`Sample.gethist` routines using `RDataFrame`, which is native to ROOT since v6.14. I will probably make this the default routine, replacing the old routine based on Python's multithreading in `Plotter/python/plot/MultiThread.py` and `MultiDraw.py`/`MultiDraw.cxx`. The latter also has some unexpected behavior for array branches of variable length.

Besides solving the memory issues, this should be more performant because we can string together multiple instances of `RDataFrame` (see this section of the class reference) and let `RDataFrame` optimize the parallel processing of many histograms (multiple samples x variables x selections) by itself. Furthermore, we could even think of processing multiple variables and selections in one go. (The previous setup would only process multiple variables and samples in parallel, but sequentially for selections.)
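
To make the idea concrete, here is a minimal sketch of booking all histograms lazily and then running every event loop together. It is not the TauFW implementation. The helper names `book_histograms` and `run_all`, the default tree name `"tree"`, and the dictionary layouts are assumptions made for illustration. It also assumes that the feature meant by "stringing together" instances is `ROOT.RDF.RunGraphs`, which needs a ROOT release newer than v6.14.

```python
def book_histograms(df, variables, selections, prefix=""):
    """Book one lazy 1D histogram per (selection, variable) pair.

    df         -- an RDataFrame (or any node with Filter/Histo1D)
    variables  -- {column name: (nbins, xmin, xmax)}   (hypothetical layout)
    selections -- {selection name: cut expression}     (hypothetical layout)
    Nothing is read from disk here: Filter and Histo1D only book work.
    """
    booked = {}
    for selname, cut in selections.items():
        df_sel = df.Filter(cut, selname)  # lazy, named filter
        for varname, (nbins, xmin, xmax) in variables.items():
            # A tuple is converted by PyROOT into a TH1DModel
            model = (f"{prefix}{selname}_{varname}", varname, nbins, xmin, xmax)
            booked[(selname, varname)] = df_sel.Histo1D(model, varname)
    return booked


def run_all(samples, variables, selections, treename="tree"):
    """Book histograms for all samples, then run all event loops concurrently.

    samples -- {sample name: ROOT file name}           (hypothetical layout)
    Returns {(sample, selection, variable): RResultPtr<TH1D>}.
    """
    import ROOT  # imported here so book_histograms can be used without ROOT
    ROOT.EnableImplicitMT()  # let ROOT handle the multithreading
    frames = {}   # keep the RDataFrame objects alive until the loops have run
    results = {}
    for sample, fname in samples.items():
        frames[sample] = ROOT.RDataFrame(treename, fname)
        booked = book_histograms(frames[sample], variables, selections,
                                 prefix=f"{sample}_")
        for (selname, varname), hist in booked.items():
            results[(sample, selname, varname)] = hist
    # One event loop per sample, all triggered together
    ROOT.RDF.RunGraphs(list(results.values()))
    return results
```

With this layout, each sample's tree is read once for all variables and selections, instead of once per selection. The returned result pointers can be used like the histograms themselves, or dereferenced with `GetValue()`.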