Mu2e / Offline

Offline software for the Mu2e experiment
Apache License 2.0

Custom objects in TTrees can't be read in Python #186

Closed soleti closed 4 years ago

soleti commented 4 years ago

The Python package uproot is becoming the de facto standard for converting TTrees into Numpy arrays or Pandas dataframes. Unfortunately, it is not able to read custom objects (and I think PyROOT also has this issue). In particular, I am referring to what happens e.g. in ComboHitDiag (but also in other analyzers), where we store positions and directions as XYZVec. In my opinion, we should try to store this type of information as flat objects (for example three floats like pos_x, pos_y and pos_z) or fixed-size arrays (which can be read by uproot) in order to make TTrees as widely accessible as reasonably possible.

Edit: in the past I opened an issue on the uproot github where the author confirmed that there is no easy way to read these objects https://github.com/scikit-hep/uproot/issues/418.

ryuwd commented 4 years ago

uproot_methods can read a small number of ROOT objects like TVector3, which could be written to TTrees and read by uproot instead of XYZVec or Hep3Vector. (https://github.com/scikit-hep/uproot-methods/tree/master/uproot_methods/classes)

(otherwise I agree!)

kutschke commented 4 years ago

Hi Roberto,

Are you speaking about reading our art format event data files, TrkAna files, Stntuple files? All of the above? Something else?

Rob


kutschke commented 4 years ago

Be aware that the TVector family of objects has horrible computing performance. I was on a code review for some DUNE pattern recognition code. It was one of the few pieces of HEP code that I have ever seen with a true computing kernel. The author replaced the use of TVector with something else in about 20 lines of code (leaving the interface as TVector). After this change the code ran 4 times faster. Previously the code spent all of its time in the c'tor and d'tor of TObject even though the code in question made no use of the TObject-ness of these objects.

Rob


brownd1978 commented 4 years ago

Hi Roberto, XYZVec is just our typedef for a ROOT-native (templated) class, not really a custom object. In general it is much less error-prone to store objects directly instead of untyped collections of fundamental types, which need to be translated to/from objects on readback. Can we teach Numpy about these objects? Or maybe request support from the developers?

Dave


soleti commented 4 years ago

Hi everyone, I am speaking about any ROOT TTree output we want to analyze/read outside of art. I discussed this with the developer of uproot and it doesn't seem possible to easily "teach" uproot how to read these objects. Since for positions and directions the only information we really need is the values of x, y, and z, we could just store them as arrays of fixed size. It is true that, as @ryuwd said, uproot is able to read a limited number of ROOT classes, but personally I think we should try to store TTrees that resemble tables of numbers as much as possible. This would enable "last-mile" analyses with basically any tool/language, so users can break free of the ROOT ecosystem (if they want to :) ). Flattening the TTrees also has advantages in terms of speed of vectorized operations with pandas/numpy.

kutschke commented 4 years ago

Hi Roberto,

> Hi everyone, I am speaking about any ROOT TTree output we want to analyze/read outside of art.

All of our art files are stored with the Event TTree maximally split, which means that every float/int/double etc. is its own leaf. How does this differ from what you are asking for? I don't know about the structure of TrkAna and Stntuple files.

Which of these have you looked at?

One of our planned projects is to understand whether changing to a less-than-maximally split file would improve IO performance enough to be interesting.


soleti commented 4 years ago

It is true that they are maximally split, but for example ComboHitDiag stores XYZVec objects, which can't be read by uproot. My proposal is to try to store information with fundamental types as much as reasonably possible. I think in the case of the positions and directions this is reasonable.
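As a rough illustration of what the flat layout allows on the analysis side, here is a minimal sketch. It assumes the position has already been loaded into a pandas DataFrame as three float columns; the column names `pos_x`, `pos_y`, `pos_z` follow the naming suggested earlier in this thread, while the helper functions `magnitude` and `rho` and the toy values are purely hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for flat branches after loading them into pandas.
# The values are made up for illustration only.
hits = pd.DataFrame({
    "pos_x": [3.0, 0.0, 1.0],
    "pos_y": [4.0, 0.0, 2.0],
    "pos_z": [0.0, 5.0, 2.0],
})

def magnitude(df, prefix):
    """Vector length computed from the columns <prefix>_x, <prefix>_y, <prefix>_z."""
    xyz = df[[f"{prefix}_x", f"{prefix}_y", f"{prefix}_z"]].to_numpy()
    return np.linalg.norm(xyz, axis=1)

def rho(df, prefix):
    """Transverse radius sqrt(x^2 + y^2) computed from the flat columns."""
    return np.hypot(df[f"{prefix}_x"].to_numpy(), df[f"{prefix}_y"].to_numpy())

# Both helpers operate on whole columns at once (vectorized).
hits["pos_mag"] = magnitude(hits, "pos")
hits["pos_rho"] = rho(hits, "pos")
```

The cost of this layout is that the vector operations live in user code rather than on a class, so each analysis has to agree on the column naming convention.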

brownd1978 commented 4 years ago

Hi Roberto, there are several design issues here. First, what you are asking for is a translation stage. Translation is error-prone and should be avoided if possible. Second, objects were invented for a reason, namely to keep related info together and provide methods that make sense on the ensemble (like R(), the magnitude of a vector). We give that up if we flatten content to a list of floats. Finally, variable-length branches are essential for some things, like info about individual hits on a track, which can’t be flattened (can uproot or numpy handle that?).

I am sympathetic to making our data accessible to as many tools as possible, but we shouldn’t give up important design considerations. In my opinion being able to add support for classes we need should be a requirement of any analysis tool we decide to support. If core developers don’t want to do this for us, maybe someone in Mu2e needs to work on it. If the core design of uproot or Numpy is such that adding new object support is intrinsically impossible or very difficult, maybe we should look for a different tool.

Dave


soleti commented 4 years ago

Hi Dave,

I think there are two issues here being conflated:

ryuwd commented 4 years ago

> Finally, variable length branches are essential for some things, like info about individual hits on a track, which can’t be flattened (can uproot or numpy handle that?).

In response to this, maybe I can offer an example of how I have been analysing a ROOT Tree with uproot and pandas. I use a Tree in an Analyzer I wrote to produce diag plots for nofield cosmic tracking, and (unrelated) set up an alignment iteration. I think it is in line with the style Roberto is proposing.

My Tree is designed to store information about a track (track-level variables) and the hits on that track (hit-level variables). It has variable-length arrays in each entry. I defined the TTree like this:

https://github.com/ryuwd/Offline/blob/dc2ef5a0f58bd4480714a6f0ab6a3bd3fdad3eec/Alignment/src/AlignTrackCollector_module.cc#L86-L115

https://github.com/ryuwd/Offline/blob/dc2ef5a0f58bd4480714a6f0ab6a3bd3fdad3eec/Alignment/src/AlignTrackCollector_module.cc#L257-L287

In my Python script, I open the ROOT file produced by the module and access the tree like this:

https://github.com/ryuwd/Offline/blob/dc2ef5a0f58bd4480714a6f0ab6a3bd3fdad3eec/Alignment/scripts/aligntrack_display.py#L113

N.B. uproot has tools for reading in chunks, rather than all in one go.

I produce a flattened pandas `DataFrame` with 'multiindexing':

https://github.com/ryuwd/Offline/blob/dc2ef5a0f58bd4480714a6f0ab6a3bd3fdad3eec/Alignment/scripts/aligntrack_display.py#L156

It has this structure, with one subentry per track hit. Track variables have only one value per track but, because the DataFrame is flattened, they are repeated over all subentries of that track's entry:

```
>>> df
                nHits  doca_resid  time_resid  doca_resid_err  ...  ndof    pvalue  panels_trav  planes_trav
entry subentry                                                 ...
0     0            28    0.103550    1.768403        0.107183  ...    23  0.826624           15            9
      1            28   -0.373126   -5.447091        0.107210  ...    23  0.826624           15            9
      2            28   -0.206087   -3.008568        0.136744  ...    23  0.826624           15            9
      3            28    0.420869    7.021034        0.137105  ...    23  0.826624           15            9
      4            28   -0.240347   -3.508711        0.091719  ...    23  0.826624           15            9
                  ...         ...         ...             ...  ...   ...       ...          ...          ...
7568  6            11   -0.017652   -0.305980        0.126198  ...     6  0.987716            6            4
      7            11   -0.192678   -2.990898        0.086304  ...     6  0.987716            6            4
      8            11   -0.287834   -5.078882        0.082330  ...     6  0.987716            6            4
```

A variable like `nHits` will be repeated many times over the subentry rows. This can be mitigated by being aware of the dataset and making sure to select the first subentry per entry when working with such variables, i.e.

```
>>> nHits = df['nHits'][:,0]
>>> nHits
entry
0       28
1       11
2       25
3       23
4       22
        ..
7564    11
7565    17
7566    24
7567    24
7568    11
Name: nHits, Length: 7569, dtype: int32
>>>
```

I can then manipulate the pandas DataFrame, e.g. apply cuts, bin variables into numpy histograms, make plots, etc. For example, I can cut on the `nHits` variable, removing tracks with fewer than 15 hits:

```
>>> better_track_sample = df[df['nHits'] >= 15]
>>> better_track_sample['nHits'][:,0].shape
(3868,) # remaining tracks in this sample
```

Although I calculate the chi-squared in my Analyzer and write it to its own variable, I could in principle use pandas and numpy to calculate it for each track like this:

```
(df['pull_hittime']**2).sum(level=0) / df['ndof'][:,0]
```

where I've summed over the squared pulls on each track hit and divided by the number of degrees of freedom of each track:

```
>>> (df['pull_hittime']**2).sum(level=0) / df['ndof'][:,0]
entry
0    1.270098
1    0.951465
2    3.550912
3    1.875245
4    0.702997
...
>>> df['chisq'][:,0]
entry
0    1.270098
1    0.951465
2    3.550912
3    1.875245
4    0.702997
...
```

Another thing I'd say is that Python and libraries such as matplotlib, uproot and pandas aren't perfect 1:1 replacements for ROOT, and some things that are natural in ROOT are very unnatural in Python (e.g. plotting a pre-binned histogram...).
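For anyone who wants to try the pattern without the alignment ntuple, here is a small self-contained sketch of the same (entry, subentry) layout built from toy data. Only the column names `nHits`, `ndof` and `pull_hittime` come from the snippets above. The values, the `tracks` list and the cut at 3 hits are made up for illustration. It also uses `groupby(level=...)` in place of `sum(level=0)`, because newer pandas releases no longer accept the `level` argument to `sum`.

```python
import pandas as pd

# Toy stand-in for a tree with one entry per track and a variable number
# of hits per track. All values here are invented for illustration.
tracks = [
    {"nHits": 3, "ndof": 2, "pull_hittime": [1.0, -1.0, 2.0]},
    {"nHits": 2, "ndof": 1, "pull_hittime": [0.5, -0.5]},
]

# Flatten to one row per hit; track-level values are repeated on every row.
rows = []
for entry, trk in enumerate(tracks):
    for subentry, pull in enumerate(trk["pull_hittime"]):
        rows.append({
            "entry": entry,
            "subentry": subentry,
            "nHits": trk["nHits"],
            "ndof": trk["ndof"],
            "pull_hittime": pull,
        })
df = pd.DataFrame(rows).set_index(["entry", "subentry"])

# Track-level variables: pick the first subentry of every entry.
n_hits = df["nHits"].xs(0, level="subentry")
ndof = df["ndof"].xs(0, level="subentry")

# Sum of squared pulls per track, divided by that track's ndof.
chisq_per_ndof = (df["pull_hittime"] ** 2).groupby(level="entry").sum() / ndof

# Cut on a track-level variable: keep only tracks with at least 3 hits.
good = df[df["nHits"] >= 3]
```

Here `xs(0, level="subentry")` plays the role of the `[:,0]` selection used above, and the division lines up track by track because both sides are indexed by `entry`.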