Consistency on ROOT file interfaces

Currently, we rely heavily on pyROOT to do our ROOT access. If #175 is accomplished, we should consider using rootpy to serialize our event class. This may involve improving rootpy. Otherwise, I propose that we develop a standalone tool for serializing arbitrary Python classes. This should be an open source project in its own right. I hope that it uses 'typing' in the new Python 3.5, which should make introspection of the Python class easier. I envision the following code:

    import tfile

    class Event:
        ...

    tfile.setup('something.root', Event)

    foo = Event()
    foo.bar = 3

    tfile.write(foo)

    tfile.close()

It doesn't have to use Python's typing, but that seems like what it is for. The main requirement is that there is some way of implementing the logic above that doesn't suck (like it does in my current root_output branch).
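As a rough illustration of how type hints could drive such a tool, the sketch below reads a class's annotations with `typing.get_type_hints` and maps each field to a ROOT leaf-type code. The `Event` fields, the `LEAF_CODES` table and `branch_schema` are hypothetical names made up for this example; they are not part of pax or of any existing library, and class-level variable annotations need Python 3.6 or later.

```python
from typing import get_type_hints

# Assumed mapping from Python types to ROOT leaflist type codes
# (I = Int_t, D = Double_t, O = Bool_t).
LEAF_CODES = {int: "I", float: "D", bool: "O"}


class Event:
    # Hypothetical fields, for illustration only.
    event_number: int = 0
    total_area: float = 0.0
    is_valid: bool = True


def branch_schema(cls):
    """Return {field name: leaf descriptor} for every annotated field."""
    return {
        name: "{}/{}".format(name, LEAF_CODES[hint])
        for name, hint in get_type_hints(cls).items()
    }


schema = branch_schema(Event)
print(schema)
```

A writer could walk such a schema once at setup time and create one branch per field, instead of asking the user to declare the branches by hand.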

I'm getting more interested in this as I'm slowly learning ROOT. I've collected some questions that came to mind (some probably stupid, sorry) about the output format that may be of interest to @pdeperio and @malfonsi too.

From what I've read so far it seems a tree with split branches matching our datastructure is what we're after. Is this what you're thinking about too?
Variable-length fields. Will we need to store a fixed number of e.g. peaks, interactions, etc. per event, or is it possible to work with variable-length fields? The reason we currently (have to) use a relational structure in the output is that variable-length fields are impossible in numpy (and therefore pandas, if I understand correctly)... which makes me wonder how root_numpy works, but that's another matter.
Multiple access. Peaks are currently accessible in three ways: directly from the event (event.peaks[0]), through the interaction objects (event.interactions[0].s1) and through helper methods (event.s1s(detector='tpc', sort_key='n_contributing_channels')). Will all three survive in the ROOT file?
Helper methods. Likewise there are helper methods in the interaction objects (s1_corrected_area) and the reconstructed position (r, phi) that have very simple definitions. Will these be implemented in the ROOT file? If so, does there have to be separate C++ and Python code?
Storage of large fields. Some fields, especially the per-channel info in peaks (area_per_channel, hits_per_channel, ...), take muuuuch more space than others. Files with these fields are very large (a few GB for a XENON100 dataset, versus just ~100 MB without them). We currently make "light" files, where these large fields are removed in post-processing; these are very handy for transferring data to your own laptop. How will we deal with this in the ROOT output? Is it worth storing these in a separate tree, or even file, which is friended on load-in?
Storage of hits. Related to this, in some cases it is useful to write out the super-low-level info about hits and pulses. Is there a smarter way to do this than just making humongous files containing everything?
Peak-level cuts. This question probably just reflects my ignorance of ROOT. When analyzing, I would often like to plot some properties (e.g. area, width) of all peaks in all events that match certain criteria. Is there a convenient way to do this in ROOT, or will I have to first load all peaks by looping over all events?
Read code. Finally, I'd just like to stress the importance of having read as well as write ability for the format in pax. This allows you to reprocess very quickly (and without access to the raw data) if you change e.g. the S1-S2 classification, the spatial corrections, electron lifetime, or pairing in interaction objects.
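For readers unfamiliar with the relational structure mentioned under variable-length fields, here is a minimal sketch of the idea: a varying number of peaks per event is flattened into fixed-width rows keyed by the event number. The field names and values are made up for the example and do not reflect the actual pax table layout.

```python
# Two events with a different number of peaks each (made-up values).
events = [
    {"event_number": 0, "peaks": [{"area": 12.5}, {"area": 340.0}]},
    {"event_number": 1, "peaks": [{"area": 7.1}]},
]


def flatten_peaks(events):
    """Turn nested, ragged peak lists into (event_number, peak_index, area) rows."""
    rows = []
    for event in events:
        for index, peak in enumerate(event["peaks"]):
            rows.append((event["event_number"], index, peak["area"]))
    return rows


peak_table = flatten_peaks(events)
print(peak_table)
```

Every row has the same width, which is what a numpy record array needs; the price is that the event-to-peak nesting has to be rebuilt with a join on the event number.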

Good barrage of questions. I'll say something more general than answer your questions.

As you know, we have an 'eScience engineer' who's working on this as part of a Path Finding Grant. But what is the path? Most of the technical things you ask about are already implemented in ROOT or should be straightforward to implement. The 'path' is that ROOT is crippled in Python (especially 3), and since nobody in the world outside of HEP does analytics in C++, this has limited our ability to use new tech (making the learning curve less steep is also useful). However, there is a killer feature of ROOT that means we want to use it as a backend for analytics: column stores. This is the same thing the MonetDB people at CWI look at, but in database form (i.e. the huge DTEC grant and others). The question, to be more specific, is then: can we make doing HEP analytics look the same whether it is on MongoDB, MonetDB, a ROOT file, HDF5, whatever? There are a few tools we/Daniella had to improve and fix before we could even answer that question.

Now point by point.

Correct. Part of the work is to replace datamodels such that we can serialize our event class as is to and from ROOT TTrees.
Yes, easy to do. This is a TVector of classes. It's variable length.
Class variables will remain; methods may come later.
We have to patch in any class methods we want to ship with the ROOT file. This is partly why I want the class methods to stay simple in the event class. They'll be defined in C++, but accessible by Python.
You can store them as new branches if you want. The killer feature of ROOT files is the TBaskets. Every variable is stored together on disk, which means you can think of every branch as another file. This greatly simplifies the situation you're describing and is a big benefit compared to hdf5.
The hits issue is more general than ROOT output: we shouldn't store hits anywhere except in difficult cases or in debugging. It's just information density: you can't store all that info.
TTree::Draw has a query syntax similar to SQL queries. You specify "ev.S2.area > 150", then it loads just the S2 area field and does the looping for you. We may make it possible to hide this behind a pandas-like interface.
We are doing serialization, which of course includes reading back in.
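The column-store point can be illustrated without ROOT. In the sketch below each "branch" is its own contiguous column, so a cut on one field only scans that column and never touches the bulky per-channel data. The branch names, the values and the `select` helper are assumptions for this example; this is not how ROOT or TTree::Draw is implemented.

```python
from array import array

# Each "branch" is stored as its own column (made-up values).
branches = {
    "s2_area": array("d", [120.0, 480.0, 95.0, 210.0]),
    "s2_width": array("d", [1.1, 2.3, 0.9, 1.7]),
    "area_per_channel": [[0.0] * 8 for _ in range(4)],  # bulky column
}


def select(branches, column, predicate):
    """Scan a single column and return the indices of matching entries."""
    return [i for i, value in enumerate(branches[column]) if predicate(value)]


passing = select(branches, "s2_area", lambda area: area > 150)
widths = [branches["s2_width"][i] for i in passing]
print(passing, widths)
```

With a row-oriented layout the same cut would have to read every full event record, including the per-channel arrays, just to look at one number.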

Hi Jelle et al., this is the information that I have collected over the last months. It is based mostly on the ROOT Users' Manual chapters "Input/Output", "Trees", "Adding a Class" and the example "Event" class that you can find under the test subdirectory of a ROOT installation. This example looks like the "suggested way" from the ROOT team, but I am not sure if and how this will change in the near future.

Split branches. When you create a TBranch for an object in a TTree, you can ask to split the object further into branches, descending through the hierarchy of embedded objects up to a certain level. So it is a shortcut (which also facilitates maintenance, extension, etc.) compared to declaring branches of member objects one by one, even if I am not sure whether the two are fully equivalent. A single branch has its own “memory buffer” where data is loaded from the ROOT file, and, afaik, branch data are written sequentially in the same “block” of the ROOT file(s). This is usually what you want for performance, because you typically want to read from disk, possibly in big chunks, only the part of the object that you really need.

Variable-length arrays. You can have them. The simplest way is to have another member variable that reports the number of elements per event, e.g.:

    class MyMainObject {
      Int_t nElements;  // number of elements stored for this event
      MyEmbeddedObject* pointerToArrayOfEmbeddedObjects; //[nElements]
    };

The comment field matters, as it is used by rootcling to write the specific Streamer() method for your class. This method is used for serialisation, and you can customise it if you need to.
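The counter-plus-array idea is independent of ROOT. The sketch below uses Python's `struct` module to write the element count first and then that many doubles, and reads them back the same way. It only illustrates the principle; it is not ROOT's on-disk format, and `pack_event` and `unpack_event` are hypothetical helper names.

```python
import struct


def pack_event(values):
    """Write the element count first, then that many doubles."""
    return struct.pack("<i{}d".format(len(values)), len(values), *values)


def unpack_event(buffer):
    """Read the count, then use it to size the array that follows."""
    (n_elements,) = struct.unpack_from("<i", buffer, 0)
    return list(struct.unpack_from("<{}d".format(n_elements), buffer, 4))


payload = pack_event([1.5, 2.5, 4.0])
restored = unpack_event(payload)
print(len(payload), restored)
```

The reader never needs to know the array length in advance, which is the role `nElements` plays for the pointer member above.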

You can also use e.g. an STL vector or similar (but I think that TVector is not meant for this purpose; it is part of the linear algebra package). However, for arrays of identical objects the suggested approach, for performance reasons (but with cons, too), still seems to be the TClonesArray. Basically, it makes sure that the space allocated in memory for your objects is reused for every event, avoiding time-expensive memory allocation/deallocation. When you dig more into the ROOT code, this looks like a strategy widely used for most objects saved in a TTree.
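The reuse strategy described here can be sketched in plain Python as a small object pool: objects are allocated once and handed out again for the next event instead of being recreated. `ReusablePool` and `Peak` are hypothetical names for this illustration; this is not the TClonesArray API.

```python
class Peak:
    def __init__(self):
        self.area = 0.0


class ReusablePool:
    """Keep allocated objects between events and hand them out again."""

    def __init__(self, factory):
        self._factory = factory
        self._slots = []
        self.size = 0  # number of slots in use for the current event

    def clear(self):
        # Forget the contents but keep the allocated objects.
        self.size = 0

    def next_slot(self):
        if self.size == len(self._slots):
            self._slots.append(self._factory())  # allocate only when needed
        slot = self._slots[self.size]
        self.size += 1
        return slot


pool = ReusablePool(Peak)
first = pool.next_slot()
first.area = 10.0
pool.clear()  # start of the next event
second = pool.next_slot()
second.area = 20.0
print(first is second)
```

One of the cons is visible here too: a reused slot still holds the previous event's values until every field is overwritten.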

Multiple Access. This is done by TRef. The object is written only once and the "pointers" to it in other objects get a reference number. A TBranchRef in the TTree makes the loading of the referred branch happen automatically. In principle with TRef you can also call some code that loads the necessary information (e.g. querying the DB for the "hits" info not present in the ROOT file? In principle this is possible...)
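A minimal sketch of the store-once, reference-elsewhere idea, using list indices as the reference numbers. The data layout and the `resolve` helper are assumptions for this example and are not the TRef interface.

```python
# Peaks are stored exactly once (made-up values).
peaks = [{"type": "s1", "area": 30.0}, {"type": "s2", "area": 900.0}]

# Interactions hold reference numbers (indices into `peaks`), not copies.
interactions = [{"s1": 0, "s2": 1}]


def resolve(interaction, key, peaks):
    """Follow a reference number back to the stored peak."""
    return peaks[interaction[key]]


s1 = resolve(interactions[0], "s1", peaks)
s2 = resolve(interactions[0], "s2", peaks)
print(s1, s2)
```

Both access paths, event to peak and interaction to peak, end up at the same stored object, so nothing is written twice.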

Helper methods. Not sure if this answers the question completely, but any member function of the class is available, provided that the "class library" is made accessible (e.g. loaded in the ROOT interpreter or the pyROOT equivalent - I am not a python/pyROOT expert - or linked somehow to an executable). I think (but I may be biased) that the easiest way is to port the "object" into a C++ class, because you can easily import it in Python, while I am not sure about the other way around.

Storage of large fields + Storage of hits. I agree that a friended TTree is a solution. As I said above, a TRef can execute some code to retrieve the needed information, e.g. by accessing the DB, but you probably want to resort to this only for very rarely required information.
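Conceptually, a friend tree is a second set of columns aligned entry by entry with the main one and attached only when needed. The sketch below mimics that with two dictionaries of columns; the column names and the `with_friend` helper are hypothetical and this is not the ROOT friend mechanism itself.

```python
# "Light" columns that everybody wants, and "heavy" columns kept separately.
light = {"event_number": [0, 1, 2], "s2_area": [120.0, 480.0, 95.0]}
heavy = {"area_per_channel": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]}


def with_friend(main, friend):
    """Combine columns entry by entry; both must have the same length."""
    n_entries = len(next(iter(main.values())))
    for name, column in friend.items():
        if len(column) != n_entries:
            raise ValueError("friend column {!r} has a different length".format(name))
    combined = dict(main)
    combined.update(friend)
    return combined


full = with_friend(light, heavy)
print(sorted(full))
```

The light columns can be copied to a laptop on their own, and the heavy ones attached later only where they are available.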

Peak-level cuts. TTree::Draw does that. The syntax gives you a lot of possibilities and covers most of the required cases. This is something expected to change with version 6.06 (maybe November): so far there was a class, TTreeFormula, that parsed and interpreted the expression; in the next version the expression will be given to the Just-In-Time compiler of ROOT. I cannot find any precise information so far. Anyway, TTree::Draw does not (or, if you write your own code, you should not) collect all peaks from all events before looking into them; for each event it just extracts the required properties for the peaks of that event. Splitting your objects into more branches allows TTree::Draw (or your own code) to effectively read from file only the required information.
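The event-by-event pattern described here looks roughly like the generator below, which keeps one event in memory at a time and yields only the requested property of the peaks that pass the cut. `read_events` is a stand-in for reading from a file, and all names and values are made up for the example.

```python
def read_events():
    """Stand-in for reading events one at a time from a file."""
    yield {"peaks": [{"area": 12.0, "width": 1.0}, {"area": 300.0, "width": 2.5}]}
    yield {"peaks": [{"area": 800.0, "width": 3.0}]}
    yield {"peaks": []}


def selected_widths(events, min_area):
    """Yield the width of every peak above the area cut, event by event."""
    for event in events:  # only one event is held in memory at a time
        for peak in event["peaks"]:
            if peak["area"] > min_area:
                yield peak["width"]


widths = list(selected_widths(read_events(), min_area=150))
print(widths)
```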

Read code. Once you can write, you can also easily read.

@malfonsi and @tunnell, thanks for your answers! I think I'm slowly understanding why ROOT would be such a nice format to have. If we can just overcome these final technical challenges we should be able to do some nice analyses with it soon.

@tunnell and @remenska:

If I'm right, the approach you're pursuing now is to replace pax's datastructure with a rootpy TreeModel. Might I suggest we try to create an ordinary output plugin for ROOT instead? We have done this for all output formats up until now, to avoid major surgery on the pax internals when output preferences change (which happens, as evidenced by our large number of output formats). If you do go the datastructure-replacement route, there are a couple of problems you should be aware of:

While it's true we can ditch a small amount of code from data_model.py (type checking), we have to find another home for the rest of the functionality. For example: the flexible init (from kwargs, dict, init submodels from nested dictionaries, ...) used throughout pax, the conversion to dictionaries (used in BSON conversion) and a few helper methods like get_fields_info used by the tabular output.
If the way of accessing the data objects changes (e.g. instead of creating instances of Peak you have to assign to attributes of some tree in a loop, then call .fill() after each instance), we'll have to change all the plugins that make use of this.
There are two places (TableWriter, XerawdpImitation) where nefarious hacks are used to work around type checking in the datastructure for arcane purposes -- these will have to be fixed properly or replaced by even more nefarious hacks.
The length of some arrays (like area_per_channel) is fixed (for a given TPC) but not known when the class is declared (because it comes from the config). You may have to wait until the first instance exists before building/finishing the ROOT class declaration.
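To make the first point concrete, here is a toy model with the kind of behaviour that would need a new home: initialisation from keyword arguments or a dictionary, sub-models built from nested dictionaries, and conversion back to a dictionary. This is a simplified sketch written for this comment, not the code in data_model.py; `Model`, `nested` and `get_fields` are hypothetical names.

```python
class Model:
    """Tiny stand-in for a data model with flexible init and to_dict."""

    nested = {}  # field name -> Model subclass, for fields built from dicts

    def __init__(self, source=None, **kwargs):
        values = dict(source or {}, **kwargs)
        for name, value in values.items():
            if name not in self.get_fields():
                raise AttributeError("unknown field: " + name)
            if name in self.nested and isinstance(value, dict):
                value = self.nested[name](value)  # build the sub-model
            setattr(self, name, value)

    @classmethod
    def get_fields(cls):
        """Names of the class-level defaults declared on this model."""
        return sorted(
            name for name, value in vars(cls).items()
            if not name.startswith("_") and name != "nested" and not callable(value)
        )

    def to_dict(self):
        result = {}
        for name in self.get_fields():
            value = getattr(self, name)
            result[name] = value.to_dict() if isinstance(value, Model) else value
        return result


class Position(Model):
    x = 0.0
    y = 0.0


class Event(Model):
    nested = {"position": Position}
    event_number = 0
    position = None


event = Event({"position": {"x": 1.5}}, event_number=7)
as_dict = event.to_dict()
print(as_dict)
```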

I've added some example code for a root output plugin in the root_output branch (see plugins/io/ROOT.py). I stole some stuff from the notebook @remenska posted on gitter and from rootpy's examples. You can test it with

paxer --output my_rootfile --stop_after 2 --output_type root

(if it doesn't work it's probably because I'm on python2 + Windows :-). The output is very basic (only some event fields) but I think the idea is clear...

... though I wonder if we can do this thing in rootpy at all, regardless of how we do it. I couldn't find any mention of TClonesArrays in rootpy, nor anything on the creation of split branches. It looks like we want the combination of these, so we'll either have to contribute this to rootpy, find something equivalent that rootpy does offer, or not use rootpy.

If you have time for the first, that would obviously be nice; rootpy has a nice API we could then also use for reading the data back in. However, if we then want some advanced feature later, we'd have to go back and change rootpy again. I believe rootpy is essentially a wrapper around pyROOT (it's pure python after all, which is why it works on windows :-) to make some common stuff more pythonic; but it seems we're trying to do some advanced things the wrapper can't yet do.

I never thought I would say this, but it seems to me the easiest solution is to just autogenerate a C++/CINT/whatever class, load it with ROOT.gROOT.ProcessLine, then fill and write the tree with pyROOT. You can steal examples on how to do all the nice things we want from the ROOT users guide and maybe other HEP code. pyROOT exposes the usual ROOT API, which is very well documented and supports all the advanced features.
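A minimal sketch of the autogeneration step, assuming a simple list of (C++ type, field name, is-per-channel) tuples as input. The `cpp_struct` helper and the field list are hypothetical; the array length is taken from a config value at run time, which is the situation described for area_per_channel.

```python
def cpp_struct(name, fields, n_channels):
    """Build one-line C++ source for a plain struct from field descriptions."""
    parts = ["struct {} {{".format(name)]
    for ctype, field, per_channel in fields:
        suffix = "[{}]".format(n_channels) if per_channel else ""
        parts.append("{} {}{};".format(ctype, field, suffix))
    parts.append("};")
    return " ".join(parts)


config = {"n_channels": 4}  # made-up value; would come from the detector config
source = cpp_struct(
    "Peak",
    [("Double_t", "area", False), ("Double_t", "area_per_channel", True)],
    config["n_channels"],
)
print(source)
# With PyROOT available, this string could then be passed to
# ROOT.gROOT.ProcessLine(source) to declare the struct.
```

Because the declaration is only generated once the config is known, the fixed array length is not a problem with this approach.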

(I know from @tunnell pyROOT is supposed to be mythically slow, but if that's true rootpy is not better (unless we could somehow use root_numpy on our advanced tree (which I doubt since numpy doesn't support ragged arrays)). Even then, as long as you don't write hits, output will remain a pretty small part of the total processing work).

Whatever you end up doing I'm willing to help, especially with pax integration (I don't know much ROOT and have little C++ experience).

Replacing our data_models with the rootpy datamodels at the moment is the simplest, most standard, and probably best way of doing this. As I explained yesterday, some of these concerns are ungrounded and may result in a quite expensive 'conversion' step. At the end of the day, if you convince @remenska that your way is best, then she'll implement it the way she feels is best.

Nevertheless, I will quickly go point by point:

Solved: We will probably subclass TreeModel, and there's still introspection for to_dict etc.
Not an issue: .fill() is one way, but I hope our event API doesn't change. This should be possible.
Your concern is that we'll have to properly fix nefarious hacks?
Not an issue: variable-length arrays.

What are all these comments about TClonesArray? Day 1: data in ROOT file. Iterate from there (think Agile).

We're improving rootpy along the way for everybody's benefit.

We did the "Python writes C++ class text file that gets loaded" thing. It's sitting in a branch of mine. It's a terrible way to do this. The point of this work is to find a better way.

root_numpy is actually part of rootpy. Both are better in terms of speed.

Your help is certainly appreciated and will be useful. We are thinking before we work though. :P

The tighter we integrate the output into the core datastructure of pax, the less chance we have of ever having huge delays from an S1 misordering problem again. Save what we measure instead of a somehow converted version.

@JelleAalbers Thanks for these points. If I understood correctly, your concerns are around breaking the current data_model.py functionality, and variable-length arrays (which should be possible with rootpy, as soon as I figure out why STL dictionary generation fails when compiling; this was one of the rootpy issues to be looked into and fixed eventually). @tunnell's idea to subclass rootpy's TreeModel should solve the first one. Sorry, I was mostly playing around with rootpy's possibilities, ignoring your pax (Strict)Model, just for the time being.

I'm not yet clear how/if referencing other trees/branches will work in rootpy though (e.g., point to a list of Interaction/Peak/etc objects from within Event, or if that's even necessary)

Maybe a stupid question, but if the pax data structure is reflected in the ROOT output via rootpy, what is the advantage of using a separate plugin instead to build the rootpy model? If I understand your example, you're doing similar stuff, just building the Event class suitable for rootpy from the pax Event one. Good point, we will need to take care of conversions between python and rootpy columns in a cleaner way.

I thought the whole idea was that PyROOT is not pythonic enough, so why is it on the table again? :P It is worrying if you have evidence that PyROOT is mythically slow, though; that's news to me.

@tunnell / @remenska any objections to closing this old issue?

XENON1T / pax

Consistency on ROOT file interfaces #176