XENON1T / pax

The XENON1T raw data processor [deprecated]
BSD 3-Clause "New" or "Revised" License

Consistency on ROOT file interfaces #176

Closed tunnell closed 8 years ago

tunnell commented 9 years ago

Currently, we rely heavily on pyROOT to do our ROOT access. If #175 is accomplished, we should consider using rootpy to serialize our event class. This may involve improving rootpy. Otherwise, I propose that we develop a standalone tool for serializing arbitrary Python classes. This should be an open source project in its own right. I hope that it uses 'typing' in the new Python 3.5, which should make introspection of the Python class easier. I envision the following code:

    import tfile

    class Event():
        ...

    tfile.setup('something.root', Event)

    foo = Event()
    foo.bar = 3

    tfile.write(foo)
    tfile.close()

It doesn't have to use Python's typing, but that seems like what it is for. The main requirement is that there is some way of implementing the logic above that doesn't suck (like it does in my current root_output branch).
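For illustration, here is a minimal sketch of the introspection step only, not of the proposed tfile tool. It assumes class-level annotations (the `name: type` syntax arrived in Python 3.6) and maps them to ROOT's leaf-list type codes. `LEAF_CODES`, `branch_descriptors` and the `Event` fields are hypothetical names chosen for this example.

```python
from typing import get_type_hints

# ROOT leaf-list type codes: I = 32-bit int, D = 64-bit float, O = bool.
# Assumption: Python ints fit in 32 bits; ROOT's "L" would be the 64-bit code.
LEAF_CODES = {int: "I", float: "D", bool: "O"}


class Event:
    """Toy event class; the fields are illustrative, not pax's data model."""
    event_number: int
    s1_area: float
    is_valid: bool


def branch_descriptors(cls):
    """Return {field: 'field/CODE'} leaf-list strings for annotated fields."""
    descriptors = {}
    for name, hint in get_type_hints(cls).items():
        if hint not in LEAF_CODES:
            raise TypeError(f"no ROOT leaf code known for {name}: {hint!r}")
        descriptors[name] = f"{name}/{LEAF_CODES[hint]}"
    return descriptors


print(branch_descriptors(Event))
```

A writer could use strings like these when creating branches, so the class definition itself would be the only schema that needs maintaining.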

tunnell commented 9 years ago

https://github.com/rootpy/rootpy/issues/35 and https://github.com/rootpy/rootpy/issues/642

JelleAalbers commented 8 years ago

I'm getting more interested in this as I'm slowly learning ROOT. I've collected some questions that came to my mind (some probably stupid, sorry) about the output format that may be of interest to @pdeperio and @malfonsi too.

tunnell commented 8 years ago

Good barrage of questions. I'll say something more general than answer your questions.

As you know, we have an 'eScience engineer' who's working on this as part of a Path Finding Grant. But what is the path? Most of the technical things you ask about are already implemented in ROOT or should be straightforward to implement. The 'path' is that ROOT is crippled in Python (especially 3), and since nobody in the world outside of HEP does analytics in C++, this has limited our ability to use new tech (making the learning curve less steep is also useful). However, there is a killer feature of ROOT that means we want to use it as a backend for analytics: column stores. This is the same thing that MonetDB from CWI looks at, but in database format (i.e. huge DTEC grant and others). The question then becomes more specific: can we make doing HEP analytics look the same whether it be on a MongoDB, MonetDB, ROOT file, HDF5, whatever? There are a few tools we/Daniella had to improve and fix before we could even answer that question.

Now point by point.

malfonsi commented 8 years ago

Hi Jelle et al., This is the information that I collected in the last months. It is based especially on the ROOT Users' Manual chapters "Input/Output", "Trees", "Adding a Class" and the example "Event" class that you can find under the test subdirectory of a ROOT installation. This example looks like the "suggested way" from the ROOT team, but I am not sure if and how this will change in the near future.

Split branches. When you create in a TTree a TBranch of an object, you can ask to split the object further in branches descending through the hierarchy of embedded objects, up to a certain level. So it is a shortcut (which also facilitates maintenance, extension, etc... ) wrt declaring one by one branches of member objects, even if I am not sure if the two are fully equivalent. A single branch has its own “memory buffer” where data is loaded from ROOT file, and, afaik, branch data are written sequentially in the same “block” of the ROOT file(s). This is usually what you want for performance, because you typically want to read from disk possibly in big bunches only the part of the object that you really need.
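To picture what a split level does, here is a rough pure-Python sketch. It assumes nested dataclasses stand in for embedded objects; the class names, the dotted branch names and `split_branches` are hypothetical and do not reproduce ROOT's actual branch naming.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class Position:
    x: float = 0.0
    y: float = 0.0


@dataclass
class Peak:
    area: float = 0.0
    position: Position = field(default_factory=Position)


@dataclass
class Event:
    number: int = 0
    main_peak: Peak = field(default_factory=Peak)


def split_branches(obj, prefix, split_level):
    """List branch names, descending at most split_level levels into obj."""
    if split_level <= 0 or not is_dataclass(obj):
        return [prefix]  # stored as one unsplit branch
    names = []
    for f in fields(obj):
        child = getattr(obj, f.name)
        names.extend(split_branches(child, f"{prefix}.{f.name}", split_level - 1))
    return names


for level in (0, 1, 2):
    print(level, split_branches(Event(), "event", level))
```

With a higher level, a reader that only needs `event.main_peak.area` can leave the buffers of the other branches untouched.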

Variable length arrays. You can have them. The simplest way is to have another member variable that reports the number of elements per event, e.g.:

    class MyMainObject {
      Int_t nElements;
      MyEmbeddedObject* pointerToArrayOfEmbeddedObjects; //[nElements]
    };

The comment field matters, as it is used by rootcling to write the specific Streamer() method for your class. This method is used for serialisation, and you can customise it if you need to.
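The count-plus-array idea can be pictured with plain Python lists. This is only a sketch of the layout, assuming one list of values per event; `flatten_events`, `unflatten_events` and the sample numbers are made up for the example and are not ROOT calls.

```python
def flatten_events(events):
    """Turn per-event lists into a counts list plus one flat value list."""
    counts = [len(values) for values in events]
    flat = [v for values in events for v in values]
    return counts, flat


def unflatten_events(counts, flat):
    """Rebuild the per-event lists from counts and the flat value list."""
    events, start = [], 0
    for n in counts:
        events.append(flat[start:start + n])
        start += n
    return events


peak_areas = [[10.5, 3.2], [], [7.0]]  # three events with 2, 0 and 1 peaks
counts, flat = flatten_events(peak_areas)
print(counts, flat)
```

The counts play the role of `nElements`: without them the flat values cannot be assigned back to their events.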

You can also use e.g. an STL vector or similar (but I think that TVector is not meant for this purpose, it is part of the linear algebra package). However, for arrays of identical objects the suggested approach, for performance reasons (but with cons, too), still seems to be the TClonesArray. Basically it makes sure that the space allocated in memory for your objects is reused every event, avoiding time-expensive memory allocation/deallocation. When you dig more into the ROOT code, this looks like a strategy widely used for most objects saved in a TTree.
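The reuse pattern itself can be sketched in a few lines of Python. This is an illustration of the idea only, not of the TClonesArray API; `ReusablePool` and `Peak` are hypothetical names, and in CPython the speed benefit would be much smaller than in C++.

```python
class Peak:
    def __init__(self):
        self.area = 0.0


class ReusablePool:
    """Keeps constructed objects alive between events and hands them out again."""

    def __init__(self, factory):
        self._factory = factory
        self._slots = []   # objects constructed so far
        self._used = 0     # how many slots the current event uses

    def clear(self):
        """Start a new event without destroying the existing objects."""
        self._used = 0

    def next_slot(self):
        """Return the next free object, constructing one only if needed."""
        if self._used == len(self._slots):
            self._slots.append(self._factory())
        obj = self._slots[self._used]
        self._used += 1
        return obj

    def __len__(self):
        return self._used


pool = ReusablePool(Peak)
first = pool.next_slot()      # event 1: constructs a Peak
pool.clear()
again = pool.next_slot()      # event 2: the same Peak object is handed out
print(first is again)
```

One of the cons is visible here too: a reused object still carries the previous event's values until the caller overwrites them.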

Multiple Access. This is done by TRef. The object is written only once and the "pointers" to it in other objects get a reference number. A TBranchRef in the TTree makes the loading of the referred branch happen automatically. In principle with TRef you can also call some code that loads the necessary information (e.g. querying the DB for the "hits" info not present in the ROOT file? In principle this is possible...)

Helper methods. Not sure if this answers the question completely, but any member function of the class is available, provided that the "class library" is made accessible (e.g. loaded in the ROOT interpreter or the pyROOT equivalent - I am not a python/pyROOT expert - or linked somehow to an executable). I think (but I can be biased) that the easiest way is to port the "object" into a C++ class, because you can easily import it in python, while I am not sure of the other way around.

Storage of large fields + Storage of hits. I agree that a friend-ed TTree is a solution. As I said above, a TRef can execute some code to retrieve the needed information, e.g. accessing the DB, but you probably want to resort to this only for very rarely required information.

Peak-level cuts. TTree::Draw does that. The syntax gives you a lot of possibilities to cover most of the required cases. This is something expected to change with version 6.06 (maybe November): so far there was a class, TTreeFormula, that parsed and interpreted the expression; in the next version the expression will be given to the Just-In-Time compiler of ROOT. I cannot find any precise information so far. Anyway, TTree::Draw does not (or, if you write your own code, you should not) collect all peaks from all events before looking into them; for each event it just extracts the required properties for the peaks of that event. Splitting your objects into more branches allows TTree::Draw (or your own code) to effectively read from file only the required information.
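If you write your own loop, the per-event behaviour described above could look roughly like this sketch. It assumes events are plain dicts holding a list of peak dicts; `select_peak_values` and the field names are hypothetical, and the two callables stand in for the expression and selection arguments of TTree::Draw.

```python
def select_peak_values(events, value, cut):
    """Yield value(peak) for peaks passing cut, one event at a time."""
    for event in events:          # only one event's peaks are looked at at once
        for peak in event["peaks"]:
            if cut(peak):
                yield value(peak)


events = [
    {"peaks": [{"area": 12.0, "type": "s1"}, {"area": 300.0, "type": "s2"}]},
    {"peaks": [{"area": 5.0, "type": "s1"}]},
]
areas = list(select_peak_values(
    events,
    value=lambda p: p["area"],
    cut=lambda p: p["type"] == "s1" and p["area"] > 10,
))
print(areas)
```

Because the function is a generator, `events` can itself be a lazy iterator over a file, so no full list of peaks is ever built.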

Read code. Once you can write, you can also easily read.

JelleAalbers commented 8 years ago

@malfonsi and @tunnell, thanks for your answers! I think I'm slowly understanding why ROOT would be such a nice format to have. If we can just overcome these final technical challenges we should be able to do some nice analyses with it soon.

@tunnell and @remenska:

If I'm right the approach you're pursuing now is to replace pax's datastructure by a rootpy TreeModel. Might I suggest we try to create an ordinary output plugin for ROOT instead? We have done this for all output formats up until now, to avoid major surgery on the pax internals when output preferences change (which happens, as evidenced by our large number of output formats). If you do go the datastructure-replacement route there's a couple of problems you should be aware of:

I've added some example code for a root output plugin in the root_output branch (see plugins/io/ROOT.py). I stole some stuff from the notebook @remenska posted on gitter and from rootpy's examples. You can test it with

paxer --output my_rootfile --stop_after 2 --output_type root

(if it doesn't work it's probably because I'm on python2 + Windows :-). The output is very basic (only some event fields) but I think the idea is clear...

... though I wonder if we can do this thing in rootpy at all, regardless of how we do it. I couldn't find any mention of TClonesArrays in rootpy, nor anything on the creation of split branches. It looks like we want the combination of these, so we'll either have to contribute this to rootpy, find something equivalent that rootpy does offer, or not use rootpy.

If you have time for the first that would obviously be nice; rootpy has a nice API we could then also use for reading the data back in. However, if we then want some advanced feature later, we'd have to go back and change rootpy again. I believe rootpy is essentially a wrapper around pyROOT (it's pure python after all, which is why it works on windows :-) to make some common stuff more pythonic; but it seems we're trying to do some advanced things the wrapper can't yet do.

I never thought I would say this, but it seems to me the easiest solution is to just autogenerate a C++/CINT/whatever class, load it with ROOT.gROOT.ProcessLine, then fill and write the tree with pyROOT. You can steal examples on how to do all the nice things we want from the ROOT users guide and maybe other HEP code. pyROOT exposes the usual ROOT API, which is very well documented and supports all the advanced features.
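A minimal sketch of the autogeneration step, under the assumption that the class only has scalar members given as (C++ type, name) pairs. `cpp_class_source` and the member names are hypothetical and this is not the code in any pax branch; the loading step is left as a comment because it needs a ROOT installation.

```python
def cpp_class_source(class_name, members):
    """Build C++ source for a plain class from (c_type, name) pairs."""
    lines = [f"class {class_name} {{", "public:"]
    lines += [f"  {c_type} {name};" for c_type, name in members]
    lines.append("};")
    return "\n".join(lines)


source = cpp_class_source(
    "Event",
    [("Int_t", "event_number"), ("Double_t", "s1_area")],
)
print(source)

# With PyROOT available, the text could be saved to a file (name assumed here)
# and compiled and loaded through the interpreter, for example:
#   import ROOT
#   ROOT.gROOT.ProcessLine(".L Event.cpp+")
```

Variable-length members would additionally need the count member and the `//[n]` comment described earlier in the thread.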

(I know from @tunnell pyROOT is supposed to be mythically slow, but if that's true rootpy is not better (unless we could somehow use root_numpy on our advanced tree (which I doubt since numpy doesn't support ragged arrays)). Even then, as long as you don't write hits, output will remain a pretty small part of the total processing work).

Whatever you end up doing I'm willing to help, especially with pax integration (I don't know much ROOT and have little C++ experience).

tunnell commented 8 years ago

Replacing our data_models with the rootpy datamodels at the moment is the simplest, most standard, and probably best way of doing this. As I explained yesterday, some of these concerns are ungrounded and may result in a quite expensive 'conversion' step. At the end of the day, if you convince @remenska that your way is best, then she'll implement it the way she feels is best.

Nevertheless, I will quickly go point by point:

What are all these comments about TClonesArray? Day 1: data in ROOT file. Iterate from there (think Agile).

We're improving rootpy along the way for everybody's benefit.

We did the "Python writes C++ class text file that gets loaded" thing. It's sitting in a branch of mine. It's a terrible way to do this. The point of this work is to find a better way.

root_numpy is actually part of rootpy. Both are better in terms of speed.

Your help is certainly appreciated and will be useful. We are thinking before we work though. :P

tunnell commented 8 years ago

The tighter we integrate the output into the core datastructure of pax, the less chance we have of ever having huge delays from an S1 misordering problem again. Save what we measure instead of a somehow converted version.

remenska commented 8 years ago

@JelleAalbers Thanks for these points. If I understood correctly, your concerns are around breaking the current data_model.py functionality, and variable length arrays (which should be possible with rootpy, as soon as I figure out why STL dictionary generation fails when compiling. This was one of the rootpy issues to be looked into and fixed eventually). @tunnell's idea to subclass rootpy's TreeModel should solve the first one. Sorry, I was mostly playing around with rootpy's possibilities, ignoring your pax (Strict)Model, just for the time being.

I'm not yet clear how/if referencing other trees/branches will work in rootpy though (e.g., point to a list of Interaction/Peak/etc objects from within Event, or if that's even necessary)

Maybe a stupid question, but if the pax data structure is reflected in the ROOT output via rootpy, what is the advantage of using a separate plugin instead to build the rootpy model? If I understand your example, you're doing similar stuff, just building the Event class suitable for rootpy from the pax Event one. Good point, we will need to take care of conversions between python and rootpy columns in a cleaner way.

I thought the whole idea was that PyROOT is not pythonic enough, why is it on the table again :P It is worrying if you have evidence that PyROOT is mythically slow, though, that's news for me.

JelleAalbers commented 8 years ago

@tunnell / @remenska any objections to closing this old issue?