Python analysis package

LDMX-Software / fire

Event-by-event processing framework using HDF5 and C++17

https://ldmx-software.github.io/fire/

GNU General Public License v3.0

Python analysis package #27

Open tomeichlersmith opened 2 years ago

tomeichlersmith commented 2 years ago

Goal

I want a user to be able to do the following

import fire.ana
with fire.ana.File('input.h5') as f:
    for event in f:
        total_E = sum(event['recon/hits/energy'])

This will make fire depend explicitly on h5py, but only within this module.

I'm thinking the implementation would be similar to the current Framework's EventTree module, while using h5py to access the datasets on disk. Like the EventTree module, it would only be designed to read fire files. Users could still produce other HDF5 files by using h5py directly, but those files would not be standardized the way fire files are. (ROOT-based Python analyses work the same way.)
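A minimal sketch of what such a wrapper could look like, to make the access pattern above concrete. The class names File and Event here are hypothetical stand-ins, not the actual fire.ana implementation. The sketch also assumes a simplified layout in which every dataset has one row per event along its first axis; real fire files store variable-length data differently, which this does not handle.

```python
import os
import tempfile

import h5py
import numpy as np


def _first_axis_length(name, obj):
    """visititems callback: length of the first non-scalar dataset found."""
    if isinstance(obj, h5py.Dataset) and len(obj.shape) > 0:
        return obj.shape[0]
    return None  # returning None tells h5py to keep visiting


class Event:
    """Lazy view of one event (hypothetical): reads a row only when asked."""

    def __init__(self, h5file, index):
        self._file = h5file
        self._index = index

    def __getitem__(self, name):
        # Indexing an h5py dataset with an integer returns NumPy data
        return self._file[name][self._index]


class File:
    """Hypothetical stand-in for the proposed fire.ana.File (read-only)."""

    def __init__(self, path):
        self._file = h5py.File(path, "r")
        # ASSUMPTION: all datasets share the same number of events
        self._n_events = self._file.visititems(_first_axis_length) or 0

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._file.close()

    def __len__(self):
        return self._n_events

    def __iter__(self):
        for index in range(self._n_events):
            yield Event(self._file, index)


# Demo: write a tiny file with made-up data, then loop over it as in the goal
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.h5")
    with h5py.File(path, "w") as out:
        out.create_dataset(
            "recon/hits/energy",
            data=np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
        )
    with File(path) as f:
        n_events = len(f)
        totals = [float(sum(event["recon/hits/energy"])) for event in f]
```

Reading one row per key access keeps the sketch short but would be slow for large files; a real implementation would more likely read in chunks.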

tomeichlersmith commented 2 years ago

Supporting a to_awkward read method would help with some current Python analyses.

tomeichlersmith commented 2 years ago

I think borrowing (stealing?) some design principles from uproot is beneficial, both to align with some current Python analyses and because uproot is a well-designed package.

The main components I see are the following:

Chunk-based loading (what uproot calls "lazy" loading) so that only a few events' data are in memory at a time.
Full loading to get all requested data into a Python object in memory.
Loading into structures from three popular packages: numpy, pandas, and awkward.
Wrapping numpy-like slicing into the loading functions (not for improved performance, just to mimic the TTree::Draw access pattern).
I/O of binned data (histograms).
File exploration (already supported via h5py methods; perhaps add wrappers to make it more similar to uproot).
Loading multiple files works the same way as loading a single file (through either complete or chunk loading).
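Chunk-based loading could be sketched roughly as follows. The function name iter_chunks and the column names are hypothetical. The columns are plain numpy arrays here so the example runs on its own, but h5py datasets expose the same shape attribute and first-axis slicing, so they could be passed in their place. Full loading is the special case where the chunk size covers all events.

```python
import numpy as np


def iter_chunks(columns, chunk_size):
    """Yield dicts of NumPy arrays, each covering up to chunk_size events.

    columns maps a name to anything with a .shape and first-axis slicing
    (NumPy arrays in this demo).
    """
    if not columns:
        return
    # Stop at the shortest column so every chunk is aligned across columns
    n_events = min(col.shape[0] for col in columns.values())
    for start in range(0, n_events, chunk_size):
        stop = min(start + chunk_size, n_events)
        # Only this slice is materialized in memory
        yield {name: np.asarray(col[start:stop]) for name, col in columns.items()}


# Demo with made-up data: 10 events read 4 at a time
columns = {"energy": np.arange(10.0), "id": np.arange(10)}
chunk_lengths = [len(chunk["energy"]) for chunk in iter_chunks(columns, 4)]
total_energy = sum(float(chunk["energy"].sum()) for chunk in iter_chunks(columns, 4))
```

Multiple files could reuse the same generator by chaining one call per file, which keeps the single-file and multi-file interfaces the same.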

Because h5py backs this package, we can avoid the actual disk reading that uproot needs to implement. Instead, this package would need to focus on recursively reconstructing hierarchical data from the "flattened" data within the HDF5 file. I assume this will be similar in structure to the h5::Data class.
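To illustrate the kind of recursive reconstruction meant here, below is a rough sketch operating on nested dicts that stand in for HDF5 groups. The layout is an assumption made for illustration: a variable-length collection is a node with exactly two children, "size" (one count per event) and "data" (all entries concatenated). This is not confirmed to match the fire file layout or the h5::Data class, and the function name rebuild is hypothetical.

```python
import numpy as np


def rebuild(node):
    """Return a list with one Python object per entry described by node."""
    if not isinstance(node, dict):
        # Leaf: a flat array, one value per entry
        return np.asarray(node).tolist()
    if set(node) == {"size", "data"}:
        # ASSUMED vector layout: rebuild the flat entries, then cut per event
        entries = rebuild(node["data"])
        out, start = [], 0
        for count in np.asarray(node["size"]).tolist():
            out.append(entries[start:start + count])
            start += count
        return out
    # Plain group: rebuild each child, then zip the children into records
    children = {name: rebuild(child) for name, child in node.items()}
    return [dict(zip(children, values)) for values in zip(*children.values())]


# Demo with made-up data: 3 events holding 2, 0, and 1 hits
tree = {
    "hits": {
        "size": np.array([2, 0, 1]),
        "data": {
            "energy": np.array([1.0, 2.0, 3.0]),
            "id": np.array([10, 11, 12]),
        },
    },
    "run": np.array([7, 7, 7]),
}
events = rebuild(tree)
first_event_energy = sum(hit["energy"] for hit in events[0]["hits"])
```

Building Python lists and dicts like this is easy to follow but slow; the array-based outputs discussed for numpy, pandas, and awkward would avoid the per-entry Python objects.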

tomeichlersmith commented 2 years ago

Awkward

The function ak.zip is probably what we want. This would allow us to choose the objects to load and zip them together into the ragged-array style of awkward.

Pandas

pandas.DataFrame can simply wrap numpy arrays. I will need to test the performance, but that might work best.
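For flat (one value per event) data, the wrapping could look like the sketch below. The column names and values are invented for illustration. Whether pandas copies the arrays depends on the pandas version and the copy argument of the constructor, which is part of what the performance test would need to check.

```python
import numpy as np
import pandas as pd

# Made-up per-event columns, as they might come back from h5py reads
columns = {
    "recon/energy": np.array([1.5, 2.5, 4.0]),
    "recon/n_hits": np.array([2, 0, 1]),
}

# One row per event, one column per dataset name
df = pd.DataFrame(columns)

# Typical analysis-style selection and reduction
with_hits = df[df["recon/n_hits"] > 0]
total_energy = float(df["recon/energy"].sum())
```

Ragged data (several hits per event) does not fit this one-row-per-event shape directly and would need something like a per-hit table with an event index.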

Numpy

numpy arrays are what h5py returns by default when reading a dataset, so I doubt anything much heavier is needed.