Lazy loading / Iterating on parts of file

zonca commented 10 months ago

Referring to f = EventioKaitaiParser.from_file(TESTFILE)

Checking the code of the generated Python extension, I see that it loads the whole file eagerly in __init__.

It also seems there is no way to lazily load or iterate over the seq parts of the file.

Is this a limitation of Kaitai in general or only of the generated Python code?

_Originally posted by @maxnoe in https://github.com/cta-observatory/eventio_kaitai/pull/1#discussion_r1389253418_

jpivarski commented 10 months ago

This is only a limitation of the generated Python code. In our last Zoom call, this is what we were talking about that would have to change.

Some file formats have very natural breakpoints where eager-reading stops and waits for a user choice about what to read subsequently. For example, it's natural to read a ZIP or HDF5 file up to the point where you get a listing of what's in the file, then wait for the user to choose which subfile or Dataset to actually extract. ROOT files have that stopping point at the TDirectory and TTree metadata, and then you'd want to iterate through data in the TTree, because it is a kind of sequence. Some file formats are only for iteration, without any header/directory structure at all, like CSV or newline-delimited JSONs (or newline-delimited anything).

Kaitai is more wide-open about the kinds of files that it supports, so Kaitai itself doesn't have a concept of a directory or other stopping point. Everything is a sequence of data instances, but some of those sequences are headers that you want to read in their entirety while others are big-data payloads that you want to iterate over. Therefore, that information about the stopping point has to be injected somehow.

I don't think the Kaitai KSY language has a way to say "this is a point where eager processing ends" (although delimited structures might be part of a solution). For most targets other than the new Awkward one, the Kaitai developers expect users to write a main function in the target language, which is where choices like this would get expressed. Since the Awkward target aims to make a push-button Python module, this kind of choice would have to get configured in the production of that Python module.

As far as I can see, there are two kinds of stopping points:

1. We've read the header/metadata/column listing and now present the user with a choice of which nested objects to read. The structure that the load function returns is record-like, and we'd return it without some fields. Subsequent load calls (or maybe a different function name, like subload) either return the whole record again with the new field included or just the field value by itself.
2. The data that we're returning is a stream, i.e. list-like. The objects in the stream have delimiters between them and each batch contains a whole number of values. For instance, if it's newline-delimited JSONs, we don't stop reading in the middle of a JSON line. In Python, this would be more natural as a generator/iterator that yields data, rather than returning it.
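
Stopping point (2) can be sketched with the newline-delimited JSON case mentioned above. This is only an illustration using the standard library: the function name `iter_json_batches` and the `batch_size` parameter are made up for this example and are not part of Kaitai or Awkward Array.

```python
import io
import json
from itertools import islice


def iter_json_batches(stream, batch_size):
    """Yield lists of parsed JSON values, each built from whole lines only.

    Hypothetical helper: `stream` is any iterable of text lines.
    """
    # Parse one complete line at a time, so a batch never ends mid-record.
    records = (json.loads(line) for line in stream if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


# An in-memory stand-in for a newline-delimited JSON file.
ndjson = io.StringIO('{"x": 1}\n{"x": 2}\n{"x": 3}\n')
batches = list(iter_json_batches(ndjson, batch_size=2))
```

The last batch is allowed to be shorter than `batch_size`; what matters is that every batch holds a whole number of values.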

ZIP is a format that only has stopping point (1); CSV is a format that only has stopping point (2) (because we wouldn't stop reading and return control to the user after only reading the one-line header—there would be no choice for the user to make, anyway). ROOT is a format that has both stopping points (1) (TDirectory and TTree metadata, and soon also RNTuple metadata) and stopping points (2) (iterating through TTree or RNTuple entries).

First question: what kinds of stopping points does EventIO have? If it has a header that requires the user to make a choice about what to read next, like ZIP or ROOT, where is that point? Is there more than one? And how would we design an interface to Awkward-Kaitai, one that is not EventIO-specific, to say where it should stop?

Here's an idea: there's no reason to avoid reading non-list-like data. If there's a header, we can read all of its scalar fields (I'm including records within records in the word "scalar"). Only list-like data can be potentially large. A rule like "Don't eagerly read any list-like fields" is too strict; in a domain-specific setting like EventIO, some lists are known to be small. For example, the shape of a NumPy array is small; the number of values in the array can be huge. The load function could take a list of paths, like "field_name.nested_field_name.innermost_field_name", to not read eagerly. Since they're lists, the load function can always return data of the same type by returning these lists as empty lists. The generic Awkward-Kaitai framework can consider these configurable, while an EventIO reader built on top of it can bake in those field names.
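
A minimal sketch of that exclusion rule, under heavy assumptions: the "file" is a plain nested dict standing in for parsed Kaitai structures, and `load` and its `exclude` parameter are hypothetical names, not the Awkward-Kaitai API. A real reader would skip the bytes instead of dropping values that are already in memory; the point here is only the shape of the result.

```python
def load(source, exclude=()):
    """Return a copy of a nested record with excluded list fields emptied.

    Hypothetical sketch: `source` is a nested dict that stands in for the
    file, and `exclude` holds dotted paths such as "payload.values".
    """
    excluded = set(exclude)

    def visit(node, prefix):
        out = {}
        for name, value in node.items():
            path = f"{prefix}.{name}" if prefix else name
            if path in excluded:
                # Only list-like fields can be left out without changing the type.
                if not isinstance(value, list):
                    raise TypeError(f"{path!r} is not list-like")
                out[name] = []
            elif isinstance(value, dict):
                out[name] = visit(value, path)
            else:
                out[name] = value
        return out

    return visit(source, "")


fake_file = {
    "header": {"version": 3, "shape": [2, 3]},  # small list, read eagerly
    "payload": {"values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},  # potentially huge
}
result = load(fake_file, exclude=["payload.values"])
```

Because the excluded field comes back as an empty list rather than being removed, `result` has the same fields as a full read would have.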

I know that EventIO has stopping points of the (2) type, so there would be a second function—actually, a generator/iterator—that yields batches of data from the list that was excluded from the first read. This second function would have to be configured by "How many entries per batch?", which even a domain-specific EventIO reader would expose to the user.

On the Zoom call, I said that I was considering reintroducing virtual/lazy arrays into Awkward Array as a way to express these stopping points, but @agoose77 talked me out of it. (It would be very disruptive to the Awkward infrastructure, and you can get your laziness by having multiple functions in Awkward-Kaitai, as I've described above.) So Awkward Arrays, the return and yield values of these functions, would be eager, in-memory artifacts, though the whole file would not be represented by a single Awkward Array.

In what I described above, the two functions, "Load from the beginning of the file, but exclude a given set of nested field paths (that have list type)," and "Iterate over batches of a given nested field path (that has list type)" would be purely functional. After the first function returns data whose type is the whole file but has some list-like fields missing, it deletes all of the C++ class instances it used to produce the Awkward Array, closes the file, and returns just the Awkward Array (possibly ak.Record, rather than ak.Array). After the second function yields a batch of items from its list, the C++ instances associated with that batch get deleted, the Awkward Array is yielded, but the file remains open for more reading. There's mutable state in the iterator ("Where in the file are we?"), but it's the mutable state of a coroutine.
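
The file-handling behaviour of the two functions can be sketched on a toy binary format. Everything here is an assumption made for illustration: the layout (a little-endian uint32 count followed by that many float64 values), the names `load` and `iterate`, and the use of plain dicts and lists where the real functions would produce Awkward Arrays.

```python
import os
import struct
import tempfile

HEADER = struct.Struct("<I")  # toy header: number of values
VALUE = struct.Struct("<d")   # toy payload entry: one float64


def load(path):
    """Read only the header; the file is closed before returning."""
    with open(path, "rb") as f:
        (n,) = HEADER.unpack(f.read(HEADER.size))
    return {"n_values": n, "values": []}  # excluded list comes back empty


def iterate(path, entries_per_batch):
    """Yield batches of values; the file stays open between yields."""
    with open(path, "rb") as f:
        (remaining,) = HEADER.unpack(f.read(HEADER.size))
        while remaining > 0:
            count = min(entries_per_batch, remaining)
            raw = f.read(count * VALUE.size)
            # The file position is the only mutable state, held by the generator.
            yield [v for (v,) in VALUE.iter_unpack(raw)]
            remaining -= count


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    with open(path, "wb") as f:
        f.write(HEADER.pack(5))
        for v in (1.0, 2.0, 3.0, 4.0, 5.0):
            f.write(VALUE.pack(v))
    record = load(path)
    batches = list(iterate(path, entries_per_batch=2))
```

Each batch is an ordinary in-memory object once yielded; nothing in it refers back to the open file.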

maxnoe commented 10 months ago

> First question: what kinds of stopping points does EventIO have?

EventIO, the basic file format, has the "object" as its only unit. Iterating over these objects is the basic interface.

It's made a tiny bit more complicated by the fact that some objects can be "containers", i.e. are known to be streams of objects. Since you can only read most compressed streams forward efficiently, we iterate depth first.
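
As a rough sketch of that depth-first order: the `ToyObject` class and its fields below are invented for this example and do not reflect the real EventIO object header. Containers are modelled as in-memory lists of children, whereas a real reader would discover them while reading forward through the byte stream.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObject:
    """Hypothetical stand-in for an EventIO object."""
    type_id: int
    children: list = field(default_factory=list)  # non-empty only for containers


def iter_objects(objects, depth=0):
    """Yield (depth, object) pairs, descending into containers first."""
    for obj in objects:
        yield depth, obj
        if obj.children:
            # Visit the container's contents before its next sibling, which
            # matches the order in which the bytes appear in a forward-only stream.
            yield from iter_objects(obj.children, depth + 1)


stream = [
    ToyObject(1),
    ToyObject(2, children=[ToyObject(3), ToyObject(4, children=[ToyObject(5)])]),
    ToyObject(6),
]
order = [(depth, obj.type_id) for depth, obj in iter_objects(stream)]
```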

That's the basic interface I'd expect for a lazy loading eventio reader.

Then there is the question of the specific data formats that use eventio. We have two variants: the output of the CORSIKA iact extension, which stores Cherenkov light on the ground, and the output of sim_telarray, which stores telescope data.

These both consist of some header-like information in multiple eventio objects, followed by a sequence of air shower events, also consisting of multiple objects each. In the end, there is also some footer information, e.g. summary statistics about the simulation.

Depending on configuration of the software, the structure changes a bit (more information can be saved, etc.).

Here, the natural interface is to read the header part when opening the file and to offer iteration / lazy loading of the air shower events, providing the footer information once the loop has been exhausted.
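
One possible shape for such an interface, as a sketch only: `ToyEventFile` is a hypothetical class, the input is an iterable of `(kind, data)` pairs rather than real eventio objects, and each event is a single object here even though real events consist of several.

```python
class ToyEventFile:
    """Hypothetical reader: header on open, events by iteration, footer at the end."""

    def __init__(self, objects):
        self._objects = iter(objects)
        self.header = []
        self.footer = []      # stays empty until the events are exhausted
        self._pending = None  # first non-header object, already consumed
        for kind, data in self._objects:
            if kind != "header":
                self._pending = (kind, data)
                break
            self.header.append(data)

    def __iter__(self):
        # Single pass, like a forward-only stream: a second loop yields nothing.
        item = self._pending
        self._pending = None
        while item is not None:
            kind, data = item
            if kind == "event":
                yield data
            else:
                self.footer.append(data)
            item = next(self._objects, None)


objects = [
    ("header", {"run": 1}),
    ("header", {"site": "toy"}),
    ("event", {"id": 1}),
    ("event", {"id": 2}),
    ("footer", {"n_events": 2}),
]
f = ToyEventFile(objects)
footer_before = list(f.footer)
events = list(f)
```

With this design, `f.header` is available right after construction, while `f.footer` is only filled in after the loop over events has finished.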

Lazy loading / Iterating on parts of file #7