art-framework-suite / art

The implementation of the art physics event processing framework

Delayed reader for user-supplied input sources #111

Closed knoepfel closed 2 years ago

knoepfel commented 2 years ago

This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/26097 (FNAL account required) Originally created by @tomjunk on 2021-08-03 15:38:50


DUNE would like delayed-reader functionality for user-supplied input sources. Currently we rely on the RootInput source's delayed-read functionality to conserve memory. When reading a large amount of data per event, we rely on being able to read part of it (divided among TTree branches) and to free the memory with removeCachedProduct when we are done with one piece of an event's data and want to move on to the next piece.

While this works for input ROOT files (as long as the data are distributed among separate TBranches), our DAQ people are planning to provide HDF5-formatted data. The files will have manageable-size datasets inside them, but we need a mechanism by which the source can provide for delayed reading. This will require some coupling between art and user code: a module consuming data needs to know somehow what data are available for reading but have not yet been read in, and when data are requested, user code will generally need to reformat and/or decompress them as they come in.

We will want to do our own schema evolution in order to avoid multiply buffering input data. We anticipate that some sort of index will need to be created by the source when an event is first encountered, along with callbacks to methods in the source to do the delayed reading. Placing the data in the event memory is not required -- we'll want to free the memory up later anyway.

Below is an e-mail exchange I had with Kyle. I agree a service is a clumsy workaround and the wrong way to go. I am interested in discussing design details.


Hi Tom,

I'm very late in responding to this, and I apologize--too many irons in the fire.

Using a service is not the right approach--it introduces a globally accessible structure where none is necessary. It would also introduce configuration changes outside of the 'source' table, which breaks the abstraction of I/O vs. computation and creates more responsibility for the user. Only the event principal needs to know how to delay-read products, and an event-principal takes a delay reader as one of its constructor arguments. So that's the right way to go.

Every job that uses RootInput uses a delayed reader under the covers, whose responsibility is to load the products when they're retrieved for a given event, subrun, or run. The ability to remove from memory a product that is no longer needed happens at the event-handle level--i.e. that will be possible no matter how you load the data product into memory. Regarding thread-safety--a new delayed reader is created for every event, thus insulating one delayed reader from another, ameliorating some thread-safety issues.

Bottom line: the service approach is a workaround to solving the actual problem--for which there already is a solution. :)

I'm happy to consult or provide some assistance in creating the delayed reader. Just let me know.

Kyle

On Apr 21, 2021, at 2:01 PM, Thomas R Junk trj@fnal.gov wrote:

Hi Kyle,

Yes, the source we are using (an example from Kurt Biery) to read HDF5 files uses the art::Source template, though we expect it will take a bit of a redesign to provide this functionality. The source would have to put something in the event to indicate which products are available for delayed reading, and a call to a product-retrieval method would have to invoke something registered by the source to perform the actual data read-in. We also need the ability to remove cached products when we're done with them. We'd also have to define data products for the things we read out of the file but don't want to save. Currently we make artdaq::Fragments out of HDF5 input data, but it's not really necessary to use that data structure since we aren't necessarily using artdaq to make the data in the first place; it's just a convenient container.

I think this can all be accomplished with a service, which might be simpler even on our end, but it might not be thread safe.

--
Thomas R. Junk, Senior Scientist
Neutrino Division, Fermi National Accelerator Laboratory
630 840 3072 (office) | www.fnal.gov | trj@fnal.gov

From: Kyle Knoepfel knoepfel@fnal.gov Date: Wednesday, April 21, 2021 at 9:55 AM To: Thomas R Junk trj@fnal.gov Cc: artists artists@fnal.gov Subject: Re: Looking for suggestions on how to arrange HDF5 file reading in art

Hi Tom,

Thanks for the email. What you actually need is something called a "delayed reader," whose purpose is to read the product off of disk whenever the first product retrieval happens in a module--not when the event is read from the file. This is how ROOT-product reading happens.

Although the framework generally supports this behavior through the Principal constructors, if you're using the art::Source template, you don't have access to this interface. Can you confirm that you are using the art::Source template (including the SourceHelper::makeEventPrincipal interface)?

If so, then I can imagine extending the interface of SourceHelper::makeEventPrincipal to optionally accept a delayed-reader argument. At that point, we would need to develop a concrete subclass of the DelayedReader base class, which isn't necessarily trivial, but it would get you the functionality you require.

Kyle

On Apr 14, 2021, at 5:23 PM, Thomas R Junk trj@fnal.gov wrote:

Hi Folks,

DUNE DAQ people are interested in providing raw data in the HDF5 file format. I got an example art source from Kurt Biery, and I finished re-coding it to the C API for HDF5, which was a bit clumsier than the C++ API but not enormously so. It now works again as originally designed.

I'm asking because I want to extend the design to do things with HDF5 that we had figured out how to do with ROOT, but which took some DUNE-specific design work. The main issue is that we cannot afford to have all the raw data for a DUNE far-detector trigger record in memory at the same time, and it would be good if we could also process prototype data in pieces.

To do that, we relied on lazy reading and removeCachedProduct to conserve memory in the art event for the input data from the artroot file, and we kept the unpacked raw digits out of the event, processing them first and putting ROI'd versions in the event. So far so good.

We'd like to do something similar for non-ROOT data too. There's nothing special about HDF5 in this respect; it's just not ROOT data. We need to open the file, store a file descriptor somewhere, and be able to go back to it partway through processing a trigger record to get more data. Presumably the source can read metadata describing what's in the input file and make that available somehow to downstream modules (or, more likely, tools). These tools would then call methods that use the stored file descriptor to retrieve the actual data.

It sounds like a service may be the only solution here for storing the file descriptor and providing access to the input file. There is a concern about thread safety as well: we don't want two threads to try to process the same chunk of the input file, but that's on us to schedule our work so we don't trip over ourselves. The art event infrastructure seems to provide all of this functionality, but it might be tied to how data products are stored in ROOT trees, or at least labeled with art-style labels, which we don't necessarily get with a DAQ-formatted HDF5 file.

So... is this an abuse of a service?

Tom

knoepfel commented 2 years ago

Comment by @tomjunk on 2021-09-01 00:24:20


I think I misread Kyle's e-mail. It looks like the functionality needed in art already exists, and it's our job to create our experiment-specific delayed reader. I do have some questions about how to go about writing and using one. We may still have to put some metadata into a data product in the event to tell consuming modules what data are available to be read in, and how to retrieve those products is still an open question.

tomjunk commented 2 years ago

I think we solved this issue for streaming HDF5-formatted data into an art job, at least for the DUNE Vertical Drift coldbox case, with a use-case-specific solution. Our solution was to put the HDF5 file descriptor in a data product in the event; downstream modules and tools can use it to access the file and get what they need from it. The input source opens and closes the file, gets the run and event numbers, and puts the file descriptor in the event.

We like this solution because downstream modules can discover the contents of the file by reading HDF5 group names, attributes, and dataset names, without having to query for the presence of each kind of data one at a time. A delayed getManyByType would be needed if we didn't know what was present in the file and wanted everything, but not all at once -- just one piece at a time. This approach gives us the flexibility to explore the contents of the file in downstream modules and pick what we want.

It's an application-specific solution: it deserializes the data on read-in, and it knows about the format of the data inside the HDF5 datasets. But such a solution helps us organize our I/O the way we want. That said, the FD sim/reco people are now having issues on the output side -- the desire is to enable streaming on output, but that is a separate feature.

This feature request can be canceled, unless others want this sort of functionality.