AIDASoft / podio

PODIO
GNU General Public License v3.0
24 stars 60 forks source link

Davidl root reader t directory #579

Open faustus123 opened 6 months ago

faustus123 commented 6 months ago

BEGINRELEASENOTES Add ROOTReader::openTDirectory() to allow external management of TDirectory (e.g. TMemFile) to support streaming. ENDRELEASENOTES

This is motivated by issue #565 . It makes it possible to write a stream source that sends data packets in the form of serialized TMemFile objects using ROOT's built-in serialization/deserialization functionality. Each TMemFile may contain a small number of events in the "events" tree and the complete metadata trees allowing the other standard ROOTReader methods to work. I have a working test case that reads events from an existing PODIO ROOT file and sends them this way via zmq messages. This couples with the ePIC recon code and works so there is a proof-of-principle that this works as needed.

A couple of notes on what was done:

  1. A TTree *m_metaTree member was added to take the place of the TChain in most of the calls for which a TChain is not actually needed. When a TChain is actually used, this just points to the TChain.

  2. The TChain was changed from being kept in a unique_ptr to being a direct member of the class. I suspect the unique_ptr trick was used because TChain has its copy and assignment constructors unavailable as private members of the TChain class and that causes some issues with storing these in std::map. I ran into other complications with trying to implement m_metaTree to support both cases (an externally supplied TTree or an internal TChain) I resolved it by explicitly deleting the copy and assignment constructors for ROOTReader which avoids the problem of ownership of the TChain. Removing the unique_ptr mechanism was then possible which simplified things in the class. There may have been other reasons for using unique_ptr which I don't know about.

  3. A new test was added read_frame_root_tdirectory which is almost a complete copy of read_frame_root_multiple. I can't claim to fully understand the testing configuration, but hopefully this is in line with what you would like.

faustus123 commented 6 months ago

One question: How do you build the TMemFiles on the writing side? (Or plan to build)

You can see a pair of programs that do the serialization/deserialization and zmq messaging here. These are independent of the PODIO library. They are specific to the ROOT file format of the ePIC simulated data files in that it assumes 4 TTrees with the event data being kept in "events".

https://github.com/JeffersonLab/SRO-RTDP/tree/davidl_podio2tcp/src/utilities/cpp/podio2tcp

How this is implemented on the ePIC side is via a JANA JEventSource plugin. A rough prototype of that can be seen here. It mostly copies the one already in EICrecon that uses ROOTReader::openFiles(). This will need a lot of cleanup, but the core functionality is there.

https://github.com/JeffersonLab/SRO-RTDP/blob/davidl_podio2tcp/src/utilities/cpp/podiostream/podiostreamSource.cc

My main concern for this approach is that this ties the streaming of Frames rather tightly to ROOTs TTree backend. I would have tried to serialize on the FrameData level, which should be possible without any dependency on a specific backend. However, since you have a proof-of-principle already for the full chain, your approach is potentially a bit easier to implement (at least on the podio side), since a lot of the heavy lifting is done else where. You probably have a better feeling for how involved the overall approach is.

I agree 100% that a better implementation will be to do this at the Frame level. This approach was quicker for me to implement and didn't require my digging too deep into the PODIO source. It will serve my short term goals until someone more familiar with PODIO can do it the right way. I also think the openTDirectory() still adds some flexibility that may be useful in some other cases.

Another somewhat smaller concern is performance. When I first implemented Frame I/O I didn't pay too much attention to the things that happen "only once", e.g. when opening a file, or when first setting up a category for reading the first entry. Hence, I crammed quite a bit of (potentially) unnecessary work into that since the overhead will be distributed in any case. I am not entirely sure how things look if this "overhead" happens repeatedly for just a few events. In case all of the events that you stream have the same metadata, and the same categories there might be a way to cache a bit more of this information between calls to openTDirectory in case performance becomes an issue. Until that is the case, I would prefer the current version, since has fewer assumptions and is more readable.

Yes, performance is not great, but it is currently obscured by the horribly slow ePIC reconstruction code. When reading from a file, it takes 2-5 seconds to reconstruct an event. Preliminary testing with pre-formed send buffers yields sending takes only about 0.02 seconds/event.

When it comes to reading in the meta-data, one knob we have to turn is how many events are sent at once. I can just as easily send 100 events or more at a time which does reduce that overhead. I'll tune that as needed when I start scaling up for more performance testing.

tmadlener commented 6 months ago

Thanks for the links. After a quick look and IIUC the main use case here is to read from stream in Jana2, but the source of the stream would then be outside of Jana2? I am mainly asking, because I was wondering whether you also planned to implement (or actually need) the creation of the TMemFile somewhere in memory. Currenty, it seems that everything starts from an existing podio file on disk rather.

faustus123 commented 6 months ago

Thanks for the links. After a quick look and IIUC the main use case here is to read from stream in Jana2, but the source of the stream would then be outside of Jana2? I am mainly asking, because I was wondering whether you also planned to implement (or actually need) the creation of the TMemFile somewhere in memory. Currenty, it seems that everything starts from an existing podio file on disk rather.

Correct, there is no need for a JANA2 process on the sending side.

I do not anticipate using TMemFile in any production streaming system. It is just the (relatively) easy way to serialize ROOT objects which is the form the simulated ePIC data happens to be in. My expectation is that the DAQ WG will decide on some custom format for the stream data driven by the customized nature of the electronics. We may deal with that either in a specific JANA2 JEventSource or by some future transformer code that converts it into a PODIO stream (whatever that design turns out to be).

It is true that at the moment and for the next couple of years we will likely only have simulated data and it will be in the form of podio files on disk. Hence the need to generate stream(s) from that so the SRO infrastructure can start development.

faustus123 commented 4 months ago

I haven't heard anything on this for 2 months. Are there any plans to merge it or any hold-ups preventing it?

hegner commented 4 months ago

@faustus123 - apologies for having it dormant for so long. We've been doing some refactoring which we just managed to finish. If you can update your PR I can deal with it this week.

faustus123 commented 4 months ago

Uggh. I just tried to rebase to bring my fork up to date and a lot of changes have happened in the ROOTReader files in the meantime. It is going to take a while to try and resolve this. It would probably be more efficient if I do this after you have finished with whatever work you are doing there. Please let me know when you are finished with the current round of work and I will take a look at it then.

tmadlener commented 4 months ago

The changes that we want to go in first are the ones in #625. Looking at the changes in this PR, I am wondering whether you cherry-picking might be easier than rebasing?