AIDASoft / podio

PODIO
GNU General Public License v3.0

Frame serialization/deserialization #565

Open faustus123 opened 8 months ago

faustus123 commented 8 months ago

I would like to read frames (events) from a PODIO file and stream them over a network to a remote process. I would then like that remote process to process each frame in the same way it would if the frame had been read from a local file using the existing API. This would need to honor any associations.

Note that I'm not really interested in solutions like just letting xrootd handle the transfer since I need some control over the stream(s), buffer headers, and networking details.

hegner commented 8 months ago

To get an understanding of how low-level you want to go here: could you write down some pseudo-code to show which parts you would do yourself and which parts you'd expect PODIO to do? And what would the atomic piece of the streaming be, an entire frame or single collections? And do you expect both sides to use the same language?

tmadlener commented 8 months ago

From a purely technical point of view, as long as you have a FrameDataT that effectively implements the same functionality as the EmptyFrameData, everything should work as expected when you construct a Frame from it.

https://github.com/AIDASoft/podio/blob/d12cf45698f9764cee061b19acc73afd5bf22b67/include/podio/Frame.h#L41-L61

I think something like this would be the thing that goes over the network, as you have effectively full control of what you put in there from a content and also technical perspective. The main thing that could make this a bit more complicated is the fact that the collection ID table and the IDs that are in the buffers need to be consistent. When starting from a file this should not really be a problem, I think.
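The consistency requirement mentioned above can be illustrated with a minimal sketch. It does not use podio types: `IdTableSketch` and `idsAreConsistent` are hypothetical names, and the table is reduced to parallel lists of collection names and IDs. The point is only that every collection ID referenced from the transferred buffers has to be resolvable through the table that travels with them.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a collection ID table (NOT the podio class):
// names[i] is the collection name that belongs to ids[i].
struct IdTableSketch {
  std::vector<std::string> names;
  std::vector<std::uint32_t> ids;
};

// Returns true if every collection ID referenced from the buffers
// (e.g. by relations between objects) is present in the table.
inline bool idsAreConsistent(const IdTableSketch& table,
                             const std::vector<std::uint32_t>& referenced) {
  return std::all_of(referenced.begin(), referenced.end(), [&](std::uint32_t id) {
    return std::find(table.ids.begin(), table.ids.end(), id) != table.ids.end();
  });
}
```

A receiver could run such a check once per frame before resolving any associations. When the table and the buffers both come from the same file entry, the check should pass trivially.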

faustus123 commented 8 months ago

Here is some pseudo code along the lines of what I was thinking of:

//----------------------------------------------------------
// For the sender side
podio::ROOTFrameReader m_reader;
m_reader.openFile( GetResourceName() );

for( unsigned i=0; i < m_reader.getEntries("events"); i++ ){
    auto frame_data = m_reader.readEntry("events", i);
    auto frame = std::make_unique<podio::Frame>(std::move(frame_data));

    std::vector<uint8_t> buff;
    frame->Serialize( buff ); // proposed method, does not exist in podio yet

    // send buffer to remote
}

//----------------------------------------------------------
// For the receiver side

while(is_connected){

    std::vector<uint8_t> buff = ReadBufferFromSocket();
    auto frame = podio::Frame::Deserialize( buff ); // proposed method, does not exist in podio yet

    // Do something with collections in frame
}

I also could see needing a couple of calls to handle the non-event data that would be done at the beginning of each program.

As for the podio::CollectionReadBuffers class, I guess I'd have to look into how to serialize those as individual objects. It looks like the opening of a rabbit hole that I was hoping to avoid. Perhaps with a little more guidance I could look into it.

tmadlener commented 8 months ago

This looks quite sensible. As I mentioned before, I am not sure I would put the de-/serialize functionality on the Frame or whether I would create some new FrameData type (or extend the existing ROOTFrameData) that has that functionality. It could avoid some up-front work that happens when constructing a Frame from the frame data.

Is reading from a file the main use case here, or do you envisage also having some algorithm create / populate a frame and then send that off somewhere? If it's mainly the former I would probably go for a solution involving the FrameData as we probably have easier access to some "buffer like" data. However, if the latter is also a use case then the Frame would be the more natural point to tack the functionality on, I think.

The CollectionReadBuffers are something between a useful abstraction and a bit of a hack at the moment, tbh ;) Effectively they are a void* to the data buffers and some std::functions that do the actual work, where we generate most of them via our templates to inject type information back into the system. In principle it would be possible to add a

std::function<void(podio::CollectionReadBuffers const&, std::vector<uint8_t>&)> serialize;

as a member function and then populate that with the correct implementation for each type at code generation time. However, if we really need to make them part of the "public" parts of podio, we should probably think about whether that is the best way to go about it.
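As a rough sketch of that idea, assuming nothing about the real podio internals: `ReadBuffersSketch` below is a hypothetical stand-in for `podio::CollectionReadBuffers`, and the `makeBuffers` template plays the role that the code generator would play for each datatype. The serialization itself is a plain byte copy in native layout, which is only an illustration and not a portable wire format.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for podio::CollectionReadBuffers: an untyped pointer
// to the data plus a function that still knows the real type behind it.
struct ReadBuffersSketch {
  void* data{nullptr};
  std::function<void(const ReadBuffersSketch&, std::vector<std::uint8_t>&)> serialize;
};

// Assumption: one instantiation per datatype, standing in for generated code.
// The template re-injects the type information that the void* has lost.
template <typename DataT>
ReadBuffersSketch makeBuffers(std::vector<DataT>& storage) {
  static_assert(std::is_trivially_copyable_v<DataT>,
                "a raw byte copy is only valid for trivially copyable data");
  ReadBuffersSketch buffers;
  buffers.data = &storage;
  buffers.serialize = [](const ReadBuffersSketch& self, std::vector<std::uint8_t>& out) {
    const auto* typed = static_cast<const std::vector<DataT>*>(self.data);
    const std::size_t nBytes = typed->size() * sizeof(DataT);
    const std::size_t offset = out.size();
    out.resize(offset + nBytes);  // append after whatever is already in out
    if (nBytes > 0) {
      std::memcpy(out.data() + offset, typed->data(), nBytes);
    }
  };
  return buffers;
}
```

Calling `buffers.serialize(buffers, out)` appends the raw bytes of the underlying vector to `out` without the caller knowing `DataT`. The buffers object only borrows `storage`, so the vector has to outlive it.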

Do you already have some library that does the de-/serialization? In case you haven't, I think the SIO backend that we have solves quite a few things already, and we might be able to use that to create a new set of readers and writers that effectively write to / read from a socket and otherwise simply use functionality that is already present.

faustus123 commented 8 months ago

My short term goal is to read an ePIC simulated data file, split the data into multiple streams, and recombine them in a specialized JEventSource. This will allow the standard ePIC analysis to be run with data straight from the stream. The first step will be a single stream, but multiple streams will hopefully soon follow.

My longer term goals include dynamically filling a frame in memory and then serializing it. I can't say for certain though how far upstream PODIO will go in ePIC since AFAIK it has not been seriously discussed. For the purposes of streaming system development though, it will be a very useful tool.

tmadlener commented 8 months ago

Just for my understanding and clarification: You do not actually care how we do the de-/serialization, right? This would include the actual type of the buffer. So does it have to be a vector<uint8_t>, or could we also use a vector<char> or something else that resembles "a collection of bytes" as long as we know how to interpret them?

faustus123 commented 8 months ago

Correct. I would just need a buffer reference and its size so I could pass it to a generic write command. The data type can be anything that represents a collection of bytes.
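For illustration, here is a minimal sketch of how such an opaque byte buffer could be framed for a generic write command. Nothing here is podio API: `addLengthHeader` and `popMessage` are hypothetical helpers, and the 4-byte big-endian length header is just one possible choice of buffer header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper (not podio): prepend a 4-byte big-endian length so the
// receiver knows where one serialized frame ends and the next begins.
inline std::vector<std::uint8_t> addLengthHeader(const std::vector<std::uint8_t>& payload) {
  assert(payload.size() <= 0xFFFFFFFFu);  // the header only holds 32 bits
  const auto n = static_cast<std::uint32_t>(payload.size());
  std::vector<std::uint8_t> msg;
  msg.reserve(payload.size() + 4);
  for (int shift = 24; shift >= 0; shift -= 8) {
    msg.push_back(static_cast<std::uint8_t>((n >> shift) & 0xFFu));
  }
  msg.insert(msg.end(), payload.begin(), payload.end());
  return msg;
}

// Hypothetical helper (not podio): try to take one complete message from the
// front of the received bytes. Returns std::nullopt and leaves `stream`
// untouched while the message is still incomplete.
inline std::optional<std::vector<std::uint8_t>> popMessage(std::vector<std::uint8_t>& stream) {
  if (stream.size() < 4) {
    return std::nullopt;
  }
  std::uint32_t n = 0;
  for (std::size_t i = 0; i < 4; ++i) {
    n = (n << 8) | stream[i];
  }
  if (stream.size() - 4 < n) {
    return std::nullopt;
  }
  const auto first = stream.begin() + 4;
  const auto last = first + static_cast<std::ptrdiff_t>(n);
  std::vector<std::uint8_t> payload(first, last);
  stream.erase(stream.begin(), last);
  return payload;
}
```

The sender would pass `msg.data()` and `msg.size()` to whatever write call it uses. The receiver appends incoming bytes to a buffer and calls `popMessage` until it returns `std::nullopt`, which also copes with partial reads from a socket.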

hegner commented 8 months ago

OK. And the required granularity for you would be on the frame level only for the time being? We could provide you with something to play with relatively quickly once we have finished a few other outstanding issues. For obvious reasons, we'd put things into an experimental namespace for the time being.

faustus123 commented 8 months ago

Yes, frame level would be good for now. Collection level may be useful later, but it brings the complication of how to handle associations so I'd rather push that headache down the road until we have a clearer motivation for it.

Understood on the namespace.

faustus123 commented 7 months ago

Just checking on the progress here. I have an LDRD milestone that would benefit from this, but will implement a different, temporary hack there if the timeline is going to be more than ~1 week. Not pressuring anyone, just figuring out my best course of action.

faustus123 commented 7 months ago

FYI: I did get some sudden inspiration over the weekend and have implemented what I think may be a solution. It did require some modification of the RootReader class: essentially, it adds an openTDirectory method as an alternative to openFiles. Unfortunately, I ran into an issue building our primary recon program in order to test it. I think that is just a versioning issue on our end. I'll work on it more when I can find time later in the week and will report back then.

hegner commented 7 months ago

Thanks. On my side there are Easter vacations scheduled so no progress in the next week. Your idea sounds interesting, but in the long run we want something not depending on ROOT there.