AIDASoft / podio

PODIO
GNU General Public License v3.0
24 stars 60 forks source link

Support for reading multiple files with SIO #647

Open tmadlener opened 3 months ago

tmadlener commented 3 months ago

Currently the SIOReader does not support automatic file switching.

SaxenaAnushka102 commented 3 weeks ago

Hello @tmadlener! I'd love to work on this. Could you please provide more details on the expected implementation? From my understanding, the new functionality will involve defining methods in SIOReader.h and implementing them in SIOReader.cc for automatic file switching. Does this approach align with what's expected?

tmadlener commented 3 weeks ago

Hi @SaxenaAnushka102, this will involve quite a bit of c++ and after having had another look into it, it might not be as straight forward as I had originally anticipated. Nevertheless, let me lay out the basic pieces of work that would need to be done and then you can still decide if you want to embark on this :)

The main work will have to happen in the SIOReader indeed. Then there are a few minor things in other places to remove some "artificial" restrictions. We will also fix a small issue along the way


The main building blocks of the SIOReader are:

There are a few more members, but we don't care too much about them here.

The small issue

The SIOReader has a bool readFileTOCRecord(), that reads a table of content (TOC) record which is then used throughout to jump to different positions inside the file. However, we never check whether this function call actually succeeds in openFile. https://github.com/AIDASoft/podio/blob/8651fdde64e45896c490fc11f19fe53128c36b82/src/SIOReader.cc#L23-L24 There really should be a check there and if it returns false, then we need to throw a std::runtime_error telling us that we cannot use this file.


The actual work

The main part of the work will be to make sure that we switch files whenever necessary. The main challenge in this case will be that when reading these files, there can be different categories and it is possible that we read some categories much faster than we do others. Hence, it's possible that we run out of entries for one category in a file, while we still have entries for other categories in this file. Obviously, we still need to be able to go back to those entries. This means that we will have to keep track of which entries of each category are in which file and then potentially open another file when the current entry is not in the file that is currently open. We could in principle also keep multiple files open at the same time, but then we have to keep track of all of them and we will potentially have very many open file handles, which is something we would like to avoid.

The steps that we will have to take are:

I think most of the functionality can be put into the track keeping structure, such that the SIOReader can be left almost untouched.

The structure that keeps track of which entry is in which file would look something like this (from an interface point of view)

struct EntryFileMap {
  EntryFileMap(const std::vector<std::string>& filenames, const std::vector<podio::SIOFileTocRecord>& tocRecords);
  std::tuple<sio::ifstream&, unsigned> getFileAndPosition(const std::string& category, unsigned iEntry) const;
};

For the implementation one would probably go with something like a

  std::vector<std::string> m_filenames; ///< All the file names
  sio::ifstream m_currentStream;  ///< The currently open stream / file
  unsigned m_currentStreamIdx;  ///< The index in the filenames vector of the currently open stream
  std::vector<SIOFileTocRecord> m_fileTOCRecords; ///< The toc records of each file

Then getFileAndPosition should be implementable in a fairly straight forward manner:

SaxenaAnushka102 commented 2 weeks ago

Hi @tmadlener, Thanks so much for the detailed explanation! While it sounds like there's a bit more complexity than I initially anticipated, I'm still very interested in working on this and getting some hands-on experience with the SIOReader code. To get started, I had a couple of questions for clarification: In the new structure EntryFileMap, how will it handle situations where an entry might span multiple files? For the openFiles method, is there a specific way we want to handle potential errors during the TOC record reading process for each file? I'm happy to jump in and start working on the structure that keeps track of entries and files (potentially EntryFileMap). Do you have any specific guidance on the implementation approach for this part? Also, are there any project community calls that I can be a part of to know more & discuss this further? Thanks, Anushka.

tmadlener commented 2 weeks ago

Glad to hear it :) And obviously very happy to help along the way. To answer your specific questions:

In the new structure EntryFileMap, how will it handle situations where an entry might span multiple files?

Entries will always be fully within one file, unless there is a problem with the file. However, in that case there is not too much we can do in any case.

For the openFiles method, is there a specific way we want to handle potential errors during the TOC record reading process for each file?

For the first version we simply go the easiest way; If we find an issue in any of the files during TOC reading, we will just throw an exception, as there is again not too much we can do with faulty input files.

Do you have any specific guidance on the implementation approach for this part?

I would start with something a long the lines in my original comment. I am fairly certain it will need some refinement that we will discover throughout the development.

Also, are there any project community calls that I can be a part of to know more & discuss this further?

No, not really at this point.