Open tmadlener opened 3 months ago
Hello @tmadlener! I'd love to work on this. Could you please provide more details on the expected implementation? From my understanding, the new functionality will involve defining methods in SIOReader.h
and implementing them in SIOReader.cc
for automatic file switching. Does this approach align with what's expected?
Hi @SaxenaAnushka102, this will involve quite a bit of c++ and after having had another look into it, it might not be as straight forward as I had originally anticipated. Nevertheless, let me lay out the basic pieces of work that would need to be done and then you can still decide if you want to embark on this :)
The main work will have to happen in the SIOReader
indeed. Then there are a few minor things in other places to remove some "artificial" restrictions. We will also fix a small issue along the way
The main building blocks of the SIOReader are:
sio::ifstream
which is just a typedef for an std::ifstream
filestream from where we read the data.SIOFileTOCRecord
that bascially just contains a bunch of file positions. We have different Frame categories in podio and each category can have arbitrarily many entries in the file. The SIOFileTOCRecord
simply maps all category names to all file positions for all entries.There are a few more members, but we don't care too much about them here.
The SIOReader
has a bool readFileTOCRecord()
, that reads a table of content (TOC) record which is then used throughout to jump to different positions inside the file. However, we never check whether this function call actually succeeds in openFile
.
https://github.com/AIDASoft/podio/blob/8651fdde64e45896c490fc11f19fe53128c36b82/src/SIOReader.cc#L23-L24
There really should be a check there and if it returns false, then we need to throw a std::runtime_error
telling us that we cannot use this file.
The main part of the work will be to make sure that we switch files whenever necessary. The main challenge in this case will be that when reading these files, there can be different categories and it is possible that we read some categories much faster than we do others. Hence, it's possible that we run out of entries for one category in a file, while we still have entries for other categories in this file. Obviously, we still need to be able to go back to those entries. This means that we will have to keep track of which entries of each category are in which file and then potentially open another file when the current entry is not in the file that is currently open. We could in principle also keep multiple files open at the same time, but then we have to keep track of all of them and we will potentially have very many open file handles, which is something we would like to avoid.
The steps that we will have to take are:
SIOFileTOCRecord
, but instead of just returning the file position, it should return the correct ifstream
and the position in that file. This will be populated once at the beginning, such that it can then simply be queried afterwards.openFiles
method to the SIOReader
that takes a const std::vector<std::string>&
(aka the filenames) as inputs.
openFile
method in terms of this one to avoid code duplication (this is also done, e.g. for the ROOTReader
)openFiles
we would simply open each file once, read its SIOFileTOCRecord
and use that to populate the structure described in the first step. This would require that the readFileTOCRecord
method returns a newly constructed SIOFileTOCRecord
instead of populating an internal one.I think most of the functionality can be put into the track keeping structure, such that the SIOReader
can be left almost untouched.
The structure that keeps track of which entry is in which file would look something like this (from an interface point of view)
struct EntryFileMap {
EntryFileMap(const std::vector<std::string>& filenames, const std::vector<podio::SIOFileTocRecord>& tocRecords);
std::tuple<sio::ifstream&, unsigned> getFileAndPosition(const std::string& category, unsigned iEntry) const;
};
For the implementation one would probably go with something like a
std::vector<std::string> m_filenames; ///< All the file names
sio::ifstream m_currentStream; ///< The currently open stream / file
unsigned m_currentStreamIdx; ///< The index in the filenames vector of the currently open stream
std::vector<SIOFileTocRecord> m_fileTOCRecords; ///< The toc records of each file
Then getFileAndPosition
should be implementable in a fairly straight forward manner:
Hi @tmadlener,
Thanks so much for the detailed explanation! While it sounds like there's a bit more complexity than I initially anticipated, I'm still very interested in working on this and getting some hands-on experience with the SIOReader
code.
To get started, I had a couple of questions for clarification:
In the new structure EntryFileMap
, how will it handle situations where an entry might span multiple files?
For the openFiles
method, is there a specific way we want to handle potential errors during the TOC record reading process for each file?
I'm happy to jump in and start working on the structure that keeps track of entries and files (potentially EntryFileMap
). Do you have any specific guidance on the implementation approach for this part? Also, are there any project community calls that I can be a part of to know more & discuss this further?
Thanks,
Anushka.
Glad to hear it :) And obviously very happy to help along the way. To answer your specific questions:
In the new structure
EntryFileMap
, how will it handle situations where an entry might span multiple files?
Entries will always be fully within one file, unless there is a problem with the file. However, in that case there is not too much we can do in any case.
For the openFiles method, is there a specific way we want to handle potential errors during the TOC record reading process for each file?
For the first version we simply go the easiest way; If we find an issue in any of the files during TOC reading, we will just throw an exception, as there is again not too much we can do with faulty input files.
Do you have any specific guidance on the implementation approach for this part?
I would start with something a long the lines in my original comment. I am fairly certain it will need some refinement that we will discover throughout the development.
Also, are there any project community calls that I can be a part of to know more & discuss this further?
No, not really at this point.
Currently the
SIOReader
does not support automatic file switching.