AIDASoft / podio

PODIO
GNU General Public License v3.0
24 stars 60 forks source link

Support for globbing in `podio::CreateDataFrame`, `podio::makeReader` and `podio.reading.get_reader` #686

Open Zehvogel opened 3 weeks ago

Zehvogel commented 3 weeks ago

tl; dr:

import podio

# breaks
# reader = podio.reading.get_reader("some/path/*.edm4hep.root")

# works
reader = podio.root_io.Reader("some/path/*.edm4hep.root")

The ROOTReader was (and technically still is) able to read from multiple files at once using globbing as the names are passed to TChain directly

https://github.com/AIDASoft/podio/blob/46b02cfaa00b40f1356a1670860bd57ee533b0cd/src/ROOTReader.cc#L240-L251

Using the new Reader interface this does not work any more as every path is passed to TFile first

https://github.com/AIDASoft/podio/blob/46b02cfaa00b40f1356a1670860bd57ee533b0cd/src/Reader.cc#L38-L44

This is particularly annoying when using the podio DataSource as the normal RDataFrame behaviour is also to pass everything to TChain. I know this was probably never explicitly supported but it would be great to get this back :)

tmadlener commented 3 weeks ago

Do you know when this last worked? Because, no recent change comes to my mind that would alter this behavior. I don't think podio.reading.get_reader every supported globbing, because that effectively opens whatever is passed as a filename directly via TFile.Open (which is also where I suppose this is breaking): https://github.com/AIDASoft/podio/blob/46b02cfaa00b40f1356a1670860bd57ee533b0cd/python/podio/reading.py#L38-L40


It probably is slightly confusing, but the c++ "interface" podio::Reader is completely distinct from anything that happens in python. The podio.root_io.Reader is the podio::ROOTReader https://github.com/AIDASoft/podio/blob/46b02cfaa00b40f1356a1670860bd57ee533b0cd/python/podio/root_io.py#L13-L28

Zehvogel commented 3 weeks ago

Sorry maybe my description was too cryptic.

It probably is slightly confusing, but the c++ "interface" podio::Reader is completely distinct from anything that happens in python.

I did not know that but it also does not matter. In the end both the podio.reading.get_readerand podio::Reader::makeReader do not support globbing and even for the same reason (using TFile). I just choose python for an easier example. The podio.root_io.Reader aka. podio::ROOTReader always did support globbing and still does.

However, only podio::ROOTReader supporting globbing does not help me if I want to use the DataSource though.

tmadlener commented 3 weeks ago

Right, I see where you are coming from. I am nevertheless removing the Regression from the issue title, since neither get_reader nor makeReader, nor the DataSource ever supported globbing, at least not in the version in which they finally landed in podio.

The problem with supporting globbing from the c++ side is, that there we would have to effectively implement our own wrapper around glob for that. For the python bindings we could probably hide a glob.glob in a thin wrapper somewhere.