jpivarski closed 4 months ago
So, I'm going to try to write up my performance study in https://github.com/scikit-hep/uproot5/issues/1157 to collect everything in one spot, but yes, fsspec has `cat_ranges`, which allows us to bypass this particular multi-open issue. I also looked into caching open file handles in #54, but I think this is really unreliable because the server disallows opening a file for writing if any reader is connected, and this cache leaks reader handles. To do it right, we need to use (async or regular) context managers for all file access.
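To make the handle-leak concern concrete, here is a minimal sketch. `ToyServer`, `reading`, and `writer_is_blocked` are hypothetical names invented for illustration; the toy server only models the behaviour described above (a writer is refused while any reader is connected) and is not the XRootD or fsspec-xrootd API.

```python
from contextlib import contextmanager


class ToyServer:
    """Toy model (not XRootD): refuses a writer while any reader is connected."""

    def __init__(self):
        self.readers = 0

    def open_read(self):
        self.readers += 1

    def close_read(self):
        self.readers -= 1

    def open_write(self):
        if self.readers > 0:
            raise PermissionError("readers are still connected")
        return "write-handle"


@contextmanager
def reading(server):
    """Hold a reader handle only for the duration of the with-block."""
    server.open_read()
    try:
        yield server
    finally:
        server.close_read()  # released even if the body raises


def writer_is_blocked(server):
    """Return True if the server refuses to open the file for writing."""
    try:
        server.open_write()
    except PermissionError:
        return True
    return False


# A cache that opens a reader and never closes it blocks later writers.
leaky = ToyServer()
leaky.open_read()
leaky_blocked = writer_is_blocked(leaky)

# Scoping the reader with a context manager leaves the server clean.
scoped = ToyServer()
with reading(scoped):
    pass  # byte-range reads would go here
scoped_blocked = writer_is_blocked(scoped)
```

In this toy model the leaked reader blocks the writer, while the scoped reader does not, which is the argument for routing all file access through context managers.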
> Is there a way to use XRootD in a lightweight, stateless way (like HTTP connections)?
I spent a bit of time this morning reading the protocol doc https://xrootd.slac.stanford.edu/doc/dev56/XRdv520.pdf and I don't see any obvious way to interact with an xrootd server in a stateless way: one needs to acquire an opaque `char[4] fHandle` in `kXR_open` to then pass as an argument to `kXR_read`.
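The open-then-read pattern can be sketched as a toy, in-memory model. Everything here (`ToyStatefulServer`, the method names `kxr_open`, `kxr_read`, `kxr_close`, the path) is a hypothetical illustration of why the exchange is stateful; it does not reproduce the XRootD wire format or any client library.

```python
import os


class ToyStatefulServer:
    """Toy model of an open-then-read protocol; not the XRootD wire format."""

    def __init__(self, files):
        self._files = files    # path -> file contents (bytes)
        self._handles = {}     # opaque 4-byte handle -> path

    def kxr_open(self, path):
        """Register an open file and return an opaque 4-byte handle."""
        if path not in self._files:
            raise FileNotFoundError(path)
        handle = os.urandom(4)
        while handle in self._handles:  # avoid reusing a live handle
            handle = os.urandom(4)
        self._handles[handle] = path
        return handle

    def kxr_read(self, handle, offset, length):
        """Read bytes; fails with KeyError if the handle is not open."""
        path = self._handles[handle]
        return self._files[path][offset:offset + length]

    def kxr_close(self, handle):
        """Forget the handle; later reads with it fail."""
        del self._handles[handle]


server = ToyStatefulServer({"/data/file.root": b"0123456789"})
fhandle = server.kxr_open("/data/file.root")
first = server.kxr_read(fhandle, 0, 4)
second = server.kxr_read(fhandle, 6, 4)
server.kxr_close(fhandle)

try:
    server.kxr_read(fhandle, 0, 4)
    read_after_close_failed = False
except KeyError:
    read_after_close_failed = True
```

Because every read depends on a handle that the server issued earlier, a client cannot send a self-contained "give me bytes N to M of this path" request the way it could with an HTTP range request.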
> I think this is really unreliable because the server disallows opening a file for writing if any reader is connected
Does anyone actually write ROOT files directly over XRootD? I understand why `fsspec-xrootd` would want to support it, but for uproot a keyword argument on the filesystem class that enables the cache and prevents writing might be enough, at least as a short-term fix.
It is now technically possible, but I don't think it's a good idea. Writing a ROOT file involves a lot of seeking back and forth (it can't be written directly from beginning to end unless the sizes of everything that is to be written are known in advance), and that would mean a lot of interaction over the network.
Since it was a requested feature, we can't break it, but we don't need to ensure that it is the optimal path. If reopening the file is necessary for writing but not for reading, that would be fine.
Writing files in general over xrootd is a highly desired feature. For example, I am writing several GB of parquet files to FNAL EOS storage in my skim example. It works quite well. I would hope we extend uproot writing to support fsspec sinks, using the simplecache local cache feature to only write (commit) the whole file at the end of the writing process.
All that said, I am happy to re-start work on #54
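The "write locally, commit once at the end" idea can be sketched with the standard library alone. This is only an illustration of the pattern: `commit_on_close` and the `upload` callback are hypothetical names, and the sketch does not use or describe fsspec's simplecache implementation.

```python
import tempfile
from contextlib import contextmanager


@contextmanager
def commit_on_close(upload):
    """Yield a seekable local scratch file; call upload(bytes) once on success."""
    with tempfile.TemporaryFile() as local:
        yield local                 # the writer may seek back and forth freely
        local.seek(0)
        upload(local.read())        # skipped if the with-body raised


uploads = []                        # stands in for a single remote write
with commit_on_close(uploads.append) as f:
    f.write(b"HEADER--")            # placeholder header
    f.write(b"payload")
    f.seek(0)                       # local seek: no network round trip
    f.write(b"PATCHED-")            # overwrite the header in place
```

All the back-and-forth seeking happens on local disk, and the remote side sees exactly one complete write.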
That's just the thing: the Parquet format is defined in such a way that all metadata that needs to know the sizes of things gets written after (at larger seek values than) the data it represents. With causal knowledge of only the past, it can be written from the beginning to the end of the file, in order. That can't be done with the ROOT format, especially if the file is to be valid between writing individual objects and if the sizes of everything that will be written aren't known in advance.
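The difference between the two layouts can be sketched with two toy writers. Neither is the real Parquet or ROOT layout; `write_footer_style`, `write_header_style`, `CountingBuffer`, and the fixed-size directory are assumptions made purely to count how often each strategy has to seek backward.

```python
import io
import struct

ENTRY = struct.Struct("<QQ")   # (offset, size) of one chunk
COUNT = struct.Struct("<I")    # number of chunks


class CountingBuffer(io.BytesIO):
    """In-memory file that counts absolute seeks to an earlier position."""

    def __init__(self):
        super().__init__()
        self.backward_seeks = 0

    def seek(self, pos, whence=io.SEEK_SET):
        if whence == io.SEEK_SET and pos < self.tell():
            self.backward_seeks += 1
        return super().seek(pos, whence)


def write_footer_style(buf, chunks):
    """Data first, index last: strictly append-only."""
    index = []
    for chunk in chunks:
        index.append((buf.tell(), len(chunk)))
        buf.write(chunk)
    for offset, size in index:
        buf.write(ENTRY.pack(offset, size))
    buf.write(COUNT.pack(len(index)))


def write_header_style(buf, chunks, max_entries=8):
    """Index first: patch the header after every chunk so the file stays valid."""
    buf.write(COUNT.pack(0))
    buf.write(b"\0" * ENTRY.size * max_entries)   # reserved directory slots
    for i, chunk in enumerate(chunks):
        if i >= max_entries:
            raise ValueError("directory is full")
        offset = buf.seek(0, io.SEEK_END)
        buf.write(chunk)
        buf.seek(0)                               # jump back to the header
        buf.write(COUNT.pack(i + 1))
        buf.seek(COUNT.size + i * ENTRY.size)
        buf.write(ENTRY.pack(offset, len(chunk)))


chunks = [b"aaaa", b"bb", b"cccccc"]
footer_buf = CountingBuffer()
write_footer_style(footer_buf, chunks)
header_buf = CountingBuffer()
write_header_style(header_buf, chunks)
```

In this toy, the footer-style writer never seeks backward, while the header-style writer seeks backward once per chunk. Over a network connection, each of those backward seeks would be an extra round trip.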
Reported by @chrisburr in https://github.com/scikit-hep/uproot5/issues/1157#issuecomment-1979731811:
In my experience, these `File` objects are heavy; slow to open. fsspec's `cat` interface is stateless, so it seems that you have to create a new one of these for every call, but that means every TBasket in Uproot.

Is there an alternative that we can use, some `multi_cat` or a context that holds the `File` object so that we don't need so many? Is there a way to use XRootD in a lightweight, stateless way (like HTTP connections)?
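The cost difference between one open per range and one open for a batch of ranges can be sketched against a local file. `CountingOpener`, `cat_one`, and `cat_many` are hypothetical helpers written for this illustration; they mimic the idea behind a batched range read and are not fsspec's `cat` or `cat_ranges` implementations.

```python
import os
import tempfile


class CountingOpener:
    """Opens local files and counts how many opens were needed."""

    def __init__(self):
        self.opens = 0

    def open(self, path):
        self.opens += 1
        return open(path, "rb")


def cat_one(opener, path, start, end):
    """Stateless style: a fresh open for every byte range."""
    with opener.open(path) as f:
        f.seek(start)
        return f.read(end - start)


def cat_many(opener, path, ranges):
    """Batched style: one open shared by all byte ranges."""
    with opener.open(path) as f:
        chunks = []
        for start, end in ranges:
            f.seek(start)
            chunks.append(f.read(end - start))
        return chunks


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(100)))
    path = tmp.name

ranges = [(0, 10), (40, 50), (90, 100)]   # stand-ins for TBasket byte ranges

per_call = CountingOpener()
one_by_one = [cat_one(per_call, path, s, e) for s, e in ranges]

batched = CountingOpener()
all_at_once = cat_many(batched, path, ranges)

os.remove(path)
```

Both approaches return the same bytes, but the per-call version opens the file once per range while the batched version opens it once in total. With a heavy `File` object, that per-open cost is what dominates.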