jrbourbeau opened this issue 1 year ago
s3fs caches file listings, so the simplest workaround to get the lengths of all the files you need is to prospectively `ls()`/`find()` in the right locations beforehand. We can also enable passing the size (+etag, ...) explicitly to `open()` if you have that information from elsewhere; I think we talked about this.
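For example, something like this (bucket, prefix, and file name are hypothetical):

```python
import s3fs

fs = s3fs.S3FileSystem()

# Hypothetical bucket/prefix: a single recursive listing caches the size
# (and etag) of every object under it.
fs.find("my-bucket/netcdf-data/")

# This open can now take the size from the listings cache instead of
# issuing its own HEAD request for the file.
f = fs.open("my-bucket/netcdf-data/file-0001.nc", mode="rb")
```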
Where you need real file-like objects supporting `seek()`/random access, knowing the size is necessary so that the readahead buffer doesn't attempt to read bytes that don't exist in the target. On the other hand, the best caching strategy I have found for kerchunking HDF5 files is "first", since that's where the majority of the metadata lives. In that case, knowing the size should not be required, and maybe we can do some work to make it a lazy attribute.
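As a sketch (the path is hypothetical), that just means choosing the cache type at open time:

```python
import s3fs

fs = s3fs.S3FileSystem()

# "first" keeps only the first block of the file in memory, which is where
# HDF5 puts most of its metadata, instead of reading ahead through the file.
with fs.open("my-bucket/data/example.h5", mode="rb", cache_type="first") as f:
    signature = f.read(8)  # HDF5 files begin with b"\x89HDF\r\n\x1a\n"
```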
(Including the etag is optional, but all `open()` calls currently do use it, to make sure the file didn't change during reading.)
> The `fs.open` call often takes ~0.5-1.5 seconds to run
Worth mentioning that this value will be higher on the first call, due to the time needed to set up the HTTP session (SSL, etc.) and query the bucket location - you would need to pay this latency at some point regardless.
When I open a file on S3 like this:
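(a minimal sketch with a stand-in bucket/key)

```python
import s3fs

fs = s3fs.S3FileSystem()

# Open a single remote object read-only as a file-like object.
f = fs.open("my-bucket/path/to/data.nc", mode="rb")
```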
The `fs.open` call often takes ~0.5-1.5 seconds to run. Here's a snakeviz profile (again, just of the `fs.open` call) where it looks like most of the time is spent in a `details` call that hits S3.

I think this is mostly to get the file size (though I'm not sure why the size is needed at file object creation time), because if I pass the file size to `fs.open`, then things are much faster:
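For example (stand-in bucket/key and size; I'm assuming the `size=` keyword is what carries it through to the file object):

```python
import s3fs

fs = s3fs.S3FileSystem()

# Passing the (hypothetical) size up front means s3fs doesn't have to make
# the extra "details"/HEAD call just to find out how long the file is.
f = fs.open("my-bucket/path/to/data.nc", mode="rb", size=123_456_789)
```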
@martindurant do you have a sense for what's possible here to speed up opening files?

The actual use case I'm interested in is passing a bunch (100k) of netCDF files to Xarray, whose `h5netcdf` engine requires open file objects.
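Roughly what that pipeline looks like (prefix hypothetical; a single `find()` up front keeps the per-file opens cheap):

```python
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem()

# Hypothetical prefix: one find() enumerates the files and primes the
# listings cache, so the many opens below don't each need their own HEAD.
paths = fs.find("my-bucket/netcdf-data/")
files = [fs.open(path, mode="rb") for path in paths]

# h5netcdf can read from file-like objects, so the open handles are passed
# straight to xarray.
ds = xr.open_mfdataset(files, engine="h5netcdf", combine="by_coords")
```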