fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Opening lots of files can be slow #816

Open jrbourbeau opened 1 year ago

jrbourbeau commented 1 year ago

When I open a file on S3 like this:

import fsspec

fs = fsspec.filesystem('s3', anon=True)
path = "coiled-datasets/uber-lyft-tlc/part.93.parquet"
fs.open(path, mode="rb")

The fs.open call often takes ~0.5-1.5 seconds to run. Here's a snakeviz profile (of just the fs.open call) where it looks like most of the time is spent in a details call that hits S3:

[Screenshot: snakeviz profile of the fs.open call, 2023-10-26]

I think this is mostly to get the file size (though I'm not sure why the size is needed at file object creation time) because if I pass the file size to fs.open, then things are much faster:

[Screenshot: snakeviz profile of fs.open with the file size passed in, 2023-10-26]
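A minimal sketch of the "fetch the size once, reuse it" pattern, demonstrated with the local filesystem so it runs offline; with s3fs, passing the size to open() is what would skip the extra details call:

```python
import os
import tempfile

import fsspec

# Local stand-in for an S3 bucket (the paths here are made up)
fs = fsspec.filesystem("file")
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "part.93.parquet")
with open(path, "wb") as f:
    f.write(b"x" * 64)

# One metadata lookup; the result can be reused for many open() calls
size = fs.info(path)["size"]
# With s3fs this would be: fs.open(path, mode="rb", size=size)
```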

@martindurant do you have a sense for what's possible here to speed up opening files?

The actual use case I'm interested in is passing a bunch (100k) of netcdf files to Xarray, whose h5netcdf engine requires open file objects.

martindurant commented 1 year ago

s3fs caches file listings, so the simplest workaround for getting the lengths of all the files you need is to prospectively ls()/find() in the right locations beforehand. We can also enable passing the size (+etag, ...) explicitly to open() if you have that information from elsewhere; I think we talked about this.
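A sketch of that list-first workaround, shown with the local filesystem (which shares the fsspec interface); on s3fs, the same ls()/find() call also warms the listings cache, so subsequent info()/open() calls need no extra requests. The directory and file names are illustrative:

```python
import os
import tempfile

import fsspec

fs = fsspec.filesystem("file")
tmpdir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmpdir, f"part.{i}.parquet"), "wb") as f:
        f.write(b"x" * 128)

# A single listing yields every file's size up front
sizes = {entry["name"]: entry["size"] for entry in fs.ls(tmpdir, detail=True)}
```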

Where you need real file-like objects supporting seek() random access, knowing the size is necessary so that the readahead buffer doesn't attempt to read bytes that don't exist in the target. On the other hand, the best caching strategy I have found for kerchunking HDF5 files is "first", since that's where the majority of the metadata lives. In that case, knowing the size should not be required, and maybe we can do some work to make it a lazy attribute.
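A small sketch of why "first" suits HDF5 headers, exercising fsspec's FirstChunkCache directly (the _fetch call is normally made internally by the file object, and the fetcher here is a stand-in for a ranged GET against S3):

```python
from fsspec.caching import FirstChunkCache

calls = []

def fetcher(start, end):
    # Stand-in for a ranged GET request; records each call made
    calls.append((start, end))
    return bytes(end - start)

# "first" keeps only the first block in memory; repeated reads of the
# header region are then served without further requests
cache = FirstChunkCache(blocksize=1024, fetcher=fetcher, size=1_000_000)
header1 = cache._fetch(0, 100)  # triggers one fetch of the first block
header2 = cache._fetch(0, 100)  # served from the cached first chunk
```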

(Including the etag is optional, but all open() calls currently do use it, to make sure the file didn't change during reading)

martindurant commented 1 year ago

> The fs.open call often takes ~0.5-1.5 seconds to run

Worth mentioning that this value will be higher on the first call, due to the time needed to set up the HTTP session (TLS, etc.) and to query the bucket location - you would pay this latency at some point regardless.