fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
880 stars 272 forks source link

Glob ** support could be more efficient #217

Open danielnaab opened 5 years ago

danielnaab commented 5 years ago

When globbing paths with **, such as:

s3://my-bucket/**

The underlying fsspec implementation (referenced in #54) will iterate over each "directory" in the bucket, leading to multiple api calls. However, boto3's list_objects_v2 function will by default return all sub-keys with the given prefix, if the delimiter is omitted:

response = s3.list_objects_v2(Bucket="my-bucket", Prefix="")
response["Contents"]  == all keys in the bucket

The above possibility implies there is an optimization available to avoid repeated calls to S3Filesystem._lsdir, which should significantly speed up this operation, by removing the delimiter argument as specified here: https://github.com/dask/s3fs/blob/master/s3fs/core.py#L325 (Understanding, of course, that globbing ** in the middle of the path adds extra complexity)

martindurant commented 5 years ago

Although you are totally right, getting all files below a fixture like this is the minority case, and so we choose to optimise for what I think it more typical: one- or two-level globs. It might be acceptable to add shortcuts that use this call instead within glob and find, so long as the default is maintained.

danielnaab commented 5 years ago

Although you are totally right, getting all files below a fixture like this is the minority case, and so we choose to optimise for what I think it more typical: one- or two-level globs. It might be acceptable to add shortcuts that use this call instead within glob and find, so long as the default is maintained.

Hi @martindurant thanks for the quick response. Yes makes sense, this optimization can only safely be done if the pattern ends in **, or if knowledge of the bucket structure exists... Also, in addition to find, walk would benefit with the default maxdepth=None.

martindurant commented 5 years ago

(except that walk explicitly doesn't work like that, but iterates directory-wise. You would have to have two very separate code paths)

Are you interested in implementing this?

danielnaab commented 5 years ago

Sorry I'm using boto3 directly for my use case, but wanted to share the speed up opportunity. If I can get some time scheduled I'd be happy to revisit and try my hand but I don't have the time ATM.

Also, thinking through this a bit, I don't think there is any circumstances where not using this approach makes sense, as a ** glob in the middle of a path still needs to query for every subdirectory to the right of the **. (Unless I misunderstand the semantics of **.)

martindurant commented 5 years ago

Correct, but glob rarely includes **. I would love to see a PR sometime, I can't promise when I might work on it. Perhaps someone else will. I still maintain that this is a relatively rare usage.