Open danielnaab opened 5 years ago
Although you are totally right, getting all files below a fixture like this is the minority case, and so we choose to optimise for what I think it more typical: one- or two-level globs. It might be acceptable to add shortcuts that use this call instead within glob and find, so long as the default is maintained.
Although you are totally right, getting all files below a fixture like this is the minority case, and so we choose to optimise for what I think it more typical: one- or two-level globs. It might be acceptable to add shortcuts that use this call instead within glob and find, so long as the default is maintained.
Hi @martindurant thanks for the quick response. Yes makes sense, this optimization can only safely be done if the pattern ends in **
, or if knowledge of the bucket structure exists... Also, in addition to find
, walk
would benefit with the default maxdepth=None
.
(except that walk explicitly doesn't work like that, but iterates directory-wise. You would have to have two very separate code paths)
Are you interested in implementing this?
Sorry I'm using boto3 directly for my use case, but wanted to share the speed up opportunity. If I can get some time scheduled I'd be happy to revisit and try my hand but I don't have the time ATM.
Also, thinking through this a bit, I don't think there is any circumstances where not using this approach makes sense, as a **
glob in the middle of a path still needs to query for every subdirectory to the right of the **
. (Unless I misunderstand the semantics of **
.)
Correct, but glob rarely includes **. I would love to see a PR sometime, I can't promise when I might work on it. Perhaps someone else will. I still maintain that this is a relatively rare usage.
When globbing paths with
**
, such as:The underlying fsspec implementation (referenced in #54) will iterate over each "directory" in the bucket, leading to multiple api calls. However, boto3's
list_objects_v2
function will by default return all sub-keys with the given prefix, if the delimiter is omitted:The above possibility implies there is an optimization available to avoid repeated calls to
S3Filesystem._lsdir
, which should significantly speed up this operation, by removing the delimiter argument as specified here: https://github.com/dask/s3fs/blob/master/s3fs/core.py#L325 (Understanding, of course, that globbing**
in the middle of the path adds extra complexity)