Open mhfrantz opened 2 months ago
I ran into the issue too. The crux of the matter seems to be a confluence of a couple of different things:

- `glob` calls `find` in the underlying `AbstractFileSystem` object, with a `maxdepth` based on the glob structure.
- `find()` doesn't REALLY respect the `maxdepth` argument. It will return ALL objects under the file path with a `_list_objects` call and then filter those objects out based on the `maxdepth`. This might be handled better by building a 'fake' glob based on the `maxdepth` parameter and feeding that into the list objects call.

Previously, gcsfs had a `glob` implementation that called the base `glob` implementation, which then used the base `find` (which uses a `walk`-based algorithm to only recurse down to the appropriate depth). Performance under that algorithm was pretty good. Now it's REALLY based on the total number of objects under the glob, which can be really bad.
There are a couple of potential fixes in `glob`/`find`:

- `find`: build a fake glob from the `maxdepth` parameter and feed that (with a delimiter) into GCS using `matchGlob`.
- `glob`: implement it directly using `matchGlob`.
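For reference, `matchGlob` is a parameter of the GCS JSON API's `objects.list` method that filters object names server-side, so only matching names come back over the wire. The sketch below merely emulates that filtering on an in-memory list with `fnmatch` (an approximation: `fnmatch`'s `*` also crosses `/`, which the real `matchGlob` does not):

```python
import fnmatch

def match_glob(names, pattern):
    """Emulate server-side matchGlob filtering on an in-memory list.
    Caveat: fnmatch's "*" also crosses "/", unlike real matchGlob."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]

objects = ["prefix-a-suffix", "prefix-b-suffix", "other/deep/file", "prefix-c"]
print(match_glob(objects, "prefix*suffix"))  # ['prefix-a-suffix', 'prefix-b-suffix']
```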
The following monkey patch is significantly faster:
```python
import fsspec
from fsspec.asyn import AsyncFileSystem
from gcsfs import GCSFileSystem

# Keep references to the base-class implementations so the patches
# below can bypass the gcsfs overrides.
base_glob = AsyncFileSystem._glob
base_find = AsyncFileSystem._find


async def glob_patch(self, path, prefix="", **kwargs):
    if not prefix:
        # Identify pattern prefixes. Ripped from fsspec.spec.AbstractFileSystem.glob
        # and matches the glob.has_magic patterns.
        indstar = path.find("*") if path.find("*") >= 0 else len(path)
        indques = path.find("?") if path.find("?") >= 0 else len(path)
        indbrace = path.find("[") if path.find("[") >= 0 else len(path)
        ind = min(indstar, indques, indbrace)
        prefix = path[:ind].split("/")[-1]
    return await base_glob(self, path, prefix=prefix, **kwargs)


async def find_patch(self, path, maxdepth=None, withdirs=False, detail=False, **kwargs):
    # Forward straight to the base find, restoring the walk-based algorithm.
    return await base_find(self, path, maxdepth=maxdepth, withdirs=withdirs, detail=detail, **kwargs)


GCSFileSystem._glob = glob_patch
GCSFileSystem._find = find_patch
```
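As a quick sanity check of the prefix extraction used in `glob_patch`, the same index computation can be run standalone (illustrative helper that mirrors the logic above; the function name is my own):

```python
def extract_prefix(path: str) -> str:
    # Same logic as in glob_patch: find the first glob magic character
    # ("*", "?", "[") and take the last path segment before it.
    indstar = path.find("*") if path.find("*") >= 0 else len(path)
    indques = path.find("?") if path.find("?") >= 0 else len(path)
    indbrace = path.find("[") if path.find("[") >= 0 else len(path)
    ind = min(indstar, indques, indbrace)
    return path[:ind].split("/")[-1]

print(extract_prefix("bucket-name/prefix*suffix"))  # prefix
print(extract_prefix("bucket/dir/part-?.csv"))      # part-
```

With that prefix passed through to the backend, the listing only has to consider objects whose names start with `prefix` rather than everything in the bucket.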
Thanks for looking into this and performing the benchmark. I have a suspicion, which ought to be tested, that you would find the opposite preference in the case of a very deep directory structure with very few files. At the extreme, this would be a single file like "bucket/dir1/dir2/dir3/dir4/dir5/file", where a glob would need to walk as many directories deep as given by the glob pattern (in the `walk` implementation), or would only need one call with one returned item (in the `find` implementation).
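A rough, hypothetical model of the call counts behind that suspicion: a walk-based glob issues roughly one list call per directory visited down to the pattern depth, while the flat `find` issues a single call whose cost scales with the number of objects returned. A toy sketch (my own model, not gcsfs code):

```python
def walk_list_calls(depth: int, dirs_per_level: int = 1) -> int:
    """Toy model: one list call per directory visited by a walk-based
    glob descending ``depth`` levels (illustration only)."""
    total, frontier = 0, 1
    for _ in range(depth):
        total += frontier        # one call per directory at this level
        frontier *= dirs_per_level
    return total

# Single-file case "bucket/dir1/.../dir5/file": the walk needs one list
# call per level, while the flat find needs a single call that returns
# one object.
print(walk_list_calls(depth=5))  # 5
```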
When using `GCSFileSystem.glob` with a pattern like `"bucket-name/prefix*suffix"`, version 2023.9.0 introduced a performance regression. Previously, this `glob` would be resolved with an efficient API call whose performance was proportional to the number of matching objects. Since 2023.9.0, the performance seems to scale with the number of objects in the bucket. In my system, the buckets have a "flat" pseudo-folder structure with 1e5+ objects.

Debug output from 2023.6.0:

Debug output from 2023.9.0 (and more recent versions like 2024.6.0):

Perhaps the `prefix` argument is no longer being specified to the GCS backend (e.g. in `GCSFileSystem._list_objects`). I've been studying the differences between 2023.6.0 and 2023.9.0 in both this repo and `filesystem_spec`, but I haven't seen evidence of this change being explicit or intentional. The unit testing of `glob` seems to be functional, so it wouldn't catch a performance regression.
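To illustrate why a dropped `prefix` would produce exactly this scaling, here is a toy emulation (in-memory list, not gcsfs code) of a listing API with and without a server-side prefix filter:

```python
# Toy object listing: ~1e5 unrelated names plus two matches.
all_objects = [f"other-{i}" for i in range(100_000)] + [
    "prefix-1-suffix",
    "prefix-2-suffix",
]

def list_objects(names, prefix=""):
    """Emulate the GCS objects.list ``prefix`` parameter: only names
    starting with ``prefix`` are returned by the server."""
    return [n for n in names if n.startswith(prefix)]

with_prefix = list_objects(all_objects, prefix="prefix")
without_prefix = list_objects(all_objects)
print(len(with_prefix), len(without_prefix))  # 2 100002
```

With the prefix, the client transfers and filters 2 names; without it, all 100,002 come back and must be filtered client-side, which matches the observed "scales with bucket size" behavior.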