fsspec / gcsfs

Pythonic file-system interface for Google Cloud Storage
http://gcsfs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

glob performance regression #641

Open mhfrantz opened 2 months ago

mhfrantz commented 2 months ago

When using GCSFileSystem.glob with a pattern like "bucket-name/prefix*suffix", version 2023.9.0 introduced a performance regression. Previously, this glob would be resolved with an efficient API call whose performance was proportional to the number of matching objects. Since 2023.9.0, the performance seems to scale with the number of objects in the bucket. In my system, the buckets have a "flat" pseudo-folder structure with 1e5+ objects.
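The difference can be modeled without GCS at all. Below is a stdlib-only toy sketch (illustrative function names, not the gcsfs API) contrasting the two listing strategies: a server-side prefix-scoped list only touches matching candidates, while an unscoped recursive list touches every object in the bucket.

```python
import fnmatch

# A flat bucket: 100k non-matching objects plus two that match the glob.
bucket = [f"other-{i}" for i in range(100_000)] + ["prefix-1-suffix", "prefix-2-suffix"]

def glob_with_prefix(objects, pattern, prefix):
    # pre-2023.9.0 behavior: the GCS list call is scoped server-side by
    # `prefix`, so only matching candidates ever come over the wire
    candidates = [o for o in objects if o.startswith(prefix)]
    return [o for o in candidates if fnmatch.fnmatch(o, pattern)]

def list_all_then_filter(objects, pattern):
    # post-2023.9.0 behavior: every object in the bucket is listed,
    # then filtered client-side -- work scales with bucket size
    return [o for o in objects if fnmatch.fnmatch(o, pattern)]
```

Both functions return the same two objects; the difference is how many objects the backend has to enumerate to produce them.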

Debug output from 2023.6.0:

DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
DEBUG:gcsfs.credentials:GCS refresh
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None

Debug output from 2023.9.0 (and more recent versions like 2024.6.0):

DEBUG:asyncio:Using selector: EpollSelector
DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
DEBUG:gcsfs.credentials:GCS refresh
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
[repeated 100+ times]

Perhaps the prefix argument is no longer being specified to the GCS backend (e.g. in GCSFileSystem._list_objects). I've been studying the differences between 2023.6.0 and 2023.9.0 in both this repo and filesystem_spec, but I haven't seen evidence of this change being explicit or intentional. The unit testing of glob seems to be functional, so it wouldn't catch a performance regression.

discretizer commented 3 weeks ago

I ran into this issue too. The crux of the matter seems to be a confluence of a few different things:

  1. The current code uses the default glob implementation in the underlying AbstractFileSystem, which calls find with a maxdepth derived from the glob structure.
  2. The current implementation of find() doesn't REALLY respect the maxdepth argument: it returns ALL objects under the path via a _list_objects call and then filters them out client-side based on maxdepth. This might be handled better by building a 'fake' glob from the maxdepth parameter and feeding that into the list-objects call.
  3. Previously there was a glob implementation that called the base glob, which in turn used the base find (a walk-based algorithm that only recurses down to the appropriate depth).
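Point 2 can be illustrated with a small stdlib-only sketch (a hypothetical helper, not the actual gcsfs code; counting depth by '/' separators is an assumption): the full recursive listing comes back first, and depth filtering only happens afterwards.

```python
def find_then_filter(all_objects, root, maxdepth):
    # Mimics the behavior described in point 2: the recursive list call
    # returns EVERY object under `root`, and only afterwards are entries
    # deeper than `maxdepth` discarded client-side.
    root = root.rstrip("/")
    base_depth = root.count("/")
    listed = [o for o in all_objects if o.startswith(root + "/")]  # full listing
    return [o for o in listed if o.count("/") - base_depth <= maxdepth]

objects = ["bucket/a", "bucket/dir/b", "bucket/dir/sub/c"]
```

Even when maxdepth=1 keeps only "bucket/a", the intermediate `listed` still holds every object under the path, which is where the cost goes.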

https://github.com/fsspec/gcsfs/blob/d7952a946710da486eb7a39fed1d92f3c065c8f1/gcsfs/core.py#L826-L836

Performance under this algorithm was pretty good. Now it's REALLY driven by the total number of objects under the glob path - which can be really bad.

There are a couple of potential fixes:

discretizer commented 3 weeks ago

The following monkey patch is significantly faster:

import fsspec

from gcsfs import GCSFileSystem
from fsspec.asyn import AsyncFileSystem

base_glob = AsyncFileSystem._glob
base_find = AsyncFileSystem._find
async def glob_patch(self, path, prefix="", **kwargs): 
    if not prefix: 
        # Identify pattern prefixes. Ripped from fsspec.spec.AbstractFileSystem.glob and matches 
        # the glob.has_magic patterns. 
        indstar = path.find("*") if path.find("*") >= 0 else len(path) 
        indques = path.find("?") if path.find("?") >= 0 else len(path) 
        indbrace = path.find("[") if path.find("[") >= 0 else len(path) 

        ind = min(indstar, indques, indbrace) 
        prefix = path[:ind].split("/")[-1] 
    return await base_glob(self, path, prefix=prefix, **kwargs)

async def find_patch(self, path, maxdepth=None, withdirs=False, detail=False, **kwargs):
    return await base_find(self, path, maxdepth=maxdepth, withdirs=withdirs, detail=detail, **kwargs)

GCSFileSystem._glob = glob_patch
GCSFileSystem._find = find_patch
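The prefix detection in the patch can be exercised on its own. Here is the same logic extracted into a standalone stdlib-only function (hypothetical name, for illustration):

```python
def glob_prefix(path):
    # Cut the path at the first glob metacharacter (*, ?, [) and keep
    # the last path segment before it -- same logic as the patch above.
    indstar = path.find("*") if path.find("*") >= 0 else len(path)
    indques = path.find("?") if path.find("?") >= 0 else len(path)
    indbrace = path.find("[") if path.find("[") >= 0 else len(path)
    ind = min(indstar, indques, indbrace)
    return path[:ind].split("/")[-1]

print(glob_prefix("bucket-name/prefix*suffix"))  # prints "prefix"
```

For the pattern from the original report, this yields "prefix", which lets the GCS list call be scoped server-side instead of enumerating the whole bucket.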
martindurant commented 3 weeks ago

Thanks for looking into this and performing the benchmark. I have a suspicion - which ought to be tested - that you would find the opposite preference in the case where you have a very deep directory structure with very few files. At the extreme, this could be a single file like "bucket/dir1/dir2/dir3/dir4/dir4/file", where a glob would need to walk as many directories deep as the glob pattern specifies (in the walk implementation) but would only need one call with one returned item (in the find implementation).