fsspec / gdrivefs

Google drive implementation of fsspec
BSD 2-Clause "Simplified" License

Support for Shared Drives #40

Open rhunwicks opened 8 months ago

rhunwicks commented 8 months ago

Currently, gdrivefs doesn't support shared drives.

I have a setup like:

    import json
    import os

    root_folder: str = "gdrive://Discovery Folder/Worksheets"
    storage_options: dict = {
        "token": "service_account",
        "access": "read_only",
        # The environment variable holds the service account JSON itself
        "creds": json.loads(os.environ["GOOGLE_APPLICATION_CREDENTIALS"]),
        # Folder ID of the shared drive
        "root_file_id": "0123456789ABCDEFGH",
    }

If I attempt to access that file (using commit 2b48baa11d1697401c914e5ff239dbab4d9c8f71), I get the error:

FileNotFoundError: Directory 0123456789ABCDEFGH has no child named Discovery Folder

  File "./pipelines/assets/base.py", line 210, in original_files
    with p.fs.open(p.path, mode="rb") as f:
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 1295, in open
    f = self._open(
  File "./lib/python3.10/site-packages/gdrivefs/core.py", line 249, in _open
    return GoogleDriveFile(self, path, mode=mode, **kwargs)
  File "./lib/python3.10/site-packages/gdrivefs/core.py", line 270, in __init__
    super().__init__(fs, path, mode, block_size, autocommit=autocommit,
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 1651, in __init__
    self.size = self.details["size"]
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 1664, in details
    self._details = self.fs.info(self.path)
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 662, in info
    out = self.ls(path, detail=True, **kwargs)
  File "./lib/python3.10/site-packages/gdrivefs/core.py", line 174, in ls
    files = self._ls_from_cache(path)
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 372, in _ls_from_cache
    raise FileNotFoundError(path)

The root_file_id is set to the folder ID of a Google Drive shared drive (see https://support.google.com/a/users/answer/7212025?hl=en).

As per https://developers.google.com/drive/api/guides/enable-shareddrives#:~:text=The%20supportsAllDrives%3Dtrue%20parameter%20informs,require%20additional%20shared%20drive%20functionality. we need to set supportsAllDrives=True and includeItemsFromAllDrives=True when calling files.list in order for the API client to find the files.

In my case, if I change the existing:

    def _list_directory_by_id(self, file_id, trashed=False, path_prefix=None):
        all_files = []
        page_token = None
        afields = 'nextPageToken, files(%s)' % fields
        query = f"'{file_id}' in parents  "
        if not trashed:
            query += "and trashed = false "
        while True:
            response = self.service.list(q=query,
                                         spaces=self.spaces, fields=afields,
                                         pageToken=page_token,
                                         ).execute()
            for f in response.get('files', []):
                all_files.append(_finfo_from_response(f, path_prefix))
            more = response.get('incompleteSearch', False)
            page_token = response.get('nextPageToken', None)
            if page_token is None:
                break
        return all_files

to

    def _list_directory_by_id(self, file_id, trashed=False, path_prefix=None):
        all_files = []
        page_token = None
        afields = 'nextPageToken, files(%s)' % fields
        query = f"'{file_id}' in parents  "
        if not trashed:
            query += "and trashed = false "
        while True:
            response = self.service.list(
                q=query,
                spaces=self.spaces, fields=afields,
                pageToken=page_token,
                includeItemsFromAllDrives=True,  # Required for shared drive support
                supportsAllDrives=True,    # Required for shared drive support
            ).execute()
            for f in response.get('files', []):
                all_files.append(_finfo_from_response(f, path_prefix))
            more = response.get('incompleteSearch', False)
            page_token = response.get('nextPageToken', None)
            if page_token is None:
                break
        return all_files

(note the change in the call to self.service.list)

then my code works, and the filesystem can find the file and open it successfully.

I am happy to prepare an MR, but you would need to decide whether you are happy for me to enable shared drive support in all cases, or whether you want to control it via storage_options. If it is controlled via storage_options, you would also need to decide whether it should default to off (completely backwards compatible) or on (existing users with shared drives may start seeing files that gdrivefs does not currently return).
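
As a rough illustration of the storage_options variant, here is a minimal sketch. The flag name `shared_drives` and the helper `build_list_kwargs` are hypothetical and do not exist in gdrivefs; only `supportsAllDrives` and `includeItemsFromAllDrives` are real parameters of the Drive v3 files.list call.

```python
def build_list_kwargs(query, spaces, fields, page_token=None, shared_drives=False):
    """Assemble keyword arguments for a Drive v3 files.list call.

    `shared_drives` is a hypothetical opt-in flag, not an existing
    gdrivefs option. With the default (False) the arguments match
    what _list_directory_by_id passes today.
    """
    kwargs = {
        "q": query,
        "spaces": spaces,
        "fields": fields,
        "pageToken": page_token,
    }
    if shared_drives:
        # Both parameters are needed for files.list to return
        # items that live on a shared drive.
        kwargs["supportsAllDrives"] = True
        kwargs["includeItemsFromAllDrives"] = True
    return kwargs
```

Inside the pagination loop the call would then become something like `self.service.list(**build_list_kwargs(query, self.spaces, afields, page_token, shared_drives=...)).execute()`. Defaulting the flag to off would keep the current behaviour for existing users.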

rhunwicks commented 8 months ago

Actually, I see there was already a request for this in #26.

martindurant commented 8 months ago

Yes, exactly so - I believe this is well worth adding, but I am unsure how to expose the possibility to users. I believe simply checking all possible drives every time is probably a substantial slowdown, but I am happy to be told otherwise.

rhunwicks commented 8 months ago

@martindurant when you say "checking all possible drives" do you mean in the drives property, or in _list_directory_by_id?

I've only just started using gdrivefs, but it seems that you need to specify an exact path from the root folder set in the storage options, so I don't think enabling shared drives universally would be any slower: if you don't set the shared drive folder (or one of its subfolders) as the root_file_id in storage_options, then the filesystem won't search it.

And the mechanism that finds the exact file ID executes one request/response per path segment, so its performance seems to depend on how many levels deep your path is below the root_file_id, rather than on how many other folders exist that don't match the path.
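
To make that cost model concrete, here is a toy sketch of segment-by-segment resolution. It is not the gdrivefs implementation: `resolve_path`, `list_children` and the in-memory `TREE` are made up for illustration, with each `list_children` call standing in for one files.list request.

```python
def resolve_path(path, root_id, list_children):
    """Walk `path` one segment at a time, starting from `root_id`.

    `list_children(folder_id)` stands in for one files.list request
    and returns a dict mapping child names to file IDs.
    Returns (file_id, number_of_requests).
    """
    current = root_id
    requests = 0
    for segment in [s for s in path.split("/") if s]:
        children = list_children(current)
        requests += 1
        if segment not in children:
            raise FileNotFoundError(
                f"Directory {current} has no child named {segment}"
            )
        current = children[segment]
    return current, requests


# Toy in-memory tree standing in for Drive; all IDs are made up.
TREE = {
    "root": {"Discovery Folder": "f1", "Unrelated": "f2"},
    "f1": {"Worksheets": "f3"},
    "f2": {},
    "f3": {},
}

file_id, n_requests = resolve_path(
    "Discovery Folder/Worksheets", "root", lambda folder_id: TREE[folder_id]
)
```

In this sketch the number of requests equals the number of path segments. The "Unrelated" sibling is returned in the listing of the root but is never descended into, so adding more non-matching folders would not add requests.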