fsspec / gcsfs

Pythonic file-system interface for Google Cloud Storage
http://gcsfs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Token refresh does not work #627

Closed dbalabka closed 3 days ago

dbalabka commented 1 week ago

The problem is related to #32. Unfortunately, it is not obvious how to implement token refresh when the library is used through other libraries (dask -> pyarrow -> fsspec -> gcsfs). It would be great to have a more robust token refresh.

Here is the exception that we get when the job runs longer than an hour (gcsfs version 2023.12.2.post1):

HttpError: Invalid Credentials, 401

File /opt/conda/lib/python3.10/site-packages/dask_expr/_expr.py:3727, in _execute_task()

File /opt/conda/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py:97, in __call__()

File /opt/conda/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py:645, in read_parquet_part()

File /opt/conda/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py:646, in <listcomp>()

File /opt/conda/lib/python3.10/site-packages/dask/dataframe/io/parquet/arrow.py:641, in read_partition()

File /opt/conda/lib/python3.10/site-packages/dask/dataframe/io/parquet/arrow.py:1774, in _read_table()

File /opt/conda/lib/python3.10/site-packages/dask/dataframe/io/parquet/arrow.py:264, in _read_table_from_path()

File /opt/conda/lib/python3.10/site-packages/pyarrow/parquet/core.py:341, in __init__()

File ~/.../.venv/lib/python3.10/site-packages/pyarrow/_parquet.pyx:1250, in pyarrow._parquet.ParquetReader.open()
   1248 
   1249         with nogil:
-> 1250             check_status(builder.Open(self.rd_handle, properties, c_metadata))
   1251 
   1252         # Set up metadata

File ~/.../.venv/lib/python3.10/site-packages/pyarrow/types.pxi:88, in pyarrow.lib._datatype_to_pep3118()
     86 """
     87 try:
---> 88     char = _pep3118_type_map[type.id()]
     89 except KeyError:
     90     return None

File /opt/conda/lib/python3.10/site-packages/fsspec/spec.py:1844, in read()

File /opt/conda/lib/python3.10/site-packages/fsspec/caching.py:69, in _fetch()

File /opt/conda/lib/python3.10/site-packages/gcsfs/core.py:1850, in _fetch_range()

File /opt/conda/lib/python3.10/site-packages/fsspec/asyn.py:118, in wrapper()

File /opt/conda/lib/python3.10/site-packages/fsspec/asyn.py:103, in sync()

File /opt/conda/lib/python3.10/site-packages/fsspec/asyn.py:56, in _runner()

File /opt/conda/lib/python3.10/site-packages/gcsfs/core.py:1027, in _cat_file()

File /opt/conda/lib/python3.10/site-packages/gcsfs/core.py:437, in _call()

File /opt/conda/lib/python3.10/site-packages/decorator.py:221, in fun()

File /opt/conda/lib/python3.10/site-packages/gcsfs/retry.py:158, in retry_request()

File /opt/conda/lib/python3.10/site-packages/gcsfs/retry.py:123, in retry_request()

File /opt/conda/lib/python3.10/site-packages/gcsfs/core.py:430, in _request()

File /opt/conda/lib/python3.10/site-packages/gcsfs/retry.py:112, in validate_response()

HttpError: Invalid Credentials, 401

dbalabka commented 1 week ago

It seems there was an attempt to fix the problem in #486, but it was unsuccessful.

dbalabka commented 1 week ago

To refresh the token, we are calling the following sequence of methods on each request:

  1. GCSFileSystem._request()
  2. GCSFileSystem._get_headers()
  3. GCSFileSystem.credentials.apply()
  4. GCSFileSystem.credentials.maybe_refresh()

My suspicion is that GoogleCredentials.credentials == None does not necessarily mean we are dealing with anon auth. Hence, we have to keep an explicit token type identifier.
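
The call chain above can be pictured with a small stand-alone model. This is only a sketch and not gcsfs code: ToyCredentials and ToyFileSystem are hypothetical names, the "refresh" is simulated, and the early-return rules are assumptions made for illustration. It shows where a refresh is skipped (no token at all) and where it cannot happen (a token but no refresh_token).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCredentials:
    """Hypothetical stand-in for a credentials wrapper; not the gcsfs class."""

    token: Optional[str] = None
    refresh_token: Optional[str] = None
    expired: bool = False
    refresh_count: int = 0

    def maybe_refresh(self) -> None:
        if self.token is None:
            return  # nothing to refresh: treated as anonymous in this model
        if not self.expired:
            return  # token is still considered valid
        if self.refresh_token is None:
            raise RuntimeError("token expired and no refresh_token available")
        # Simulate exchanging the refresh token for a new access token.
        self.refresh_count += 1
        self.token = f"access-token-{self.refresh_count}"
        self.expired = False

    def apply(self, headers: dict) -> None:
        # Step 3 of the chain: refresh if needed, then attach the token.
        self.maybe_refresh()
        if self.token is not None:
            headers["Authorization"] = f"Bearer {self.token}"


class ToyFileSystem:
    """Hypothetical stand-in for the file system; it sends no real requests."""

    def __init__(self, credentials: ToyCredentials) -> None:
        self.credentials = credentials

    def _get_headers(self) -> dict:
        headers: dict = {}
        self.credentials.apply(headers)
        return headers

    def _request(self, url: str) -> dict:
        # A real client would issue an HTTP call; the model returns the headers.
        return self._get_headers()
```

In this model, an expired token with a refresh_token is silently replaced on the next _request(), while an expired bare token fails because there is nothing to exchange for a new one.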

dbalabka commented 3 days ago

It happens only if you provide a token string without a refresh_token, which is expected behavior:

import google.auth
import pandas as pd
from google.auth.transport.requests import Request


def get_storage_options():
    # It works, but I don't know if it's the best way to do it.
    #
    # Also, it seems the scope now needs to be specified, although it worked without one before.
    # Source: https://stackoverflow.com/questions/60401040/getting-invalid-scope-when-attempting-to-obtain-a-refresh-token-via-the-google-a
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    if not credentials.valid:
        credentials.refresh(Request())
    # Only the short-lived access token string is passed on (no refresh_token),
    # so it cannot be refreshed once it expires.
    return {
        "token": credentials.token,
    }

pd.read_parquet(..., storage_options=get_storage_options())

Relying on any other auth method fixes the problem.
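
As one possible alternative, here is a minimal sketch that passes an auth method name instead of a token string. It assumes that Application Default Credentials are available on every machine that opens the files (for example, each dask worker), and it relies on the "google_default" value of the gcsfs token option, which asks gcsfs to resolve the credentials itself:

```python
def get_storage_options() -> dict:
    # Assumption: Application Default Credentials are configured wherever
    # the file system is instantiated, including remote workers.
    # With "google_default", gcsfs obtains the credentials itself instead of
    # receiving a bare access token string.
    return {"token": "google_default"}


# Usage, mirroring the call above:
# pd.read_parquet(..., storage_options=get_storage_options())
```

Because the option is a plain string, it is also easy to pass through dask and pyarrow unchanged.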