delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

cannot read from public GCS bucket if not logged in #2859

Closed lostmygithubaccount closed 1 month ago

lostmygithubaccount commented 1 month ago

Environment

Delta-rs version: deltalake==0.19.2

Binding: Python

Environment:


Bug

What happened:

I have a public GCS bucket with a bunch of Delta Lake tables. The bucket grants viewer access to allUsers, meaning unauthenticated users can read from it. You can easily test this with pandas or other libraries (I hit this with Ibis):

[ins] In [1]: import gcsfs

[ins] In [2]: import pandas as pd

[ins] In [3]: from deltalake import DeltaTable

[nav] In [4]: pd.read_parquet("gs://ibis-analytics/penguins.parquet", storage_options={"token": "anon"})
Out[4]:
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0       Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1       Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2       Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3       Adelie  Torgersen             NaN            NaN                NaN          NaN    None  2007
4       Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
..         ...        ...             ...            ...                ...          ...     ...   ...
339  Chinstrap      Dream            55.8           19.8              207.0       4000.0    male  2009
340  Chinstrap      Dream            43.5           18.1              202.0       3400.0  female  2009
341  Chinstrap      Dream            49.6           18.2              193.0       3775.0    male  2009
342  Chinstrap      Dream            50.8           19.0              210.0       4100.0    male  2009
343  Chinstrap      Dream            50.2           18.7              198.0       3775.0  female  2009

[344 rows x 8 columns]

[nav] In [5]: pd.read_csv("gs://ibis-analytics/penguins.csv", storage_options={"token": "anon"})
Out[5]:
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0       Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1       Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2       Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3       Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4       Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
..         ...        ...             ...            ...                ...          ...     ...   ...
339  Chinstrap      Dream            55.8           19.8              207.0       4000.0    male  2009
340  Chinstrap      Dream            43.5           18.1              202.0       3400.0  female  2009
341  Chinstrap      Dream            49.6           18.2              193.0       3775.0    male  2009
342  Chinstrap      Dream            50.8           19.0              210.0       4100.0    male  2009
343  Chinstrap      Dream            50.2           18.7              198.0       3775.0  female  2009

[344 rows x 8 columns]

But trying to read a Delta Lake table from the same bucket results in an error, seemingly whenever the client is not authenticated with GCP:

[ins] In [6]: DeltaTable("gs://ibis-analytics/penguins.delta", storage_options={"token": "anon"})
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[6], line 1
----> 1 DeltaTable("gs://ibis-analytics/penguins.delta", storage_options={"token": "anon"})

File ~/repos/ibis-analytics/.venv/lib/python3.12/site-packages/deltalake/table.py:380, in DeltaTable.__init__(self, table_uri, version, storage_options, without_files, log_buffer_size)
    360 """
    361 Create the Delta Table from a path with an optional version.
    362 Multiple StorageBackends are currently supported: AWS S3, Azure Data Lake Storage Gen2, Google Cloud Storage (GCS) and local URI.
   (...)
    377
    378 """
    379 self._storage_options = storage_options
--> 380 self._table = RawDeltaTable(
    381     str(table_uri),
    382     version=version,
    383     storage_options=storage_options,
    384     without_files=without_files,
    385     log_buffer_size=log_buffer_size,
    386 )

OSError: Generic GCS error: Error performing token request: Error after 10 retries in 8.200125916s, max_retries:10, retry_timeout:180s, source:error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com%2Foauth2%2Fv4%2Ftoken)

There's not a lot in the stack trace to go on.

What you expected to happen:

The DeltaTable call above works without authentication, the same way the pandas reads do.

How to reproduce it:

You can try it out on the bucket noted above: gs://ibis-analytics has penguins.csv, penguins.parquet, and penguins.delta in it

More details:

this was reproduced by others as well

ion-elgreco commented 1 month ago

We simply use the object_store crate in Rust. If it's not working, it's because anon is not a supported config. You should try asking about this in arrow-rs, where object_store lives.

lostmygithubaccount commented 1 month ago

is there any documentation on what is supported in storage_options?

ion-elgreco commented 1 month ago

> is there any documentation on what is supported in storage_options?

You can find that here: https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html
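
For anyone comparing their storage_options against that page, here is a minimal sketch of a pre-flight check. The key list is an assumption, copied from memory of the GoogleConfigKey documentation linked above. It may be incomplete (it leaves out the generic client options, for example) and may lag behind newer object_store releases. The helper name unrecognized_gcs_options is hypothetical and not part of deltalake. The check uses exact string matching and does not call delta-rs at all.

```python
# Assumed list of GCS option keys, taken from the GoogleConfigKey docs linked
# above. This is NOT authoritative: check the docs for the object_store
# version your deltalake build actually uses.
KNOWN_GCS_KEYS = {
    "google_service_account",
    "service_account",
    "google_service_account_path",
    "service_account_path",
    "google_service_account_key",
    "service_account_key",
    "google_bucket",
    "google_bucket_name",
    "bucket",
    "bucket_name",
    "google_application_credentials",
}


def unrecognized_gcs_options(storage_options):
    """Return the option keys that are not in the assumed known-key list, sorted.

    Hypothetical helper for illustration only; it does an exact string match.
    """
    return sorted(key for key in storage_options if key not in KNOWN_GCS_KEYS)


# The fsspec-style option from the report is not an object_store key.
print(unrecognized_gcs_options({"token": "anon"}))
```

Run against the options from the report, the helper flags token. That would be consistent with the error above, where the client ends up requesting a token from the metadata server at 169.254.169.254 instead of reading anonymously. A flagged key is only a hint to go check the docs, not proof that the option is ignored.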