Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

`S3Config.credentials_provider` not used in write path #3367

Open kevinzwang opened 1 day ago


Describe the bug

If you write to an S3 bucket and configure your S3 credentials with a user-provided function via S3Config.credentials_provider, Daft does not currently pass those credentials on to the writer, so the write fails to authenticate.

To Reproduce

The following will fail:

import daft
import datetime
import boto3

def get_credentials():
    # Fetch credentials from the default boto3 session and wrap them in
    # daft.io.S3Credentials with a one-hour expiry.
    session = boto3.Session()
    creds = session.get_credentials()
    return daft.io.S3Credentials(
        key_id=creds.access_key,
        access_key=creds.secret_key,
        session_token=creds.token,
        expiry=datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1),
    )

s3_config = daft.io.S3Config(credentials_provider=get_credentials, region_name="us-west-1")
io_config = daft.io.IOConfig(s3=s3_config)

df = daft.from_pydict({"foo": [1, 2, 3]})

# Fails to authenticate: credentials_provider is not consulted on the write path.
df.write_parquet("s3://path/to/bucket.parquet", io_config=io_config)

Expected behavior

Daft should behave the same for reads (which currently work) and writes. It should fetch credentials from the credentials provider (or use cached credentials if they have already been fetched and are not expired) and pass them along to the PyArrow writer.
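
For comparison, the equivalent read with the same io_config succeeds today, since the read path does invoke the credentials provider (the S3 path below is a placeholder):

# Reads with the same io_config work, because the read path consults the
# credentials provider; the path here is only illustrative.
df2 = daft.read_parquet("s3://path/to/bucket.parquet", io_config=io_config)
df2.show()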

Component(s)

Parquet, CSV, Other

Additional context

Relevant part of the code where we set the PyArrow filesystem credentials for writing: https://github.com/Eventual-Inc/Daft/blob/main/daft/filesystem.py#L215-L235
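
A rough sketch of what the write-path fix could look like around the linked code, assuming a helper that resolves credentials from the provider (with simple caching) before constructing the PyArrow filesystem. The helper names, the module-level cache, and the S3Config/S3Credentials attribute access below are illustrative assumptions, not Daft's actual internals:

import datetime
import pyarrow.fs as pafs

_CREDS_CACHE = {}

def _resolve_s3_credentials(s3_config):
    # Hypothetical helper: call the user-supplied credentials_provider,
    # reusing the cached result until it expires.
    cached = _CREDS_CACHE.get("creds")
    now = datetime.datetime.now(datetime.timezone.utc)
    if cached is None or (cached.expiry is not None and cached.expiry <= now):
        _CREDS_CACHE["creds"] = s3_config.credentials_provider()
    return _CREDS_CACHE["creds"]

def _make_s3_filesystem_for_write(s3_config):
    # Hypothetical: mirror the read path by consulting the provider when
    # building the PyArrow S3FileSystem handed to the writer.
    if s3_config.credentials_provider is not None:
        creds = _resolve_s3_credentials(s3_config)
        return pafs.S3FileSystem(
            access_key=creds.key_id,
            secret_key=creds.access_key,
            session_token=creds.session_token,
            region=s3_config.region_name,
        )
    return pafs.S3FileSystem(region=s3_config.region_name)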