delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

AWS environment variables seem to take priority over storage_options #1700

Closed emanueledomingo closed 3 months ago

emanueledomingo commented 9 months ago

Environment

Delta-rs version: 0.11.0

Binding: Python

Environment:


Bug

What happened:

I'm not able to write to AWS S3 using AssumeRole. The same code works with deltalake 0.10.1.

What you expected to happen:

Write to AWS S3 using the credentials in storage_options instead of the environment ones.

How to reproduce it:

import os

import deltalake as dl
import pyarrow as pa  # 12.0.0
import boto3

aws_profile = "..."
aws_region = "..."
aws_role_arn = "..."

session = boto3.Session(profile_name=aws_profile, region_name=aws_region)
sts = session.client("sts")
cred = sts.assume_role(RoleArn=aws_role_arn, RoleSessionName="TestSession")["Credentials"]

pa_table = pa.Table.from_pylist([{'n_legs': 2, 'animals': 'Flamingo'}, {'year': 2021, 'animals': 'Centipede'}])
dt_table_uri = "..."  # s3://my-bucket/test/

storage_options = {
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    "AWS_ACCESS_KEY_ID": cred["AccessKeyId"],
    "AWS_SECRET_ACCESS_KEY": cred["SecretAccessKey"],
    "AWS_SESSION_TOKEN": cred["SessionToken"],
    "AWS_REGION": aws_region,
}

os.environ["AWS_ACCESS_KEY_ID"] = "WRONG_TOKEN"
os.environ["AWS_SECRET_ACCESS_KEY"] = "WRONG_TOKEN"
os.environ["AWS_SESSION_TOKEN"] = "WRONG_TOKEN"

dl.write_deltalake(
    table_or_uri=dt_table_uri,
    data=pa_table,
    mode="overwrite",
    storage_options=storage_options,
)

More details:

OSError
in try_get_table_and_table_uri(table_or_uri, storage_options)
    394     raise ValueError("table_or_uri must be a str, Path or DeltaTable")
    396 if isinstance(table_or_uri, (str, Path)):
--> 397     table = try_get_deltatable(table_or_uri, storage_options)
    398     table_uri = str(table_or_uri)
    399 else:

in try_get_deltatable(table_uri, storage_options)
    406 def try_get_deltatable(
    407     table_uri: Union[str, Path], storage_options: Optional[Dict[str, str]]
    408 ) -> Optional[DeltaTable]:
    409     try:
--> 410         return DeltaTable(table_uri, storage_options=storage_options)
    411     except TableNotFoundError:
    412         return None

in DeltaTable.__init__(self, table_uri, version, storage_options, without_files, log_buffer_size)
    228 """
    229 Create the Delta Table from a path with an optional version.
    230 Multiple StorageBackends are currently supported: AWS S3, Azure Data Lake Storage Gen2, Google Cloud Storage (GCS) and local URI.
   (...)
    244 
    245 """
    246 self._storage_options = storage_options
--> 247 self._table = RawDeltaTable(
    248     str(table_uri),
    249     version=version,
    250     storage_options=storage_options,
    251     without_files=without_files,
    252     log_buffer_size=log_buffer_size,
    253 )
    254 self._metadata = Metadata(self._table)

OSError: Generic S3 error: response error "<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message><AWSAccessKeyId>WRONG_TOKEN</AWSAccessKeyId><RequestId>XXX</RequestId><HostId>XXX</HostId></Error>", after 0 retries: HTTP status client error (403 Forbidden) for url (XXX)

Delta tries to connect using the environment variables instead of the storage options.

giacomorebecchi commented 9 months ago

I haven't deep-dived too much into the Rust code as I'm still a beginner Rustacean, but it seems to me that the issue arises from the following lines: https://github.com/delta-io/delta-rs/blob/4da7d66d06985d386e61d3fb124cad6680594dcc/rust/src/storage/config.rs#L163-L176

The comparison between the env key and the storage_options key is done on the lowercased version of the env key, resulting in bugs like the one @emanueledomingo reported.
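
The mismatch can be illustrated in plain Python (a hypothetical sketch of the linked Rust logic, not the actual implementation; `merge_env_into_options` is an invented name):

```python
def merge_env_into_options(storage_options: dict, environ: dict) -> dict:
    """Sketch of the buggy merge: env keys are lowercased before the
    membership check, so a user-supplied uppercase key never matches."""
    merged = dict(storage_options)
    for env_key, env_value in environ.items():
        lowered = env_key.lower()  # e.g. "aws_access_key_id"
        # The user passed "AWS_ACCESS_KEY_ID", so this lookup misses
        # and the environment value is injected anyway.
        if lowered not in merged:
            merged[lowered] = env_value
    return merged

opts = {"AWS_ACCESS_KEY_ID": "from_storage_options"}
env = {"AWS_ACCESS_KEY_ID": "WRONG_TOKEN"}
merged = merge_env_into_options(opts, env)
# Both keys are now present; a backend that reads the lowercase key
# picks up the environment value, shadowing the user's credential.
```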

giacomorebecchi commented 9 months ago

Given the above comment, a temporary workaround is to pass already-lowercased keys in storage_options.
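
Applied to the original repro, the workaround looks like this (credential values are placeholders):

```python
# Lowercase every storage_options key so it matches the lowercased env
# key in the contains_key check, preventing env values from being injected.
storage_options = {
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    "AWS_ACCESS_KEY_ID": "...",      # placeholder credentials
    "AWS_SECRET_ACCESS_KEY": "...",
    "AWS_SESSION_TOKEN": "...",
    "AWS_REGION": "...",
}
lowered_options = {k.lower(): v for k, v in storage_options.items()}
# Then pass lowered_options to dl.write_deltalake(..., storage_options=lowered_options)
```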

r3stl355 commented 8 months ago

Looks like it was introduced here: https://github.com/delta-io/delta-rs/commit/02b3cea58505212454509828660dd534eb2ba6e6.

A possible fix would be to use a case-insensitive hash map, unless there is a scenario where key case is important (I don't see one). If a case-insensitive hash map is not a good idea, the logic in the `with...` methods could instead loop through the keys performing a case-insensitive comparison rather than using `contains_key`.
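
The second variant (the actual fix would live in the Rust config code; this is a Python sketch with an invented helper name) amounts to:

```python
def contains_key_ci(options: dict, key: str) -> bool:
    """Case-insensitive membership test: a user-supplied
    "AWS_ACCESS_KEY_ID" matches an env-derived "aws_access_key_id",
    so the environment value is not injected over the user's."""
    return any(k.lower() == key.lower() for k in options)

opts = {"AWS_ACCESS_KEY_ID": "user_value"}
contains_key_ci(opts, "aws_access_key_id")  # True, env value is skipped
```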

roeap commented 8 months ago

I haven't checked the codebase fully, but did some work on that recently. Specifically, the S3 code is still quite messy. IIRC the priority issue should actually be that, when multiple credentials are available, the priority is based on the order in which object_store checks credentials.

Unfortunately we cannot just try to get a credential, since there are configuration-less credentials (i.e. the metadata endpoint), which means the logic to check whether there is a valid credential without the environment is a bit more involved. I tried once, but ended up "guessing" too much what the user's intent may be, while also not being too familiar with how AWS does things.

That said, we definitely need to do some work on the S3 config...

dispanser commented 6 months ago

removed, different issue, unrelated

NikoJ commented 3 months ago

This error no longer occurs for me with deltalake 0.16.1.

emanueledomingo commented 3 months ago

Yes, I confirm that the error also does not occur on 0.16.0. This issue can be closed.