duckdb / duckdb

DuckDB is an analytical in-process SQL database management system
http://www.duckdb.org
MIT License
23.14k stars 1.84k forks source link

Cannot read TSV file from S3 #10779

Closed alexanderluiscampino closed 2 months ago

alexanderluiscampino commented 7 months ago

What happens?

Simply trying to read a TSV file from S3, following examples from the official docs.

Getting:

Traceback (most recent call last):
  File "/workspaces/duckdb-sandbox/main.py", line 5, in <module>
    with duckdb.connect("main", read_only=False) as conn:
  File "/workspaces/duckdb-sandbox/main.py", line 15, in <module>
    conn.execute(f"""
duckdb.duckdb.HTTPException: HTTP Error: HTTP GET error on 'https://the-bucket-dev.s3.amazonaws.com/effingham_cama_export.tsv' (HTTP 400)

First I don't understand the error message and how the S3 URI get's changed in there, might be nothing, might be something. 400 error, sure AWS SDK is terrible at errors, so I don't even know where to start.

I did do a aws s3 ls the-bucket-dev and produced the correct listing, which means the creds are accessible.

I checked the extension httpfs, it is loaded.

I am unsure how to check if duckdb is correctly resolving the credentials for AWS. Also, not sure if there's a debug mode on DuckDB, to see what else can be the issue, by having more information.

I have tried with multiple S3 URIs, same outcome. Also, it takes ~30 seconds for that error to emerge, which leads me to think it is trying to do something.

To Reproduce

import duckdb
s3_uri ="s3://the-bucket-dev/effingham_cama_export.tsv"

with duckdb.connect("main", read_only=False) as conn:
    conn.execute("INSTALL httpfs;")
    conn.execute("LOAD httpfs;")
    conn.execute("""
        CREATE SECRET secret (
            TYPE S3,
            PROVIDER CREDENTIAL_CHAIN,
            REGION 'us-west-2'
        );""")
    conn.execute(f"""
        CREATE TABLE my_table AS
        SELECT * FROM read_csv('{s3_uri}', header=true, sep = '\t') LIMIT 10
    """)

OS:

Linux

DuckDB Version:

0.10.0

DuckDB Client:

Python

Full Name:

Alex Campino

Affiliation:

Zillow

Have you tried this on the latest nightly build?

I have tested with a nightly build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

alexanderluiscampino commented 7 months ago

Update: setting the creds directly in the secret, instead of the provider chain works.

access_key = os.getenv("AWS_ACCESS_KEY_ID")
secret_key = os.getenv("AWS_SECRET_ACCESS_KEY")
session_token = os.getenv("AWS_SESSION_TOKEN")

with duckdb.connect("main", read_only=False) as conn:
    conn.execute("INSTALL httpfs;")
    conn.execute("LOAD httpfs;")
    conn.sql(f"""
        CREATE SECRET secret1 (
            TYPE S3,
            KEY_ID '{access_key}',
            SECRET '{secret_key}',
            SESSION_TOKEN '{session_token}',
            REGION 'us-west-2' );
        """
    )
    conn.sql(f"""
        SELECT * FROM read_csv('{s3_uri}', header=true, sep = '\t')
        LIMIT 10
    """).show()
JRocki commented 7 months ago

I'm ran into something similar. Seems that the provider chain is broken. I posted in Discussions and ned2 posted a solution here:

hey @JRocki, my team ran into what looks like this problem. it looks like the region isn't getting added into the auto-generated S3 endpoint URI (even when this is specified in the REGION parameter of the CREATE SECRET command). we found we could workaround this by explicitly configuring the ENDPOINT using the S3 URI for our region:

CREATE SECRET (
    TYPE S3, 
    PROVIDER CREDENTIAL_CHAIN, 
    PROFILE '<name_of_your_profile>',
    ENDPOINT 's3.<your-region>.amazonaws.com'
)

This does work for me if I remove the PROFILE param.

alexanderluiscampino commented 7 months ago

ah! I can confirm that this workaround works:

conn.execute(f"""
        CREATE SECRET secret (
            TYPE S3,
            PROVIDER CREDENTIAL_CHAIN,
            REGION '{AWS_REGION}',
            ENDPOINT 's3.{AWS_REGION}.amazonaws.com'
        );""")
samansmink commented 7 months ago

@alexanderluiscampino This is a known issue which i have a fix for, but need to fix our deploy pipeline first before I can push it to the aws extension.

thanks @JRocki!

samansmink commented 7 months ago

With https://github.com/duckdb/duckdb_aws/pull/32 some of the endpoint issues should be resolved. Reinstalling the aws extension with: force install aws should give you the updated binary

This issue seems to be slightly different though, but with latest binary I can not reproduce with following setup:

Config ~/.aws/credentials

[default]
aws_access_key_id=<redacted>
aws_secret_access_key=<redacted>

[test1]
aws_access_key_id=<redacted>
aws_secret_access_key=<redacted>

~/.aws/config

[default]
region=eu-west-1

[profile test1]
region=eu-west-1

DuckDB Both

CREATE SECRET s1 (
    TYPE s3,
    PROVIDER credential_chain
)

and

CREATE SECRET s1 (
    TYPE s3,
    PROVIDER credential_chain,
    PROFILE 'test1'
)

seem to work and set the correct region for me and succeed to authenticate

please let me know if the issue is resolved now, or a reproduction if it still persists!

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] commented 2 months ago

This issue was closed because it has been stale for 30 days with no activity.