duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

Using use_credential_provider: aws with instance profiles gives HTTP error 400 #226

Open sacundim opened 1 year ago

sacundim commented 1 year ago

When trying to use the aws target in the linked profile, either from an ECS container or from an EC2 instance that's known to have the correct permissions, we nevertheless get an HTTP 400 error:

05:37:56  Runtime Error in model biostatistics_deaths (models/biostatistics/staging/biostatistics_deaths.sql)
05:37:56    HTTP Error: HTTP GET error on '/?encoding-type=url&list-type=2&prefix=biostatistics.salud.pr.gov%2Fdeaths%2Fparquet_v2%2F' (HTTP 400)

But if I instead configure it this way on the same EC2 instance, with credentials obtained from aws sts get-session-token, it works:

    aws:
      type: duckdb
      extensions:
        - httpfs
        - parquet
      threads: 4
      external_root: "{{ env_var('OUTPUT_ROOT') }}"
      settings:
        s3_region: us-west-2
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
        s3_session_token: "{{ env_var('S3_SESSION_TOKEN') }}"
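
For context, a minimal sketch of how those three env vars can be populated from an STS session token, assuming boto3 is available and long-term credentials are configured (get-session-token requires them); the variable names match the profile above:

    import os

    import boto3

    # equivalent of `aws sts get-session-token`
    sts = boto3.client("sts")
    token = sts.get_session_token()["Credentials"]

    # export under the names the profile above reads via env_var()
    os.environ["S3_ACCESS_KEY_ID"] = token["AccessKeyId"]
    os.environ["S3_SECRET_ACCESS_KEY"] = token["SecretAccessKey"]
    os.environ["S3_SESSION_TOKEN"] = token["SessionToken"]
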
jwills commented 1 year ago

huh, k-- these kinds of issues are very hard for me to debug since I don't have ready access to the environment in question. I think the best bet here is to open a Python shell or run a simple script that calls the relevant function in dbt-duckdb (which is defined here) and see if we can deduce where the error is coming from, e.g.:

import dbt.adapters.duckdb.credentials as creds

# call the loader directly to see what it resolves (or where it raises)
creds._load_aws_credentials()

sacundim commented 1 year ago

...working on it. I've put together a simple Docker image to try out your approach; now I've got to get it running in AWS Batch to do the real deal.

sacundim commented 1 year ago

Running on Batch prints out a dict with the keys s3_access_key_id, s3_secret_access_key, s3_session_token and s3_region. The values are sensitive, so I obviously can't share them. I did launch DuckDB 0.8.1 manually outside of AWS, ran the corresponding SET statements, and I can query from there, so the problem is somewhere in between. I'll try to extend my Python program to test more of the bits between those two working parts.

sacundim commented 1 year ago

I tried the following inside a Fargate container:

import dbt.adapters.duckdb.credentials as creds
import duckdb

# resolve credentials the same way the adapter does
credentials = creds._load_aws_credentials()
print(f'credentials keys = {credentials.keys()}')

connection = duckdb.connect()
cursor = connection.cursor()

# load the S3/HTTP filesystem extension
cursor.execute('INSTALL httpfs')
cursor.execute('LOAD httpfs')

# apply each resolved credential as a DuckDB setting
for key, value in credentials.items():
    cursor.execute(f"SET {key} = '{value}'")

...and then ran a query like the one my dbt project gets the error for, and it works fine. Maybe the adapter is doing something elsewhere that interferes with this? I looked at e.g. DuckDBConnectionWrapper but I can't spot anything untoward.
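
A query of roughly this shape would exercise that path, continuing the script above and reusing its cursor; the bucket and prefix here are hypothetical stand-ins for the real ones in the error message:

    # hypothetical path standing in for the real one from the failing model
    cursor.execute(
        "SELECT count(*) FROM read_parquet('s3://example-bucket/some/prefix/*.parquet')"
    ).fetchall()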

jwills commented 1 year ago

Hrm-- maybe related to this? https://github.com/duckdb/duckdb/issues/6563

sacundim commented 1 year ago

My apologies; it turns out my efforts failed to reproduce one element of the original failure: the jobs with the errors run in an ECS cluster with EC2 nodes, but my earlier reproduction attempts ran on Fargate.

I see this perhaps crucial difference:

  1. Under Fargate, the _load_aws_credentials() call returns four keys: s3_access_key_id, s3_secret_access_key, s3_session_token, and s3_region
  2. Under EC2, it returns only three keys; the s3_region is missing!

And I can reproduce the HTTP 400 outside of AWS by not setting the s3_region.
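
A minimal sketch of that out-of-AWS repro; the key values and bucket path are hypothetical, and the point is only that s3_region is never set:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")

    # set everything the EC2 code path returns, but deliberately skip s3_region
    con.execute("SET s3_access_key_id = 'AKIA...'")    # hypothetical
    con.execute("SET s3_secret_access_key = '...'")    # hypothetical
    con.execute("SET s3_session_token = '...'")        # hypothetical

    # with no region configured, this fails with the same HTTP 400 as above
    con.execute("SELECT * FROM read_parquet('s3://example-bucket/path/*.parquet')")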

jwills commented 1 year ago

Ah, good to know-- and nice detective work!

jwills commented 1 year ago

Thinking I should add some logging in that _load_aws_credentials function to note which keys were set via the STS token call (though obviously not the values), to help future folks track down these kinds of problems
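
A sketch of what that could look like from the caller's side, logging only the key names and never the values:

    import logging

    import dbt.adapters.duckdb.credentials as creds

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("dbt.adapters.duckdb")

    credentials = creds._load_aws_credentials()
    # key names are safe to log; the values are secrets and must not be
    logger.info("AWS credential provider set: %s", sorted(credentials.keys()))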

jwills commented 1 year ago

...and also that it's possible that this extension may run into some of the same issues: https://github.com/duckdblabs/duckdb_aws
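
For reference, a sketch of how that extension is exercised; load_aws_credentials is the entry point its README documents, and it resolves credentials through the AWS SDK's default provider chain:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL aws")
    con.execute("INSTALL httpfs")
    con.execute("LOAD aws")
    con.execute("LOAD httpfs")

    # applies the resolved credentials as s3_* settings and reports what it loaded
    print(con.execute("CALL load_aws_credentials()").fetchall())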

sacundim commented 1 year ago

I've just confirmed a working workaround for the issue:

      use_credential_provider: aws
      settings:
        # In theory this shouldn't be necessary:
        s3_region: "{{ env_var('S3_REGION') }}"
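
One way to populate that S3_REGION env var without hardcoding it, assuming boto3 is available, is to ask the SDK's own resolution chain (env vars, config file, and on newer botocore versions instance metadata):

    import boto3

    # None means boto3 could not determine a region either
    region = boto3.session.Session().region_name
    print(region)
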
sacundim commented 1 year ago

> ...and also that it's possible that this extension may run into some of the same issues: https://github.com/duckdblabs/duckdb_aws

Actually, I think we have a bug in the httpfs extension here. Its requests to the S3 endpoint fail with inscrutable errors in a scenario where other tools, most notably boto3 and the official AWS CLI, work fine. I wonder, e.g., if it's sending an empty string for the region when it's supposed to either send none or send a valid one.
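
A quick diagnostic for that hypothesis: DuckDB's built-in duckdb_settings() table function shows the current value of every S3 setting, including whether s3_region is empty rather than unset:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")

    # inspect what the session actually holds for the S3 settings
    print(con.execute(
        "SELECT name, value FROM duckdb_settings() WHERE name LIKE 's3%'"
    ).fetchall())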