duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

Cannot read external file from S3 private bucket #371


Aaron-Zhou commented 3 months ago
  1. Create a private bucket in AWS S3 and set the environment variables.
  2. Run dbt run -s ./models.
  3. Got the error: Python model failed: HTTP Error: HTTP GET error on 'https://bucket.s3.amazonaws.com/ms/distribution/csv/orders.csv' (HTTP 403)
  4. My code:

profiles.yml:

default:
  outputs:
    dev:
      type: duckdb
      path: db/dbt.duckdb
      extensions:
        - httpfs
        - parquet
        - aws
      settings:
        s3_region: "us-east-1"
        s3_access_key_id: "{{ env_var('S3_KEY') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET') }}"
  target: dev
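
For comparison, recent dbt-duckdb releases also support a secrets entry in the profile, which maps onto DuckDB's secrets manager rather than the legacy s3_* settings. A minimal sketch, assuming the same S3_KEY / S3_SECRET environment variables:

default:
  outputs:
    dev:
      type: duckdb
      path: db/dbt.duckdb
      extensions:
        - httpfs
        - parquet
      secrets:
        - type: s3
          region: us-east-1
          key_id: "{{ env_var('S3_KEY') }}"
          secret: "{{ env_var('S3_SECRET') }}"
  target: dev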

sources.yml:

sources:
  - name: csv
    meta:
      external_location: "s3://bucket/ms/distribution/csv/{name}.csv"
    tables:
      - name: orders
      - name: customers
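
For reference, dbt-duckdb fills the {name} placeholder in external_location with the table name, so each source reference becomes a direct read of the corresponding CSV. Roughly what dbt.source("csv", "orders") resolves to:

SELECT * FROM 's3://bucket/ms/distribution/csv/orders.csv';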

models.py:

def model(dbt, session):
    # Read the source CSVs from S3 as DuckDB relations.
    orders = (
        dbt.source("csv", "orders").filter("status='PENDING_PAYMENT'").set_alias("o")
    )
    customers = dbt.source("csv", "customers").set_alias("c")
    # Join pending orders to their customers on customer_id.
    pending_orders = orders.join(customers, "o.customer_id=c.id")
    return pending_orders
Some findings: if I persist the S3 secret to the DuckDB home folder ("~/.duckdb/stored_secrets/"), then dbt run works without error. It seems dbt-duckdb did not actually feed the S3 credentials to DuckDB.
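
For context, a secret can be persisted this way from a plain DuckDB session via the secrets manager; a sketch in which the secret name my_s3_secret and the credential values are placeholders:

INSTALL httpfs;
LOAD httpfs;
CREATE PERSISTENT SECRET my_s3_secret (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-east-1'
);
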
jwills commented 3 months ago

It would be surprising if the problem was that we weren't feeding the S3 credentials to DuckDB, given that so many people run with those settings (that is, I feel like I would have heard that complaint from lots of people if it were broken).

The most common problem I see with these settings is that the region isn't configured correctly (i.e., the data lives in an S3 bucket in a region different from the one defined in s3_region). Of course, if you have a way to make this work using the CREATE SECRET construct (which it sounds like you do), you can just run that in a macro configured via the on-run-start hook and be on your way; see the sketch below.
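
A sketch of that workaround, assuming a hypothetical macros/create_s3_secret.sql and the same environment variables as the profile above; CREATE OR REPLACE SECRET keeps the hook idempotent across runs:

macros/create_s3_secret.sql:

{% macro create_s3_secret() %}
  {% set sql %}
    CREATE OR REPLACE SECRET s3_secret (
        TYPE S3,
        KEY_ID '{{ env_var("S3_KEY") }}',
        SECRET '{{ env_var("S3_SECRET") }}',
        REGION 'us-east-1'
    );
  {% endset %}
  {% do run_query(sql) %}
{% endmacro %}

dbt_project.yml:

on-run-start:
  - "{{ create_s3_secret() }}"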

Aaron-Zhou commented 3 months ago

@jwills Thanks for your reply. I double-checked my s3_region and I am pretty sure the value is correct. I also did some testing: with the config below, writing files to the S3 bucket worked without the CREATE SECRET construct, which suggests my S3 settings are correct. I don't believe DuckDB itself has a problem reading files from S3; more likely, something in the dbt-duckdb adapter leaves the S3 settings improperly configured on the DuckDB connection, which breaks reads. Below is the config I used to write a table to external files on the S3 bucket. Again, writing worked without the CREATE SECRET construct, but reading did not.

models.py:

def model(dbt, session):
    # Materialize the result as partitioned Parquet on S3 and
    # register the output table in the Glue catalog.
    dbt.config(
        materialized="external",
        location="s3://bucket/parquet/duckdb-sample/pending_orders.parquet",
        format="parquet",
        options={"partition_by": "customer_id", "overwrite_or_ignore": True},
        glue_register=True,
        glue_database="duckdb_sample",
    )
    # Same read path as the failing model above.
    orders = (
        dbt.source("csv", "orders").filter("status='PENDING_PAYMENT'").set_alias("o")
    )
    customers = dbt.source("csv", "customers").set_alias("c")
    pending_orders = orders.join(customers, "o.customer_id=c.id")
    return pending_orders
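
One way to separate DuckDB from the adapter is a standalone check outside dbt, assuming the same S3_KEY / S3_SECRET environment variables and the bucket path from the error above:

import os
import duckdb

# Apply the same settings as the dbt profile, by hand.
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-east-1'")
con.execute("SET s3_access_key_id = '" + os.environ["S3_KEY"] + "'")
con.execute("SET s3_secret_access_key = '" + os.environ["S3_SECRET"] + "'")
# If this read succeeds, DuckDB and the credentials are fine, and the
# problem lies in how the adapter passes the settings through.
print(con.sql("SELECT count(*) FROM 's3://bucket/ms/distribution/csv/orders.csv'"))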