duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

Unable to query tables in DuckDB file created by DBT models #372

Open JRocki opened 3 months ago

JRocki commented 3 months ago

I am not sure if this is intentional or if there's something on my end, but I am unable to query tables in the DuckDB file that are created during the dbt run. I am using AWS S3/parquet as my 'data lake'. I'm on duckdb v0.10.0 and dbt-duckdb v1.7.2. Here's my profiles.yml:

db:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ./db.duckdb
      extensions:
        - httpfs
        - parquet
      use_credential_provider: aws
      settings:
        s3_region: us-west-1

The pipeline runs successfully, and I get all the expected parquet files in S3. The curious issue is with the DuckDB file.

For example, I have a model that creates a table db.main.my_model in db.duckdb. I also run a dbt test on this model (it checks that values in a column are unique), which creates a table db.main_dbt_test__audit.unique_column_id. I then like to use python/duckdb to query the tables directly when I'm debugging or exploring, but querying db.main.my_model errors:

import duckdb

con = duckdb.connect('db.duckdb')
df = con.query(
    """
    select *
    from db.main.my_model
    """
)

yields: duckdb.duckdb.HTTPException: HTTP Error: HTTP GET error on 'https://my-bucket.s3.amazonaws.com/pipeline/my_model.parquet' (HTTP 403)

However, if I run:

import duckdb

con = duckdb.connect('db.duckdb')
df = con.query(
    """
    select *
    from db.main_dbt_test__audit.unique_column_id
    """
)

I am able to see the data. Is there something inherently different about tables created via dbt models? How does the duckdb file know to point to S3? I think the HTTP 403 is a red herring, because the dbt pipeline runs successfully, meaning it is able to connect to AWS. I'd really appreciate any thoughts on this, thanks!

jwills commented 3 months ago

The rub here is that the created DuckDB file won't have the S3 access key/secret present, so it can't read the tables that are backed by files in S3 (the results of tests are just regular DuckDB tables, which is why you can query them). You should be able to read the S3-backed tables by loading the aws extension and running CALL load_aws_credentials(); to load your credentials.
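That distinction can be reproduced locally — this is a sketch of the mechanism, assuming (per dbt-duckdb's external materializations) that an S3-backed model is registered in the catalog as a view over the parquet file, while a test result is a plain table stored inside the database file. A local path stands in for the s3:// URL:

import duckdb, os, tempfile

# Local stand-in for the S3-backed model: write a parquet file, then
# register a VIEW over it, which is what an external model looks like
# in the catalog. A dbt test result, by contrast, is a base table.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "my_model.parquet")

con = duckdb.connect()  # in-memory stand-in for db.duckdb
con.execute("CREATE TABLE src AS SELECT * FROM range(3) t(id)")
con.execute(f"COPY src TO '{path}' (FORMAT PARQUET)")
con.execute(f"CREATE VIEW my_model AS SELECT * FROM read_parquet('{path}')")

# The view re-reads the file on every query; with an s3:// path instead
# of a local one, the connection must hold valid credentials at query
# time, which a fresh duckdb.connect('db.duckdb') does not.
print(con.execute("SELECT count(*) FROM my_model").fetchone()[0])  # 3

So the 403 is the view trying to re-fetch the parquet file from S3 without credentials, not a problem with the file itself.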

To avoid needing to do this every time, you can use the new CREATE SECRET construct via a dbt macro, and DuckDB will store the credentials in a special . directory in your home directory: https://duckdb.org/docs/sql/statements/create_secret.html
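A minimal sketch of the CREATE SECRET approach from the Python side (the secret name my_s3 and the credential values are placeholders, not real keys; DuckDB may auto-install the httpfs extension the first time an S3 secret is created):

import duckdb

# Persist placeholder S3 credentials so that any later connection,
# including a plain duckdb.connect('db.duckdb'), can read s3:// paths.
con = duckdb.connect()
con.execute("""
    CREATE OR REPLACE PERSISTENT SECRET my_s3 (
        TYPE S3,
        KEY_ID 'my_key_id',
        SECRET 'my_secret_key',
        REGION 'us-west-1'
    )
""")
# Persistent secrets are written under ~/.duckdb and picked up by
# every future connection, so no per-session load step is needed.
print(con.execute("SELECT name FROM duckdb_secrets()").fetchall())

In a dbt project the same CREATE SECRET statement could be issued from an on-run-start hook or macro, so the credentials exist before any model runs.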

JRocki commented 3 months ago

@jwills That makes sense, thank you!