duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

Is it possible to reference tables as parquet files in iceberg on glue via external source? #184

dioptre opened this issue 1 year ago (status: Open)

dioptre commented 1 year ago

I see a method for writing to Glue, but I can't find a reference for getting the data back out.

Please help!

jwills commented 1 year ago

Mmm, I've started working on Iceberg as a source (using an AWS Glue catalog) via a plugin mechanism here: https://github.com/jwills/dbt-duckdb/blob/master/dbt/adapters/duckdb/plugins/iceberg.py, which is then used like this: https://github.com/jwills/dbt-duckdb/blob/master/tests/functional/plugins/test_iceberg.py

...but it's still pretty early/experimental. Can you tell me a bit more about what you're up to, and I'll see if I can figure out a way to help?
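
For context, here's roughly the shape such a plugin takes. This is a hypothetical sketch, not the contents of the linked file: the BasePlugin base class, its import path, and the iceberg_table key are assumptions modeled on how dbt-duckdb plugins are wired up, while the pyiceberg calls (load_catalog, load_table, scan().to_arrow()) are real pyiceberg API:

    # Hypothetical sketch of an Iceberg source plugin for dbt-duckdb.
    from pyiceberg import catalog

    from dbt.adapters.duckdb.plugins import BasePlugin  # assumed import path

    class Plugin(BasePlugin):
        def initialize(self, config):
            # Load the pyiceberg catalog named in the profile's plugin config
            # (e.g. an AWS Glue catalog configured via pyiceberg's settings).
            self._catalog = catalog.load_catalog(config.get("catalog", "default"))

        def load(self, source_config):
            # The source's meta says which Iceberg table to read;
            # "iceberg_table" is an assumed config key here.
            table = self._catalog.load_table(source_config.get("iceberg_table"))
            # Materialize the latest snapshot as a PyArrow table, which
            # dbt-duckdb can register directly as a DuckDB relation.
            return table.scan().to_arrow()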

dioptre commented 1 year ago

Yes, we have a database schema that we pipe to Glue, and we want to do some testing/analysis on it.

The database updates every few hours with an Iceberg merge, so we just need the latest snapshot to pipe into DuckDB, and then we want to run dbt transformations on it.

jwills commented 1 year ago

Okay, so in theory this will work:

  1. pip3 install 'pyiceberg[glue,s3fs,pyarrow]'
  2. pip3 install git+https://github.com/jwills/dbt-duckdb.git (because I haven't cut a release with this functionality enabled yet)
  3. Set up the environment variables (or config file) for pyiceberg as specified here: https://py.iceberg.apache.org/configuration/ (see the config sketch after this list)
  4. Configure the dbt profiles.yml with the iceberg plugin enabled, which would look something like this:
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: jaffle_shop.duckdb
      threads: 4
      plugins:
        - module: iceberg
          config:
            catalog: default
  5. and then finally define a sources: entry in a dbt schema.yml file that references the iceberg plugin and specifies the catalog/name of the table to load via the iceberg_table config property, sort of like it's done here: https://github.com/jwills/dbt-duckdb/blob/master/tests/functional/plugins/test_iceberg.py#L8 (see the schema sketch after this list)
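
For step 3, pyiceberg reads its catalog settings from a ~/.pyiceberg.yaml file or from PYICEBERG_CATALOG__* environment variables. A minimal sketch for a Glue-backed catalog named default (matching the plugin config above) would be:

    # ~/.pyiceberg.yaml
    catalog:
      default:
        type: glue

    # ...or, equivalently, via the environment:
    export PYICEBERG_CATALOG__DEFAULT__TYPE=glue

AWS credentials and region are picked up from the usual AWS environment on top of this.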
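
And for step 5, the sources: entry would look something like the following sketch; the source and table names here are placeholders, and the exact meta keys should be checked against the linked test file:

    # models/schema.yml
    sources:
      - name: iceberg_source
        schema: main
        meta:
          plugin: iceberg
        tables:
          - name: my_table
            meta:
              iceberg_table: "my_glue_db.my_table"

A downstream model would then select from it as {{ source('iceberg_source', 'my_table') }}.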

...and yeah, in theory that should load the latest Iceberg snapshot in its entirety as a PyArrow table into DuckDB for subsequent transformation by the rest of the pipeline. (But again, this is super new and I haven't tried it for real; maybe also take a look at dbt-athena and see if that meets your needs, etc.)
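
If it helps with debugging, the same round trip can be sanity-checked outside of dbt in a few lines. This sketch uses placeholder catalog/table names and is roughly what the plugin does under the hood:

    import duckdb
    from pyiceberg.catalog import load_catalog

    # Load the Glue-backed catalog configured via pyiceberg's settings.
    catalog = load_catalog("default")

    # Read the latest snapshot of the Iceberg table into a PyArrow table.
    arrow_table = catalog.load_table("my_glue_db.my_table").scan().to_arrow()

    # Register the Arrow table with an in-memory DuckDB and query it.
    con = duckdb.connect()
    con.register("my_table", arrow_table)
    print(con.execute("SELECT count(*) FROM my_table").fetchone())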