duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

Intermediate Relation Fails with External Materialization with Partitions #232

Open jesseginsberg opened 11 months ago

jesseginsberg commented 11 months ago

Hello, I'm running into an issue when using the external materialization with the "PARTITION_BY" option.

The dbt run proceeds fine through the part where it writes the partitioned files out to the file system, but it fails at the step where it creates the intermediate_relation view within DuckDB. I believe the relevant section is here: https://github.com/jwills/dbt-duckdb/blob/master/dbt/include/duckdb/macros/materializations/external.sql#L51-L55.

My dbt model called test_model looks something like:

{{ config(
    materialized='external', 
    location="~/dbt_duckdb/local_partitioned_out",
    options={"PARTITION_BY": "date"}
) }}

select 
  date,
  other_columns 
from {{ ref('upstream_model') }}

Looking at the logs, after dbt writes out the partitioned files it tries to create a view like this one:

create or replace view "db"."main"."test_model__dbt_int" as (
        select * from '~/dbt_duckdb/local_partitioned_out'
    );

But that CREATE VIEW query fails. I believe this intermediate relation expects to be pointing at a single file. Because there are multiple files broken out across multiple directories, I think the query should instead be something like:

create or replace view "db"."main"."test_model__dbt_int" as (
        select * from read_parquet('~/dbt_duckdb/local_partitioned_out/**')
    );

Please let me know if my description is clear or if there's any additional info that would be helpful. Thanks!

jwills commented 11 months ago

@jesseginsberg this might seem stupid, but does it work if you write the option as (lowercase) partition_by instead of the uppercase PARTITION_BY?

The logic we use for constructing the external read location is defined here, and it's case-sensitive about the config option name it looks for: https://github.com/jwills/dbt-duckdb/blob/master/dbt/adapters/duckdb/impl.py#L136
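
In other words, the model config from the original report would become something like this (a sketch, changing only the case of the option key):

{{ config(
    materialized='external',
    location="~/dbt_duckdb/local_partitioned_out",
    options={"partition_by": "date"}
) }}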

jesseginsberg commented 11 months ago

@jwills well that was an easy solve 😄. Yes, when I use partition_by instead of PARTITION_BY everything runs successfully for me.

Thanks for the quick response!

harshil4076 commented 5 months ago

Hi! Stumbled upon this issue when searching for partition_by, and it helped me solve my issue as well. Thanks! Now I can create new partitioned parquet files, and I also want to update them periodically. I am interested in using COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month), OVERWRITE_OR_IGNORE, FILENAME_PATTERN "orders_{i}"). Does the current dbt-duckdb version support that?

jwills commented 5 months ago

IIRC you can specify any set of key-value pairs you like in the options dict, and we will copy them over to the clause we append to the COPY <tbl> TO '<filepath>' ... expression; you might need to be a bit careful with the filename_pattern one, though, because it's not in the set of known options that we quote for you automatically.
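
So a config along these lines might work (a sketch; the embedded single quotes around the filename_pattern value are a manual workaround for the quoting caveat above, and the multi-column partition_by value is an assumption about how the option gets rendered into the COPY clause):

{{ config(
    materialized='external',
    location="orders",
    options={
        "partition_by": "year, month",
        "filename_pattern": "'orders_{i}'"
    }
) }}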

jwills commented 5 months ago

Any boolean options (like overwrite_or_ignore) will need a dummy true value too (like 1) to work correctly. You can see the "logic" we employ for this here: https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/impl.py#L108
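
Putting the two comments together, the options dict for the COPY in the question above might look like this (a sketch; the 1 is the dummy true value for the boolean option):

options={
    "partition_by": "year, month",
    "overwrite_or_ignore": 1,
    "filename_pattern": "'orders_{i}'"
}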

harshil4076 commented 5 months ago

Thanks @jwills. options={"partition_by": "date", "overwrite_or_ignore": 1} worked.