dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0
400 stars 227 forks source link

Unable to make incremental model work #163

Closed unarmedcivilian closed 3 years ago

unarmedcivilian commented 3 years ago

Hi, I'm having trouble executing models with materialzed=incremental in dbt with Spark. I'm using EMR with Spark 2.4.5 dbt versions are all latest `dbt --version installed version: 0.19.1 latest version: 0.19.1

Up to date!

Plugins:

This is my config {{config( materialized='incremental', file_format='parquet', partition_by=['part_id'] ) }}

I am using custom schemas (ignoring the schema in ~/.dbt/profiles.yml) using the get_custom_schema macro {% macro generate_schema_name(schema_name,node) -%} {{ generate_schema_name_for_env(schema_name,node) }} {%- endmacro %}

On each run, dbt tries to create the table again even if it already exists, and the run fails.

2021-05-03 14:54:52.027792 (MainThread): Database Error 2021-05-03 14:54:52.031598 (MainThread): org.apache.spark.sql.AnalysisException: Table some_schema.some_table already exists. You need to drop it first.;

I suspect it might be due to adapter.get_relation not being able to find the schema, I checked earlier issues and found this (https://github.com/fishtown-analytics/dbt-spark/pull/42) but it doesn't seem to help my case.

jtcohen6 commented 3 years ago

Hey @unarmedcivilian! I think you probably have the right instinct here:

I suspect it might be due to adapter.get_relation not being able to find the schema

This would explain why dbt thinks it needs to create the table, even if it already exists. Is there any way you could try running the same metadata query as dbt (show table extended in show_schema like '*') to see if your table appears there?

I haven't been able to reproduce this, but I also don't have an EMR cluster to test against, just the dockerized Spark2 cluster available in this repo. All the same, I'll include reproduction details below in case it's helpful.

I include the generate_schema_name macro as you do above, and then define my model incremental like so:

{{ config(
    materialized='incremental',
    file_format='parquet',
    partition_by=['part_id'],
    schema='some_schema'
) }}

select 1 as part_id, true as is_that_so
union all
select 2 as part_id, false as is_that_so

I run dbt twice with no problem. Here's the SQL it runs the first time (from logs/dbt.log, with some whitespace cleanup):

create table some_schema.incremental
    using parquet
    partitioned by (part_id)
    as
select 1 as part_id, true as is_that_so
union all
select 2 as part_id, false as is_that_so

And the second:

create temporary view incremental__dbt_tmp as
select 1 as part_id, true as is_that_so
union all
select 2 as part_id, false as is_that_so;

insert into table some_schema.incremental
    select `is_that_so`, `part_id` from incremental__dbt_tmp
unarmedcivilian commented 3 years ago

Thanks for the detailed response @jtcohen6 . Sorry for not getting back earlier, the query show table extended in show_schema like '*' did not return anything. However it worked with a newer release version of EMR, and I am using ephemeral clusters for running the job so I've decided to just upgrade.