databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Randomly fails when writing a parquet table to an external location #771

Open sugendran opened 2 months ago

sugendran commented 2 months ago

Describe the bug

We've started getting these errors in our dbt pipeline with no changes to the pipeline itself. It had been running for a while without any problems. Not sure how to debug this.

01:48:16    Runtime Error in model dim_suppliers (models/marts/core/dim_suppliers.sql)
  CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory s3://ordermentum-data/publish/production/core/dim_suppliers. To allow overwriting the existing non-empty directory, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true.

The actual table it fails on changes with each run.

Our config is:

+materialized: table
+file_format: parquet
+location_root: "{{ env_var('publishLocation', 's3://ordermentum-data/publish/dev') ~ '/core' }}"
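
For context, with location_root set, dbt-databricks builds each table model as a CTAS with an explicit LOCATION of location_root plus the model name, which is where the path in the error above comes from. A rough sketch of the generated DDL (the schema name and select body are illustrative placeholders, not the exact compiled SQL):

  -- Approximate shape of the statement dbt runs for dim_suppliers with the
  -- config above; 'analytics' and the select are placeholders.
  CREATE TABLE analytics.dim_suppliers
  USING PARQUET
  LOCATION 's3://ordermentum-data/publish/production/core/dim_suppliers'
  AS SELECT * FROM analytics.stg_suppliers;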

System information

The output of dbt --version:

taken from the query run in the SQL warehouse:

"app": "dbt", "dbt_version": "1.8.5", "dbt_databricks_version": "1.8.5", "databricks_sql_connector_version": "3.1.2",
benc-db commented 2 months ago

Recommend filing a ticket with Databricks to understand your options. I suspect there has been a Databricks Runtime change. If you are using an all-purpose (AP) cluster, you can set this config on the cluster, but if you are using a SQL Warehouse, you might need to take a different approach.
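
If the flag is an acceptable workaround, on an all-purpose cluster it can be set in the cluster's Spark config (spark.sql.legacy.allowNonEmptyLocationInCTAS true); at the session level the same thing can be attempted with a SQL SET, for example as a dbt pre-hook on the affected models. This is a sketch only, and a SQL Warehouse may not allow legacy Spark confs to be changed:

  -- Session-level override (e.g. via a dbt pre-hook); may be rejected on a
  -- SQL Warehouse where legacy Spark confs are not configurable.
  SET spark.sql.legacy.allowNonEmptyLocationInCTAS = true;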

sugendran commented 2 months ago

Is there a way I can wrap the drop table and create table in a transaction?

benc-db commented 2 months ago

Unfortunately, there is no transaction support in Databricks at this time. I think the core issue, though, is that Databricks treats external locations as unmanaged: if you have files in an external location, Databricks does not delete them, and the failure exists because it is unpredictable what will happen when writing to a location that already has files. If the only files there are the ones written by this process, that is another reason to file a Databricks ticket, to understand what has changed in the runtime to disrupt your workflow.
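
To make that concrete, a sketch assuming the table is unmanaged because of location_root (same placeholder names as above): dropping it only removes the catalog entry, so the files stay behind and the next CTAS into the same path fails.

  -- Removes only the table metadata; the parquet files remain at
  -- s3://ordermentum-data/publish/production/core/dim_suppliers.
  DROP TABLE IF EXISTS analytics.dim_suppliers;
  -- Re-running the CTAS shown earlier against this still non-empty directory
  -- then fails with the error from this issue.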

sugendran commented 1 month ago

@benc-db any reason this is just for delta? https://github.com/databricks/dbt-databricks/blob/main/dbt/include/databricks/macros/materializations/table.sql#L20

If it also applied the same logic to parquet files, then we could use CREATE OR REPLACE to create the table.
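
For concreteness, the statement that suggestion would amount to for this model might look like the following (hypothetical select body; as far as I know Databricks only supports OR REPLACE for Delta tables, which is what the reply below gets at):

  -- Hypothetical: what a parquet model would emit if the macro used
  -- CREATE OR REPLACE instead of drop-then-create. Databricks may reject
  -- OR REPLACE for non-Delta formats, and even if accepted, replacing an
  -- unmanaged table may not clear the existing files at the location.
  CREATE OR REPLACE TABLE analytics.dim_suppliers
  USING PARQUET
  LOCATION 's3://ordermentum-data/publish/production/core/dim_suppliers'
  AS SELECT * FROM analytics.stg_suppliers;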

benc-db commented 1 month ago

@sugendran that line exists because a delta table cannot 'create or replace' a non-delta table, so the old table needs to be dropped first. With parquet in an external location, I'm not sure whether this would fix the issue. Have you tried it? My instinct is that since the storage is external (i.e. not managed), dropping the table will still leave the existing files behind, yielding the same issue as above.