databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Cache error for snapshots on top of parquet files #386

Open matt-winkler opened 1 year ago

matt-winkler commented 1 year ago

Describe the bug

When running a dbt snapshot on top of an underlying parquet data source, a cache error can occur, in particular when columns are added or removed at the source. Note that this environment is NOT running Unity Catalog yet. I'm not sure whether that has an impact, but it feels relevant to mention.
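
For context, here is a minimal sketch of the kind of setup involved (this is not the actual project code; the table, schema, and column names are illustrative, loosely based on the path in the error log below). The snapshot target itself is written as Delta, the dbt-databricks default, while the source is an external table backed by parquet files in S3:

```sql
-- snapshots/course_subcategory_snapshot.sql (illustrative names)
-- The snapshot target is written as Delta (the dbt-databricks default);
-- the table selected from below is an external table over parquet files in S3.
{% snapshot course_subcategory_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='check',
        check_cols='all'
    )
}}

-- in the real project this would be a source() reference defined in sources.yml
select * from sd_classification.course_subcategory

{% endsnapshot %}
```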

Steps To Reproduce

NOTE: If dbt snapshot is run a THIRD time, it works. That makes the error message's suggestion of "restarting the cluster" a bit hard to understand, because a restart doesn't seem to be strictly necessary.

Expected behavior

Snapshot works on the second run.

Screenshots and log output

Error while reading file s3://udemy-sd-classification/sd_classification/course_subcategory/part-00000-850de77d-0923-422b-8607-ae6f83e9a29e-c000.snappy.parquet. [DEFAULT_FILE_NOT_FOUND] It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
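
The error message itself points at a possible mitigation: explicitly invalidating Spark's cached file listing for the source table. A minimal sketch of that (table name is illustrative; as the later comment explains, whether this actually helps depends on when the underlying files are rewritten relative to the snapshot's MERGE):

```sql
-- Run against the parquet source before snapshotting, either manually or via
-- a dbt pre_hook / on-run-start hook; the table name here is illustrative.
REFRESH TABLE sd_classification.course_subcategory;
```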

System information

The output of dbt --version:

1.5.5

The operating system you're using: dbt Cloud

benc-db commented 10 months ago

Have you tried with Delta (just to help scope the bug)? I'm unable to reproduce the error you're seeing, but then again, I'm trying to recreate it in the dbt test harness. I get a different error instead, related to a missing column, and I get it regardless of file format.

ktopcuoglu commented 7 months ago

Hey @benc-db, we don't see the error on Delta sources, and it's not a column modification issue; the table metadata stays the same. Here is the whole story.

Steps To Reproduce

  1. Create a source on a parquet table in cloud storage / the lake (ParquetTable1).
  2. Run dbt snapshot (just to initially create the snapshot table).
  3. Run dbt snapshot again (a simplified sketch of these two statements follows below):
     a. the dbt-snapshot materialization creates the view ParquetTable1__dbt_tmp
     b. the dbt-snapshot materialization runs a MERGE statement, with target ParquetTable1_snapshot and source ParquetTable1__dbt_tmp (the view)
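
For readers less familiar with the snapshot materialization, this is roughly the shape of what 3.a and 3.b run (a simplified sketch, not the exact SQL dbt-databricks generates; the unique key column and scd-id expression are illustrative):

```sql
-- 3.a: stage the current source rows in a view
CREATE OR REPLACE VIEW ParquetTable1__dbt_tmp AS
SELECT *,
       md5(CAST(id AS STRING)) AS dbt_scd_id  -- dbt derives this from the unique_key
FROM ParquetTable1;

-- 3.b: merge the staged rows into the (Delta) snapshot table.
-- The view only stores the query: the parquet files behind ParquetTable1 are
-- listed and read when the MERGE executes, possibly from a cached file listing,
-- so files replaced between 3.a and 3.b can surface as DEFAULT_FILE_NOT_FOUND.
MERGE INTO ParquetTable1_snapshot AS tgt
USING ParquetTable1__dbt_tmp AS src
  ON tgt.dbt_scd_id = src.dbt_scd_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```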

Between 3.a and 3.b, another process (not in Databricks) re-populates the table ParquetTable1 (with Spark SQL, insert-overwrite). The underlying S3 parquet files change, as expected. The issue will be a bit hard to reproduce because the S3 files have to be modified/removed exactly between 3.a and 3.b, and that window is only a few seconds :)
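
For completeness, a sketch of what that external rewrite looks like and why it breaks the read (the actual job isn't shown in this issue; the staging table name is hypothetical):

```sql
-- Run by the external (non-Databricks) process; the SELECT source is hypothetical.
-- INSERT OVERWRITE rewrites the parquet files under ParquetTable1's location,
-- so a plan that already resolved the old file names (the MERGE in 3.b reading
-- the __dbt_tmp view) fails with DEFAULT_FILE_NOT_FOUND until the cached
-- file listing is refreshed.
INSERT OVERWRITE TABLE ParquetTable1
SELECT * FROM upstream_staging_table;
```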

More context:

cc/ @matt-winkler

github-actions[bot] commented 1 month ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.