delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

Delta Stream overwrite possibility #1231

Open Iabhishekkothari opened 2 years ago

Iabhishekkothari commented 2 years ago

I'm using a Spark stream to append data to a Delta table, but I need only the latest data (the data from the latest file received in each partition). Since streaming doesn't support overwrite, is there a workaround? Can we keep only the latest files in each partition and vacuum the rest? Is there any way to use overwrite in a Spark stream? Urgent help needed.

allisonport-db commented 1 year ago

Sorry for the delay. Spark streaming doesn't support overwrite mode explicitly. To be clear, you want to retrieve/save only the data from the latest committed file on a per-partition basis, correct?

For overwriting on a per-partition basis, we are releasing support for dynamic partition overwrite in Delta 2.0, which allows you to selectively overwrite only the partitions that have data being written into them. However, we recommend using it cautiously and validating which partitions your data touches, to avoid unintentional data loss. To do this more safely, you can use `replaceWhere`, which overwrites only the data matching an explicit predicate.

You can then use `foreachBatch` to perform the overwrite on each micro-batch.
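
As a rough illustration of the `foreachBatch` plus `replaceWhere` suggestion, here is a minimal PySpark sketch. It is not an official recipe. The partition column `event_date`, the paths, and the helper names are all hypothetical, and it assumes partition values are plain strings without quote characters.

```python
from typing import Iterable

# Hypothetical names, chosen only for illustration.
PARTITION_COL = "event_date"
TARGET_PATH = "/tmp/delta/latest_only"
CHECKPOINT_PATH = "/tmp/delta/latest_only_checkpoint"


def build_replace_where(column: str, values: Iterable[str]) -> str:
    """Build a predicate that matches exactly the given partition values.

    Assumes the values contain no single quotes (no escaping is done).
    """
    quoted = ", ".join(f"'{v}'" for v in sorted(set(values)))
    return f"{column} IN ({quoted})"


def overwrite_touched_partitions(batch_df, batch_id: int) -> None:
    """foreachBatch handler: replace only the partitions present in this micro-batch."""
    rows = batch_df.select(PARTITION_COL).distinct().collect()
    values = [row[0] for row in rows]
    if not values:
        return  # empty micro-batch, nothing to overwrite
    (
        batch_df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", build_replace_where(PARTITION_COL, values))
        .save(TARGET_PATH)
    )


def start_stream(streaming_df):
    """Attach the handler to a streaming DataFrame and start the query."""
    return (
        streaming_df.writeStream
        .foreachBatch(overwrite_touched_partitions)
        .option("checkpointLocation", CHECKPOINT_PATH)
        .start()
    )
```

Each micro-batch replaces the partitions it touches and leaves the others alone. If one micro-batch can contain several files for the same partition, the handler would still need to filter down to the latest one before writing; that step is left out here. With Delta 2.0 or later, setting the write option `partitionOverwriteMode` to `dynamic` instead of passing `replaceWhere` should give similar behavior, with the caveats mentioned above.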