delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.31k stars 1.64k forks

In batch processing, how does delta get the details of incremental changes? #614

Open Lianjzh opened 3 years ago

Lianjzh commented 3 years ago

I want to get the changed data for the detail (DWD) layer of a data warehouse. How can I do this in batch processing? I can't find a related API. Comparing two snapshots would be very expensive. Alternatively, I could parse the add file entries in the metadata corresponding to the latest history. Is there a better way to get the changed data?

Lianjzh commented 3 years ago

If there is no API, does Delta Lake plan to support incremental queries in batch processing in the future?

yijiacui-db commented 3 years ago

DeltaLog has an API called "getChanges": https://github.com/delta-io/delta/blob/3e0885618524d93f43c382763f4ec13c6a081893/src/main/scala/org/apache/spark/sql/delta/DeltaLog.scala#L223 . You can pass in the start version and get back the list of versions with their corresponding add files.

tdas commented 3 years ago

@Lianjzh @yijiacui-db You could do that, but beware that this is an internal API with no guarantees on stability. Currently there is no public API that exposes those changes. The only real "stable" API is the delta log itself! You could read the JSON files in the _delta_log directory, parse them (using Spark ;) ) to find the files that were added, and then process those files. This would be stable because the log protocol is stable.
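
A minimal sketch of the log-parsing approach described above, written in plain Python rather than Spark so that it is self-contained. It assumes the commit files are newline-delimited JSON named with a 20-digit zero-padded version (as the log protocol describes), and that each line holds a single action such as "add" or "remove". The function name `added_files_since` is hypothetical, not part of any Delta API. The sketch ignores checkpoint files, so it only sees versions whose JSON commit files are still present in the log directory.

```python
import json
import re
from pathlib import Path

# Commit files are named <20-digit zero-padded version>.json
_COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def added_files_since(table_path, start_version):
    """Map each commit version >= start_version to the data file paths it added.

    Hypothetical helper for illustration; not part of the Delta Lake API.
    """
    log_dir = Path(table_path) / "_delta_log"
    changes = {}
    # Zero-padded names sort lexicographically in version order.
    for entry in sorted(log_dir.iterdir()):
        match = _COMMIT_FILE.match(entry.name)
        if match is None:
            continue  # skip checkpoints, checksum files, and anything else
        version = int(match.group(1))
        if version < start_version:
            continue
        added = []
        with entry.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)
                add = action.get("add")
                # Skip adds flagged dataChange=false (files rewritten
                # without changing the table's logical content).
                if add is not None and add.get("dataChange", True):
                    added.append(add["path"])
        changes[version] = added
    return changes
```

The returned paths are the values recorded in the log, which are normally relative to the table root, so they still have to be resolved and read (for example as Parquet) to get the actual rows. Note that an added file is not the same as a set of changed rows: an update or delete can rewrite a whole file, so files added by such commits may also contain rows that did not change.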