Open Lianjzh opened 3 years ago
If there is no such API, does Delta Lake plan to support incremental queries in batch processing in the future?
DeltaLog has an API called "getChanges": https://github.com/delta-io/delta/blob/3e0885618524d93f43c382763f4ec13c6a081893/src/main/scala/org/apache/spark/sql/delta/DeltaLog.scala#L223 . You can pass in a start version and get back the list of versions from that point on, each with its corresponding add files.
@Lianjzh @yijiacui-db You could do that, but beware that this is an internal API with no guarantees on stability. Currently there is no public API to expose those changes. The only real "stable" API is the Delta log itself! You could read the JSON files in the _delta_log directory, parse them (using Spark ;) ) to find out which files were added, and then process them. This would be stable because the log protocol is stable.
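A minimal sketch of that log-parsing idea, using only the Python standard library instead of Spark so it stays self-contained. The function name `added_files_since` is made up for illustration and is not part of Delta Lake. It assumes the table lives on a local filesystem, that the JSON commit files for the requested versions are still present (it does not read checkpoints), and that you only care about `add` actions with `dataChange` set to true. Note that an update or delete rewrites whole files, so a newly added file can also contain rows that did not change.

```python
import json
import re
from pathlib import Path

# Commit files are named with a zero-padded 20-digit version, e.g. 00000000000000000003.json
_COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def added_files_since(table_path, start_version):
    """Hypothetical helper: map each version >= start_version to the paths it added."""
    log_dir = Path(table_path) / "_delta_log"
    changes = {}
    # Zero-padded names sort lexically in version order.
    for entry in sorted(log_dir.iterdir()):
        match = _COMMIT_FILE.match(entry.name)
        if match is None:
            continue  # skip checkpoints, checksum files, _last_checkpoint
        version = int(match.group(1))
        if version < start_version:
            continue
        added = []
        with entry.open(encoding="utf-8") as fh:
            # Each non-empty line is one JSON action (add, remove, commitInfo, ...).
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                add = json.loads(line).get("add")
                # Assumption: skip files that only rearrange existing data (dataChange=false).
                if add is not None and add.get("dataChange", True):
                    added.append(add["path"])
        changes[version] = added
    return changes
```

The returned paths are relative to the table root as recorded in the log, so they can be handed to a Parquet reader to process only the files written since the last batch run.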
I want to get the changed data for the detail (DWD) layer of a data warehouse. How can I implement this in batch processing? I can't find a related API. Comparing the difference between two snapshots would be very expensive; alternatively, I could parse the AddFile entries in the metadata corresponding to the latest history. Is there a better way to get the changed data?