Open huynguyent opened 8 months ago
Sounds good to me. Only worry is that we won't be able to access the necessary information from the transaction log without using private methods.
The Delta Rust transaction log APIs expose more info. Anyways, I am cool with this, just want to make sure we rely on public interfaces!
Sadly the transaction log information seems to only be exposed in the Scala version, not the Python one :(
https://books.japila.pl/delta-lake-internals/DeltaLog/
If we wanna do this in pyspark, we would have to reach into the JVM to get this information. Not sure if that would count as public interfaces though. The delta-spark library regularly reaches into the JVM from the pyspark side, for example
https://github.com/delta-io/delta/blob/master/python/delta/tables.py#L52C14-L52C18
@huynguyent - you can add this function to levi if you'd like: https://github.com/MrPowers/levi
There is a get_add_actions method in Delta Rust that probably exposes the necessary details: https://delta-io.github.io/delta-rs/api/delta_table/#deltalake.DeltaTable.get_add_actions
Yup, levi might be a more straightforward place for this feature. I'll raise an issue and look into implementing it
Is your feature request related to a problem? Please describe.
For streaming systems (or batch systems that run at high frequency) that write data into delta tables, it's a common problem to have lots of small files. In many cases, it's not practical to auto compact for various reasons, for example
One way to solve this is to have a separate process that performs optimization regularly on these delta tables. However, it's not a good idea to optimize the entire table every time without any constraint. A few example reasons:
Describe the solution you'd like
A helper function to find out which partitions have been updated within a given time period, for example
The exclude_optimize_operations flag is necessary because optimization operations themselves are also update operations. If they are not excluded, they might cause a feedback loop, since they will keep showing up in the output.
All the information needed for this feature should be available in the transaction log.
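To make the idea concrete, here is a rough sketch that reads the JSON commit files under _delta_log directly, using only the Python standard library. The function name and signature are hypothetical and only meant for illustration. It assumes the commits in the time window are still present as JSON files (not yet cleaned up after checkpointing), and that each commit carries a commitInfo action with timestamp (epoch milliseconds) and operation fields, which the protocol does not strictly require of every writer.

```python
import json
from pathlib import Path


def get_updated_partitions(table_path, start_ms, end_ms, exclude_optimize_operations=True):
    """Hypothetical helper: partition values touched by commits in [start_ms, end_ms]."""
    log_dir = Path(table_path) / "_delta_log"
    updated = {}  # dict used as an insertion-ordered set
    for commit_file in sorted(log_dir.glob("*.json")):
        # Only plain commit files (all-digit version names) are considered.
        if not commit_file.stem.isdigit():
            continue
        lines = commit_file.read_text().splitlines()
        actions = [json.loads(line) for line in lines if line.strip()]
        # Assumption: the writer recorded commitInfo with timestamp and operation.
        commit_info = next((a["commitInfo"] for a in actions if "commitInfo" in a), {})
        timestamp = commit_info.get("timestamp")
        if timestamp is None or not (start_ms <= timestamp <= end_ms):
            continue
        if exclude_optimize_operations and commit_info.get("operation") == "OPTIMIZE":
            continue
        for action in actions:
            for kind in ("add", "remove"):
                if kind in action:
                    values = action[kind].get("partitionValues") or {}
                    updated[tuple(sorted(values.items()))] = None
    return [dict(partition) for partition in updated]
```

The result is a list of partition value dicts in the order they first appear in the log, which a separate compaction job could turn into partition filters. An unpartitioned table would yield a single empty dict.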
Describe alternatives you've considered
Optimizing the entire table and accepting the overhead
Not sure what's a good alternative once z-order is used however
Additional context
N/A
Willingness to contribute
Would you be willing to contribute an implementation of this feature?