delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.62k stars, 1.71k forks

[BUG][Spark] Logs are not compacted #3245

Closed: Minashraf closed this issue 5 months ago

Minashraf commented 5 months ago

Bug

Which Delta project/connector is this regarding?

Describe the problem

Log files don't seem to be deleted after the retention period or after a checkpoint.

Steps to reproduce

I am using a Zeppelin notebook and loading my data from HDFS to do an update. I ran this command many times and I now have a lot of checkpoints on HDFS, but none of the logs are deleted. [image] Here is my table description: [image]

Observed results

None of the logs are deleted (100+ files).

Expected results

Older logs should be deleted, especially the ones before the checkpoint.

Further details

Environment information

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

Minashraf commented 5 months ago

Found the solution: the minimum retention is 1 day, and it can't be shortened to hours or minutes.

felipepessoto commented 4 months ago

For reference:

https://github.com/delta-io/delta/blob/97439835a4a667ac2ad86ec6054f0e85e8214760/spark/src/main/scala/org/apache/spark/sql/delta/DeltaConfig.scala#L356C1-L362C28

The shortest duration we have to keep delta files around before deleting them. We can only delete delta files that are before a compaction. We may keep files beyond this duration until the next calendar day.
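
To make the "until the next calendar day" part concrete, here is a minimal Python sketch of how I read that comment. It assumes the cleanup subtracts the retention from the current time and then rounds the result down to the start of the UTC day, and that a log file is only removable if it is older than that cutoff and a checkpoint exists at a later version. The function names and the `checkpoint_version` parameter are made up for illustration; this is not Delta's actual code.

```python
from datetime import datetime, timedelta, timezone


def log_cleanup_cutoff(now: datetime, retention: timedelta) -> datetime:
    """Assumed cutoff: (now - retention), rounded down to 00:00 UTC of that day."""
    raw = now.astimezone(timezone.utc) - retention
    return raw.replace(hour=0, minute=0, second=0, microsecond=0)


def is_deletable(file_mtime: datetime, file_version: int,
                 checkpoint_version: int, now: datetime,
                 retention: timedelta) -> bool:
    """Hypothetical check: older than the day-truncated cutoff AND before a checkpoint."""
    older_than_cutoff = file_mtime < log_cleanup_cutoff(now, retention)
    covered_by_checkpoint = file_version < checkpoint_version
    return older_than_cutoff and covered_by_checkpoint


# Example: retention of 1 hour, current time 15:30 UTC on 2024-06-10.
now = datetime(2024, 6, 10, 15, 30, tzinfo=timezone.utc)
retention = timedelta(hours=1)

# A commit written at 09:00 the same day is older than 1 hour,
# but it is still after the truncated cutoff, so it is kept.
same_day = datetime(2024, 6, 10, 9, 0, tzinfo=timezone.utc)

# A commit written the previous evening falls before the cutoff.
previous_day = datetime(2024, 6, 9, 23, 0, tzinfo=timezone.utc)

print(log_cleanup_cutoff(now, retention))
print(is_deletable(same_day, 5, 10, now, retention))
print(is_deletable(previous_day, 3, 10, now, retention))
```

Under these assumptions a retention shorter than a day behaves much like a one-day retention, which would match what was observed above: running the update many times within the same day produces checkpoints but no deleted log files.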