apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

[RewriteDataFiles] add option from-snapshot to support minor compaction #10824

Open xianyouQ opened 1 month ago

xianyouQ commented 1 month ago

Feature Request / Improvement

I propose adding a “from-snapshot” parameter to RewriteDataFiles. It can be set to the ID of a snapshot that has not been rewritten recently (within the last 20 minutes, for example), so that RewriteDataFiles only rewrites files from “from-snapshot” to the latest snapshot. The most recent snapshots of an Iceberg table will always contain many small files. If we can process them quickly during merge optimization, we can significantly reduce read amplification for MOR (merge-on-read) reads. This kind of merge can also be run frequently because it executes faster. We can still run the regular rewrite data files operation after several minor rewrites to ensure that small files and read amplification are removed completely.

Query engine

Spark
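To make the proposed selection rule concrete, here is a minimal sketch in plain Python. It is not Iceberg code: the `Snapshot` class and the `files_since` helper are hypothetical names used only for illustration. It assumes the snapshot history is ordered oldest to newest, treats "from-snapshot" as inclusive (the proposal does not say which), and ignores files that a later snapshot may have removed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not the Iceberg API)."""
    snapshot_id: int
    added_files: List[str] = field(default_factory=list)


def files_since(snapshots: List[Snapshot], from_snapshot_id: int) -> List[str]:
    """Collect data files added by snapshots from `from_snapshot_id` to the latest.

    Assumes `snapshots` is ordered oldest to newest and that the starting
    snapshot is included in the range.
    """
    ids = [s.snapshot_id for s in snapshots]
    if from_snapshot_id not in ids:
        raise ValueError(f"unknown snapshot id: {from_snapshot_id}")
    start = ids.index(from_snapshot_id)
    selected: List[str] = []
    for snap in snapshots[start:]:
        selected.extend(snap.added_files)
    return selected


history = [
    Snapshot(1, ["a.parquet"]),
    Snapshot(2, ["b.parquet", "c.parquet"]),
    Snapshot(3, ["d.parquet"]),
]
# A minor compaction starting at snapshot 2 would only consider these files.
candidates = files_since(history, 2)
```

Only the files written since the chosen snapshot become rewrite candidates, which is why such a run should be cheaper than a rewrite planned over the whole table.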

nk1506 commented 1 month ago

@xianyouQ, frequent compaction can lead to the creation of numerous orphan files and increase the chances of rewriting the same dataset multiple times. However, there is one notable advantage: in the case of partition evolution, compaction can be performed from that particular snapshot, which provides some relief.

Also, if the table is maintained regularly, you can always run RewriteDataFiles before MOR reads.

Please share your thoughts, @RussellSpitzer.

RussellSpitzer commented 1 month ago

I think this is a pretty interesting idea. I think it would be pretty useful to have that boundary of "only consider files for compacting written before xxxx". I'm not sure if we would want snapshot or timestamp there (maybe both?)

The other direction is kind of interesting to me as well, but I'm not sure that's really required. The rewrite command would already ignore older files that have already been compacted.

xianyouQ commented 1 month ago

Our use case is to divide the merge tasks into minor and full, especially when merging an upsert table that is modified frequently. A minor merge will be executed frequently to quickly merge the recently written snapshots, and a full merge will be executed after every few minor merges. I'm wondering if this is a good practice.

Please share your thoughts, @RussellSpitzer @nk1506

xianyouQ commented 1 month ago

"increase the chances of rewriting the same dataset multiple times"

As RussellSpitzer said, the rewrite command with from-snapshot would ignore files that have been compacted.

I think this is unavoidable for tables that are frequently modified: even if the dataset has been compacted, the associated delete files grow larger and larger as the previous data is subsequently updated, and the dataset then needs to be merged again. In our use case, we would set delete-file-threshold to a suitable value.
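The effect of a delete-file threshold can be sketched as follows. This is a simplified Python model, not the Iceberg planner: `DataFile` and `needs_rewrite` are hypothetical names, and the rule shown (rewrite a file if it is smaller than a minimum size, or if at least `delete_file_threshold` delete files apply to it) is an assumption about how size and delete counts interact.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical description of a data file (not the Iceberg API)."""
    path: str
    size_bytes: int
    delete_file_count: int  # number of delete files that apply to this file


def needs_rewrite(f: DataFile, min_size_bytes: int, delete_file_threshold: int) -> bool:
    """Return True if the file is too small or carries too many delete files."""
    too_small = f.size_bytes < min_size_bytes
    too_many_deletes = f.delete_file_count >= delete_file_threshold
    return too_small or too_many_deletes


MB = 1024 * 1024
files = [
    DataFile("small.parquet", 8 * MB, 0),          # small, never compacted
    DataFile("compacted.parquet", 400 * MB, 0),    # compacted, untouched since
    DataFile("compacted-upserted.parquet", 400 * MB, 3),  # compacted, then updated
]
to_rewrite = [
    f.path for f in files
    if needs_rewrite(f, min_size_bytes=128 * MB, delete_file_threshold=2)
]
```

Under these assumptions an already compacted file is left alone until enough delete files accumulate against it, which matches the point that repeated rewrites of the same data are driven by later updates rather than by the compaction itself.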