apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0

Exploring Enhanced Compaction Support in Rust #657

Open amitgilad3 opened 4 weeks ago

amitgilad3 commented 4 weeks ago

This discussion is related to issues #624 and #607. I have been investigating the compaction process in the Rust library, specifically comparing it to the Java implementation using Spark. During this investigation, I noticed a difference in how the FileScanTask class is handled between the two implementations.

In the Java version, the FileScanTask includes:

  1. A DataFile object, which provides crucial information for the rewrite step of compaction: the partition data, the spec id, and the file content type. However, I am aware that @sdd previously raised a valid concern about including this data in the FileScanTask (in https://github.com/apache/iceberg-rust/pull/607#issuecomment-2334603319).
  2. A List<DeleteFile>, which is used to remove the deleted rows from the existing files before rewriting them.
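To make the comparison concrete, the Java-side shape above could be mirrored in Rust roughly like this. This is purely a hypothetical sketch for discussion: the type and field names (FileScanTask, DataFile, DeleteFile, FileContent) are simplified stand-ins modeled on the Java interfaces, not iceberg-rust's actual API.

```rust
// Hypothetical sketch only: names mirror the Java FileScanTask,
// not iceberg-rust's actual types.

/// Content type of a file, as in the Java DataFile#content().
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileContent {
    Data,
    PositionDeletes,
    EqualityDeletes,
}

/// Minimal stand-in for the Java DataFile metadata compaction needs.
#[derive(Debug, Clone)]
struct DataFile {
    file_path: String,
    partition_spec_id: i32,
    content: FileContent,
}

/// Minimal stand-in for a delete file reference.
#[derive(Debug, Clone)]
struct DeleteFile {
    file_path: String,
}

/// What a Java-style FileScanTask carries (simplified).
struct FileScanTask {
    data_file: DataFile,
    deletes: Vec<DeleteFile>,
}

impl FileScanTask {
    /// Compaction groups files by spec id so rewritten files stay
    /// consistent with a single partition spec.
    fn spec_id(&self) -> i32 {
        self.data_file.partition_spec_id
    }
}
```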

I would like to explore the preferred approach for adding the necessary data to facilitate the implementation of compaction in the Rust library. Here are a few potential options I am considering:

  1. Add the DataFile and List<DeleteFile> fields to FileScanTask.
  2. Propose a new API that returns a more informative version of FileScanTask (perhaps FileScanPlan?), which includes the required data but is not serializable.
  3. Other possible solutions? I am open to suggestions on alternative approaches.
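For option 2, one possible shape is a richer plan type that wraps the existing task rather than changing it. Again a hypothetical sketch: FileScanPlan and the other names here are illustrative and do not exist in iceberg-rust today.

```rust
// Hypothetical sketch of option 2: a non-serializable plan type
// returned by a new API, leaving FileScanTask itself lean.

#[derive(Debug, Clone)]
struct FileScanTask {
    // Stays small and serializable, as it is today.
    data_file_path: String,
}

#[derive(Debug, Clone)]
struct DataFile {
    partition_spec_id: i32,
}

#[derive(Debug, Clone)]
struct DeleteFile {
    file_path: String,
}

/// A richer scan plan carrying the metadata compaction needs,
/// without forcing that data into every FileScanTask.
struct FileScanPlan {
    task: FileScanTask,
    data_file: DataFile,
    deletes: Vec<DeleteFile>,
}

impl FileScanPlan {
    /// True if delete files must be applied before rewriting.
    fn needs_delete_application(&self) -> bool {
        !self.deletes.is_empty()
    }
}
```

The appeal of this split is that the serializable task can still be shipped across workers, while the heavier plan metadata stays on the planning side.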

I also tried to map out the logic of the Java + Spark implementation to help us understand the flow, in the hope that we can do the same with Rust and DataFusion, and maybe Comet.

Would love to get your input @sdd @Xuanwo & @ZENOTME

[Attached diagram: compaction flow of RewriteDataFilesSparkAction (draw.io)]

liurenjie1024 commented 2 weeks ago

Thanks @amitgilad3 for raising this. I think compaction is a relatively complex topic, and we are still some way from being able to support it. For example, we don't yet support reading delete files, and we don't have a transaction API. Compaction also typically requires a distributed compute engine. I think a better approach would be to provide the necessary primitives in this library and help other distributed engines build compaction on top of them.
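As an illustration of what such a primitive might look like, here is a hypothetical sketch of one planning building block: grouping scan tasks by partition spec, which is one of the steps the Java RewriteDataFilesSparkAction performs before handing groups to the engine. None of these names exist in iceberg-rust; the point is only the shape of a library-side primitive that an engine like DataFusion could call.

```rust
// Hypothetical planning primitive: group scan tasks by partition spec
// so each rewrite group is self-consistent. All names are illustrative.

use std::collections::HashMap;

#[derive(Debug, Clone)]
struct FileScanTask {
    file_path: String,
    partition_spec_id: i32,
    file_size_bytes: u64,
}

/// Group tasks by spec id; a distributed engine would then fan each
/// group out as an independent rewrite unit.
fn group_by_spec(tasks: Vec<FileScanTask>) -> HashMap<i32, Vec<FileScanTask>> {
    let mut groups: HashMap<i32, Vec<FileScanTask>> = HashMap::new();
    for task in tasks {
        groups.entry(task.partition_spec_id).or_default().push(task);
    }
    groups
}
```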