apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0

Exploring Enhanced Compaction Support in Rust #657

Open amitgilad3 opened 4 weeks ago

amitgilad3 commented 4 weeks ago

This discussion is related to issues #624 and #607. I have been investigating the compaction process in the Rust library, specifically comparing it to the Java implementation using Spark. During this investigation, I noticed a difference in how the FileScanTask class is handled between the two implementations.

In the Java version, the FileScanTask includes:

  1. A DataFile object, which provides crucial information for the rewrite step of compaction: the partition data, the spec id, and the file content type. However, I am aware that @sdd previously raised a valid concern about including this data in the FileScanTask (in https://github.com/apache/iceberg-rust/pull/607#issuecomment-2334603319).
  2. A List<DeleteFile>, which is used to remove the deleted rows from the existing files before rewriting them.
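To make the comparison concrete, the Java-side shape above could be mirrored in Rust roughly like this. This is purely a hypothetical sketch for discussion: the type and field names (FileScanTask, DataFile, DeleteFile, FileContent) are simplified stand-ins modeled on the Java interfaces, not iceberg-rust's actual API.

```rust
// Hypothetical sketch only: names mirror the Java FileScanTask,
// not iceberg-rust's actual types.

/// Content type of a file, as in the Java DataFile#content().
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileContent {
    Data,
    PositionDeletes,
    EqualityDeletes,
}

/// Minimal stand-in for the Java DataFile metadata compaction needs.
#[derive(Debug, Clone)]
struct DataFile {
    file_path: String,
    partition_spec_id: i32,
    content: FileContent,
}

/// Minimal stand-in for a delete file reference.
#[derive(Debug, Clone)]
struct DeleteFile {
    file_path: String,
}

/// What a Java-style FileScanTask carries (simplified).
struct FileScanTask {
    data_file: DataFile,
    deletes: Vec<DeleteFile>,
}

impl FileScanTask {
    /// Compaction groups files by spec id so rewritten files stay
    /// consistent with a single partition spec.
    fn spec_id(&self) -> i32 {
        self.data_file.partition_spec_id
    }
}
```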

I would like to explore the preferred approach for adding the necessary data to facilitate the implementation of compaction in the Rust library. Here are a few potential options I am considering:

  1. Add the DataFile and List<DeleteFile> fields to FileScanTask.
  2. Propose a new API that returns a more informative version of FileScanTask (perhaps FileScanPlan?), which includes the required data but is not serializable.
  3. Other possible solutions? I am open to suggestions on alternative approaches.
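For option 2, one possible shape is a richer plan type that wraps the existing task rather than changing it. Again a hypothetical sketch: FileScanPlan and the other names here are illustrative and do not exist in iceberg-rust today.

```rust
// Hypothetical sketch of option 2: a non-serializable plan type
// returned by a new API, leaving FileScanTask itself lean.

#[derive(Debug, Clone)]
struct FileScanTask {
    // Stays small and serializable, as it is today.
    data_file_path: String,
}

#[derive(Debug, Clone)]
struct DataFile {
    partition_spec_id: i32,
}

#[derive(Debug, Clone)]
struct DeleteFile {
    file_path: String,
}

/// A richer scan plan carrying the metadata compaction needs,
/// without forcing that data into every FileScanTask.
struct FileScanPlan {
    task: FileScanTask,
    data_file: DataFile,
    deletes: Vec<DeleteFile>,
}

impl FileScanPlan {
    /// True if delete files must be applied before rewriting.
    fn needs_delete_application(&self) -> bool {
        !self.deletes.is_empty()
    }
}
```

The appeal of this split is that the serializable task can still be shipped across workers, while the heavier plan metadata stays on the planning side.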

I also tried to map out the logic of the Java + Spark implementation to help us understand the flow, in the hope that we can do the same with Rust and DataFusion, and maybe Comet.

Would love to get your input @sdd @Xuanwo & @ZENOTME

[Attached diagram: compaction flow of RewriteDataFilesSparkAction (draw.io)]

liurenjie1024 commented 2 weeks ago

Thanks @amitgilad3 for raising this. I think compaction is a relatively complex topic, and we are still some way from being able to support it. For example, we don't yet support reading delete files, and we don't have a transaction API. Compaction also typically requires a distributed compute engine. I think a better approach would be to provide the necessary primitives in this library and help other distributed engines build compaction on top of them.
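As an illustration of what such a primitive might look like, here is a hypothetical sketch of one planning building block: grouping scan tasks by partition spec, which is one of the steps the Java RewriteDataFilesSparkAction performs before handing groups to the engine. None of these names exist in iceberg-rust; the point is only the shape of a library-side primitive that an engine like DataFusion could call.

```rust
// Hypothetical planning primitive: group scan tasks by partition spec
// so each rewrite group is self-consistent. All names are illustrative.

use std::collections::HashMap;

#[derive(Debug, Clone)]
struct FileScanTask {
    file_path: String,
    partition_spec_id: i32,
    file_size_bytes: u64,
}

/// Group tasks by spec id; a distributed engine would then fan each
/// group out as an independent rewrite unit.
fn group_by_spec(tasks: Vec<FileScanTask>) -> HashMap<i32, Vec<FileScanTask>> {
    let mut groups: HashMap<i32, Vec<FileScanTask>> = HashMap::new();
    for task in tasks {
        groups.entry(task.partition_spec_id).or_default().push(task);
    }
    groups
}
```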