Open · amitgilad3 opened 1 month ago
Thanks @amitgilad3 for raising this. I think compaction is a relatively complex topic, and we are somewhat far from being able to do it. For example, we don't support reading deletion files, and we don't have a transaction API. Also, compaction typically requires a distributed computing engine to process it. I think a better approach would be to provide the necessary primitives in this library, and help other distributed engines do the compaction themselves?
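To make "primitives" concrete, here is one possible shape for the division of labor (a rough sketch only; none of these types or methods exist in iceberg-rust today, and the stub types stand in for the library's real metadata structs):

```rust
// Hypothetical sketch only: none of these names are current iceberg-rust
// API. Stub types stand in for the library's real metadata structs.
pub struct DataFile;   // data-file metadata (path, partition, spec id, ...)
pub struct DeleteFile; // positional / equality delete metadata

/// What the library could hand to an external engine: the data files
/// selected for rewrite, plus the delete files that must be applied
/// while reading them.
pub struct RewriteInput {
    pub data_files: Vec<DataFile>,
    pub delete_files: Vec<DeleteFile>,
}

/// A transaction primitive the engine would call after it has written
/// the replacement files, mirroring the shape of Java's `RewriteFiles`
/// action: the library only validates and commits the swap.
pub trait RewriteFilesAction {
    fn delete_file(self, file: DataFile) -> Self;
    fn add_file(self, file: DataFile) -> Self;
    fn commit(self) -> Result<(), String>;
}
```

Under this split, the library never runs the rewrite itself; it plans the inputs and commits the outputs, while Spark, DataFusion, or another engine does the heavy lifting in between.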
This discussion is related to issues #624 and #607. I have been investigating the compaction process in the Rust library, specifically comparing it to the Java implementation using Spark. During this investigation, I noticed a difference in how the `FileScanTask` class is handled between the two implementations.

In the Java version, the `FileScanTask` includes a `DataFile` object, which provides crucial information about the file's partition, `specId`, and `content`. This information is necessary for the rewrite step in compaction. However, I am aware that @sdd previously raised a valid concern about including this data in the `FileScanTask` (see this comment: https://github.com/apache/iceberg-rust/pull/607#issuecomment-2334603319).

I would like to explore the preferred approach for adding the necessary data to facilitate implementing compaction in the Rust library. Here are a few potential options I am considering:
1. Adding the required data directly to the `FileScanTask`.
2. Creating a separate struct (a `FileScanPlan`?) on top of `FileScanTask`, which includes the required data but is not serializable (see the sketch below).
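To make the two options concrete, here is a rough sketch of what each might look like. All field and struct names here are my guesses for illustration, not the library's actual definitions:

```rust
// Illustrative stand-ins for the real metadata types; names are guesses.
pub struct PartitionValues;
pub enum FileContent { Data, PositionDeletes, EqualityDeletes }

// Option 1: put the metadata on FileScanTask itself. One type everywhere,
// but the new fields must survive (de)serialization.
mod option_one {
    use super::{FileContent, PartitionValues};
    pub struct FileScanTask {
        pub data_file_path: String,
        pub start: u64,
        pub length: u64,
        // new fields needed for compaction rewrites:
        pub partition: PartitionValues,
        pub partition_spec_id: i32,
        pub content: FileContent,
    }
}

// Option 2: leave FileScanTask untouched and add a plan-side wrapper
// (`FileScanPlan`?) that carries the full data-file metadata but is
// never serialized.
mod option_two {
    use super::{FileContent, PartitionValues};
    pub struct FileScanTask {
        pub data_file_path: String,
        pub start: u64,
        pub length: u64,
    }
    pub struct FileScanPlan {
        pub task: FileScanTask,         // serializable read task
        pub partition: PartitionValues, // plan-side only
        pub partition_spec_id: i32,
        pub content: FileContent,
    }
}
```

Option 1 is simpler for consumers, while option 2 keeps the serialized task small and sidesteps the serialization concern @sdd raised.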
I also tried to map out the logic of the Java + Spark implementation, to help us understand the flow, in the hope that we can do the same with Rust and DataFusion, and maybe Comet; a rough outline of that flow follows.
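For reference, the Spark flow as I mapped it looks roughly like this, expressed as a Rust-flavored outline. Everything here is illustrative scaffolding with names of my own choosing, not an existing API in either library:

```rust
// Rust-flavored outline of the Java + Spark rewriteDataFiles flow.
// All types and functions are illustrative stubs, not an existing API.

struct Table;
struct FileScanTask;          // would carry the DataFile metadata
struct FileGroup(Vec<FileScanTask>);
struct RewriteResult;         // added + removed files for one group

const TARGET_FILE_SIZE_BYTES: u64 = 512 * 1024 * 1024;

fn plan_file_scan_tasks(_t: &Table) -> Result<Vec<FileScanTask>, String> {
    Ok(Vec::new())
}
fn bin_pack_by_partition(tasks: Vec<FileScanTask>, _target: u64) -> Vec<FileGroup> {
    vec![FileGroup(tasks)]
}
fn rewrite_group(_g: &FileGroup) -> Result<RewriteResult, String> {
    Ok(RewriteResult)
}
fn commit_rewrite(_t: &mut Table, _r: Vec<RewriteResult>) -> Result<(), String> {
    Ok(())
}

fn rewrite_data_files(table: &mut Table) -> Result<(), String> {
    // 1. Plan: collect file scan tasks, each with its DataFile
    //    metadata (partition, spec id, content).
    let tasks = plan_file_scan_tasks(table)?;

    // 2. Group: bucket tasks by partition and bin-pack them into
    //    rewrite groups according to the target file size.
    let groups = bin_pack_by_partition(tasks, TARGET_FILE_SIZE_BYTES);

    // 3. Rewrite: the engine (Spark in Java; DataFusion/Comet here)
    //    reads each group, applies delete files, writes new files.
    let results: Result<Vec<_>, _> = groups.iter().map(rewrite_group).collect();

    // 4. Commit: atomically swap old files for new ones in a single
    //    RewriteFiles-style transaction.
    commit_rewrite(table, results?)
}
```

Steps 1, 2, and 4 are exactly the kind of primitives this library could own, while step 3 is where the distributed engine plugs in.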
Would love to get your input @sdd @Xuanwo & @ZENOTME