apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Why is Hudi ConsistentBucketClusteringExecutionStrategy not supported by the Flink engine? #11636

Open pursuit-wangpz opened 1 month ago

pursuit-wangpz commented 1 month ago

Upon reviewing the source code, it is evident that the ConsistentBucketClusteringExecutionStrategy is only implemented for the Spark engine.

danny0405 commented 1 month ago

Because it's hard for Flink to support both compaction and clustering execution in the same pipeline, Flink currently supports only clustering plan generation for consistent hashing; a separate clustering job is needed for execution.

pursuit-wangpz commented 1 month ago

Because it's hard for Flink to support both compaction and clustering execution in the same pipeline, Flink currently supports only clustering plan generation for consistent hashing; a separate clustering job is needed for execution.

However, it seems that org.apache.hudi.sink.clustering.HoodieFlinkClusteringJob does not support ConsistentBucketClusteringExecutionStrategy, which can only be specified with the Spark engine via org.apache.hudi.client.clustering.run.strategy.SparkConsistentBucketClusteringExecutionStrategy. This implies that Hudi requires two engines to complete consistent bucket clustering: the Flink engine to generate the plan, and the Spark engine to execute it.
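For illustration, here is a rough sketch of the Spark-side options such a setup would likely involve. The execution strategy class name is the one quoted above. The option keys (`hoodie.index.type`, `hoodie.index.bucket.engine`, `hoodie.clustering.execution.strategy.class`) are my reading of the Hudi configuration reference and should be checked against the Hudi version in use. The helper names are hypothetical. Whether a Spark job can execute a plan generated by Flink is not confirmed in this thread.

```python
# Sketch only: Spark-side Hudi options for consistent bucket clustering.
# Option keys are assumptions based on the Hudi configuration reference;
# verify them against the Hudi release you run.

# Class name as quoted in this thread.
EXECUTION_STRATEGY = (
    "org.apache.hudi.client.clustering.run.strategy."
    "SparkConsistentBucketClusteringExecutionStrategy"
)


def consistent_bucket_clustering_options():
    """Return the options as a plain dict (hypothetical helper)."""
    return {
        "hoodie.index.type": "BUCKET",
        "hoodie.index.bucket.engine": "CONSISTENT_HASHING",
        "hoodie.clustering.execution.strategy.class": EXECUTION_STRATEGY,
    }


def to_properties(options):
    """Render options as key=value lines, sorted by key for stable output."""
    return "\n".join(f"{key}={value}" for key, value in sorted(options.items()))


if __name__ == "__main__":
    print(to_properties(consistent_bucket_clustering_options()))
```

The rendered lines could be saved as a properties file and passed to whichever Spark clustering job is used to execute the plan.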

danny0405 commented 1 month ago

I think so. @beyond1920, can you chime in with more insights?