ertanden opened 1 year ago
@ertanden A fix was submitted recently to skip clustering for a single input file group.
You will have to set these two configurations accordingly for it to take effect:
`hoodie.clustering.plan.strategy.single.group.clustering.enabled=false`
`hoodie.clustering.plan.strategy.sort.columns=""`
Note that `hoodie.clustering.plan.strategy.sort.columns` must be null or empty.
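Since you are writing from Flink, one way to apply these options is through the table's `WITH` clause, as Hudi's Flink connector passes `hoodie.*` options through. This is a hedged sketch only: the table name, schema, and path are placeholders, and whether the single-group option exists depends on your Hudi version (it came in with the fix mentioned above).

```sql
-- Hypothetical sketch; schema, table name, and path are placeholders.
CREATE TABLE hudi_sink (
  id STRING,
  ts TIMESTAMP(3),
  `day` STRING
) PARTITIONED BY (`day`) WITH (
  'connector' = 'hudi',
  'path' = 's3a://bucket/path',
  -- skip clustering plans that would only rewrite a single file group
  'hoodie.clustering.plan.strategy.single.group.clustering.enabled' = 'false',
  -- must be null/empty for the skip to apply
  'hoodie.clustering.plan.strategy.sort.columns' = ''
);
```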
P.S. I noticed that the configuration page hasn't been updated, which is why this feature isn't documented on the official docsite.
@voonhous thanks for the information, very helpful. Indeed, it would be nice to have this documented. I think the issue can be closed once the documentation is in place.
On the other hand, I thought this should be the default behavior. The extra configuration needed seems a bit sketchy. For example, it is recommended to have a sortable key etc. for clustering to work better, but then to prevent constant file rewrites we need to disable the sort columns?
Sorry, I may not have enough information about the internals of how clustering works; I'm just trying to express what makes sense. I just feel that the default behavior is not optimal, or even buggy.
@ertanden, I agree that these two configurations conflict with each other.
Clustering is usually performed together with sorting, and to disable single-file-group clustering, we are giving up sorting.
Should offline clustering be enabled as a scheduled service, this would mean that these two "features" are mutually exclusive: you can have one, but not both.
As such, there's a PR here that is working on improving this: https://github.com/apache/hudi/pull/8760
IMO, if there's only a single file group being considered for clustering, sorting shouldn't be applied, as it will not yield any significant read-performance gain on a file that is predetermined to be "small" anyway.
Hence, I do agree that the default behaviour can be improved.
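To make the intended behavior concrete, here is a hedged, hypothetical sketch in plain Python (not actual Hudi code) of the planning rule being discussed: a single candidate file group is only re-clustered if single-group clustering is explicitly enabled or sort columns are set, so with both configurations above a single-file partition is skipped rather than rewritten as a no-op.

```python
# Hypothetical model of the clustering-plan decision (NOT Hudi internals):
# rewriting one small file in sorted order gives no read benefit, so a
# single-group plan is only generated when explicitly requested.

def should_cluster(file_groups, single_group_clustering_enabled, sort_columns):
    """Return True if a clustering plan should be generated."""
    if len(file_groups) > 1:
        return True  # multiple small files: worth merging (and sorting)
    # Exactly one candidate group: re-cluster only if the single-group
    # flag is enabled or a sort order is requested.
    return single_group_clustering_enabled or bool(sort_columns)

# With the suggested configuration, a single-file partition is skipped:
print(should_cluster(["fg-001"], False, ""))            # False
print(should_cluster(["fg-001", "fg-002"], False, ""))  # True
```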
Describe the problem you faced
We have an append-mode COW table into which we sink messages coming from a Kafka topic. The table is partitioned by day.
Clustering is enabled, and the cleaner is configured with KEEP_LATEST_FILE_VERSIONS to retain only 1 file version.
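For reference, the cleaner side of our setup corresponds roughly to the following write configs (a hedged sketch; please double-check key names against the configuration reference for your Hudi version):

```properties
# Keep only the latest file version per file group
hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
hoodie.cleaner.fileversions.retained=1
```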
The problem is that whenever clustering is triggered, a `replacecommit` is still periodically applied, even though the previous day's partitions already have just a single file and no new commits. So it seems to copy the single file unnecessarily to a new one, after which the cleaner gets rid of the old file. I tried setting the clustering filter mode to RECENT_DAYS, but it still does the same thing for yesterday's partition. You can see below the timeline for yesterday's partition (2023-07-04). Am I missing something? Is there some configuration I need to set to prevent this? Or is this a bug?
Thanks!
Expected behavior
I don't expect a `replacecommit` when the partition has only a single file and no new commits.
Environment Description
* Hudi version : 0.13.1
* Flink version : 1.16.2
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes, Kubernetes
Additional context