Open bk-mz opened 4 months ago
Just a side note: we were re-creating metadata for that table.
We have an option, hoodie.archive.merge.enable,
for archival log merging; archival cleaning is introduced only after the 1.0 release.
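For reference, the option mentioned above is set like any other Hudi write config. A minimal sketch of the relevant knobs is below; hoodie.keep.min.commits and hoodie.keep.max.commits control when commits get archived, while hoodie.archive.merge.enable merges small archive files (useful on storage like S3). Check these against the Hudi configuration reference for your version before relying on them.

```properties
# Merge small archived log files into larger ones (helps on S3)
hoodie.archive.merge.enable=true
# Archival window: commits beyond max are archived down to min
hoodie.keep.min.commits=20
hoodie.keep.max.commits=30
```

Note that none of these settings delete anything from the archived/ folder; they only control how the archive is written.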
Pointing to the Slack thread here as well: https://apache-hudi.slack.com/archives/C4D716NPQ/p1711531654297129
Maybe we should introduce an ArchivalClean table service to auto-clean files older than, say, 2 months. Not many users are going to inspect the archival timeline after 2+ months, and it will avoid accumulating the entire history. Interested users can still choose not to clean it up.
Describe the problem you faced
There's no way to control the archived/ folder size and no way to trigger its cleaning.
We have a long-running table that has accumulated a lot of archives (~100 GB), which now degrades cleaner performance and the overall performance of the ingestion process.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
There should be a description somewhere in the documentation stating how to maintain the archived/ folder.
Upkeep of the archived/ folder should be delegated to the cleaner.
Environment Description
Hudi version : 0.14.0
Spark version : 3.4.1
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) :
Additional context
Related slack thread: https://apache-hudi.slack.com/archives/C4D716NPQ/p1711531654297129
Stacktrace
Add the stacktrace of the error.