[SUPPORT] No way to clean `archived/` folder

bk-mz commented 4 months ago

Describe the problem you faced

There's no way to control archived/ folder size and no way to trigger its cleaning.

We have a long-running table that has accumulated a lot of archives (~100 GB), which now degrades cleaner performance and the overall performance of the ingestion process.

To Reproduce

Steps to reproduce the behavior:

Go to All Configurations on the Hudi site
Check for any settings that control the archived/ folder of Hudi
Confirm there are none

Expected behavior

The documentation should describe how to maintain the archived/ folder.

Upkeep of the archived/ folder should be delegated to the cleaner.

Environment Description

Hudi version : 0.14.0
Spark version : 3.4.1
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) :

Additional context

Stacktrace

Add the stacktrace of the error.

bk-mz commented 4 months ago

Just a side note: we were re-creating metadata for that table.

danny0405 commented 4 months ago

We have some options, such as hoodie.archive.merge.enable, for archival log merging; cleaning is introduced only after the 1.0 release.
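
For anyone landing here, a minimal sketch of how the option mentioned above could be passed to a writer. Only the key hoodie.archive.merge.enable comes from this thread; the helper name, the table name, and the commented-out write call are illustrative assumptions. Per the comment above, this option concerns archival log merging; it is not a way to delete archived files.

```python
def with_archive_merge(hudi_options, enabled=True):
    """Return a copy of the writer options with archival log merging toggled.

    `with_archive_merge` is a hypothetical helper, not part of Hudi.
    """
    merged = dict(hudi_options)  # copy so the caller's dict is not mutated
    merged["hoodie.archive.merge.enable"] = "true" if enabled else "false"
    return merged


# Hypothetical base options; replace with the table's real writer configuration.
base_options = {"hoodie.table.name": "my_table"}
write_options = with_archive_merge(base_options)

# Assumed usage with a Spark DataFrame `df` and a table path `base_path`:
# df.write.format("hudi").options(**write_options).mode("append").save(base_path)
```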

ad1happy2go commented 4 months ago

Pointing to the Slack thread here as well: https://apache-hudi.slack.com/archives/C4D716NPQ/p1711531654297129

nsivabalan commented 3 months ago

Maybe we should introduce an ArchivalClean table service to auto-clean files older than, say, 2 months. Not many users are going to inspect the archival timeline after 2+ months, and it would avoid accumulating the entire history. Interested users can still choose not to clean it up.
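
To make the proposal concrete, here is a rough sketch of the age-based selection such a service might perform. This is not an existing Hudi API: the function name, the 60-day default (standing in for "2 months"), and the file names are all hypothetical, and the sketch only selects candidates without deleting anything. Whether removing these files is safe for a given table is not established in this thread.

```python
from datetime import datetime, timedelta, timezone


def select_expired_archives(files, retention=timedelta(days=60), now=None):
    """Return paths whose last-modified time is older than the retention window.

    `files` is an iterable of (path, last_modified) pairs, where last_modified
    is a timezone-aware datetime, e.g. taken from a storage listing.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - retention
    return [path for path, modified in files if modified < cutoff]


# Hypothetical listing of an archived/ folder (names are made up).
reference_time = datetime(2024, 6, 1, tzinfo=timezone.utc)
listing = [
    ("archived/file_a", datetime(2024, 1, 15, tzinfo=timezone.utc)),
    ("archived/file_b", datetime(2024, 5, 20, tzinfo=timezone.utc)),
]
expired = select_expired_archives(listing, now=reference_time)
```

Users who want to keep the full history would simply not run the selection, which matches the opt-out idea in the comment above.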

apache / hudi

[SUPPORT] No way to clean `archived/` folder #10930