apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] No way to clean `archived/` folder #10930

Open bk-mz opened 4 months ago

bk-mz commented 4 months ago

Describe the problem you faced

There's no way to control archived/ folder size and no way to trigger its cleaning.

We have a long-running table that has accumulated a lot of archives (~100 GB), which now degrades cleaner performance and the overall performance of the ingestion process.

To Reproduce

Steps to reproduce the behavior:

  1. Go to All Configurations on the Hudi site
  2. Check for any settings that control the archived/ folder of Hudi
  3. Confirm there are none

Expected behavior

There should be a description somewhere in the documentation of how to maintain the archived/ folder.

Upkeep of the archived/ folder should be delegated to the cleaner.

Environment Description

Additional context

Related slack thread: https://apache-hudi.slack.com/archives/C4D716NPQ/p1711531654297129

bk-mz commented 4 months ago

just a side note, we were re-creating metadata for that table.

danny0405 commented 4 months ago

We have some options, such as hoodie.archive.merge.enable, for archival log merging; cleaning is introduced only after the 1.0 release.
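
For anyone landing here, a minimal sketch of passing the option mentioned above to a writer. Only the key hoodie.archive.merge.enable comes from this thread; the helper name `archival_options` is hypothetical. As far as I understand, merging reduces the number of small archive files rather than deleting history, so it may not shrink the folder by itself.

```python
def archival_options(merge_small_files: bool = True) -> dict:
    """Build Hudi write options related to archival-log merging.

    `archival_options` is a hypothetical helper for illustration only.
    """
    return {
        # Option named in this thread; Hudi options are passed as strings.
        "hoodie.archive.merge.enable": str(merge_small_files).lower(),
    }


# Merge with whatever options the writer already uses (placeholder values).
existing_options = {"hoodie.table.name": "my_table"}
write_options = {**existing_options, **archival_options()}

# With a Spark DataFrame `df`, these would typically be applied as:
#   df.write.format("hudi").options(**write_options).mode("append").save(path)
```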

ad1happy2go commented 4 months ago

Pointing to slack thread also here - https://apache-hudi.slack.com/archives/C4D716NPQ/p1711531654297129

nsivabalan commented 3 months ago

Maybe we should introduce an ArchivalClean table service to auto-clean files older than, say, 2 months. Not many users are going to inspect the archived timeline after 2+ months, and it would avoid accumulating the entire history. Interested users can still choose not to clean it up.
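
A rough sketch of the age-based selection such a service might do, assuming the archived folder is reachable as a local directory and that file modification time is a good enough proxy for age. The function name `find_stale_archive_files` and the 60-day default are illustrative, not an existing Hudi API, and it only lists candidates without deleting anything.

```python
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 24 * 60 * 60


def find_stale_archive_files(
    archived_dir: str,
    max_age_days: int = 60,
    now: Optional[float] = None,
) -> List[Path]:
    """Return files in `archived_dir` last modified more than `max_age_days` ago.

    Hypothetical helper: dry-run only, nothing is deleted.
    """
    current = time.time() if now is None else now
    cutoff = current - max_age_days * SECONDS_PER_DAY
    stale = [
        p
        for p in Path(archived_dir).iterdir()
        # Skip sub-directories; compare modification time against the cutoff.
        if p.is_file() and p.stat().st_mtime < cutoff
    ]
    return sorted(stale)
```

A real table usually lives on HDFS or object storage, so the listing would need to go through the matching filesystem client instead of pathlib, and whether removing these files is safe for a given Hudi version is an open question in this thread.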