apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

Hoodie clean is not deleting old files for MOR table #7600

Open SabyasachiDasTR opened 1 year ago

SabyasachiDasTR commented 1 year ago

Describe the problem you faced

We are incrementally upserting data into our Hudi table/s every 5 minutes. We have set CLEANER_POLICY as KEEP_LATEST_BY_HOURS with CLEANER_HOURS_RETAINED = 48.

Old delta log files from 2 months back are still present in our partitions, and the CLI shows the last cleanup happened 2 months ago, in November. I do not see any cleaning action being performed on the old log files. The only operation we execute is upsert, with a single writer, and compaction runs every hour. We think this is causing our EMR job to underperform and crash multiple times, as a very large number of delta log files pile up in the partitions and compaction has to read them all while processing the job.

(screenshots attached: MicrosoftTeams-image (33), MicrosoftTeams-image (34))

Options used during upsert: see the attached HudiOptionsLatest.

Writing to S3 (see the attached Upsertcmd). Partition structure: s3://bucket/table/partition/ containing parquet and .log files.
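For context, our write configuration is roughly along these lines (a simplified sketch only; the table name, key fields and base path below are placeholders, not our real values, which are in the attached HudiOptionsLatest):

    // Simplified sketch of the upsert options (placeholders, not our actual config).
    val hudiOptions = Map(
      "hoodie.table.name"                           -> "my_table",              // placeholder
      "hoodie.datasource.write.table.type"          -> "MERGE_ON_READ",
      "hoodie.datasource.write.operation"           -> "upsert",
      "hoodie.datasource.write.recordkey.field"     -> "id",                    // placeholder
      "hoodie.datasource.write.partitionpath.field" -> "partition",             // placeholder
      "hoodie.datasource.write.precombine.field"    -> "ts",                    // placeholder
      "hoodie.clean.automatic"                      -> "true",
      "hoodie.cleaner.policy"                       -> "KEEP_LATEST_BY_HOURS",
      "hoodie.cleaner.hours.retained"               -> "48"
    )

    // incrementalDf is the 5-minute batch being upserted.
    incrementalDf.write
      .format("hudi")
      .options(hudiOptions)
      .mode("append")
      .save("s3://bucket/table")  // placeholder base path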

Expected behavior: as per my understanding, log files older than CLEANER_HOURS_RETAINED (2 days) should be deleted.

Environment Description

SabyasachiDasTR commented 1 year ago

@nsivabalan we referred to https://github.com/apache/hudi/issues/3739 but we are using different configs for CLEANER_POLICY. Could you please treat this as high priority and advise, as this is failing our prod job.

SabyasachiDasTR commented 1 year ago

Also, we want to understand the impact of executing the "cleans run" command manually from the CLI. We have verified that compaction and commits are working up to the latest instant, but cleanup is not triggering automatically after that. If we execute the "cleans run" command manually from the CLI, will it impact the data?

xushiyan commented 1 year ago

@SabyasachiDasTR have you observed any error or warning in the logs? It's likely that something is blocking the clean or failing it. Can you search the logs for any statement related to "clean"? It looks like cleaning just stopped at some point.

Yes, you can use the CLI to trigger a clean manually; it is the legitimate tool for performing cleaning. If you want to be cautious, you can run it against a clone of the table first. If something is failing the clean, you'll get the same result there, though, so you still need to check the logs.
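Roughly, a manual clean from hudi-cli would look like the following: connect to the table base path (example path below), optionally inspect past cleans with cleans show, and then trigger cleans run:

    connect --path s3://bucket/table
    cleans show
    cleans run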

Duplicate issue https://github.com/apache/hudi/issues/7530

SabyasachiDasTR commented 1 year ago

Hi @xushiyan, we enabled Hudi debug logging and scanned all the container logs. We did not find any ERROR or WARN logs related to 'clean'. Below are the INFO logs; it looks like the cleaner is not able to find the point in time from which it has to clean. What could be the reason?

FYI, we did try the 'cleans run' command on one of our tables and it executed successfully and cleaned a lot of files. But auto clean is still not triggering on any of the tables, which is causing the number of log files to keep growing.

stderr.2023-01-09-10:2023-01-09T11:47:59.346+0000 [INFO] [1673249876388qa_correlation_id] [org.apache.hudi.client.BaseHoodieWriteClient] [BaseHoodieWriteClient]: Start to clean synchronously.
stderr.2023-01-09-10:2023-01-09T11:48:00.062+0000 [INFO] [1673249876388qa_correlation_id] [org.apache.hudi.client.BaseHoodieWriteClient] [BaseHoodieWriteClient]: Scheduling cleaning at instant time :20230109114759346
stderr.2023-01-09-10:2023-01-09T11:48:01.308+0000 [INFO] [1673249876388qa_correlation_id] [org.apache.hudi.table.action.clean.CleanPlanner] [CleanPlanner]: No earliest commit to retain. No need to scan partitions !!
stderr.2023-01-09-10:2023-01-09T11:48:01.308+0000 [INFO] [1673249876388qa_correlation_id] [org.apache.hudi.table.action.clean.CleanPlanner] [CleanPlanner]: Nothing to clean here. It is already clean

As per the logs, "Nothing to clean here. It is already clean", but we do see a lot of log files from 2 months back. I have attached the generic logs here: AllErrorLogs.txt

AllWARNLogs.txt

HudiErrorLogs.txt

HudiWARNLogs.txt

SabyasachiDasTR commented 1 year ago

Hi @xushiyan, any thoughts on the above logs?

koochiswathiTR commented 1 year ago

@xushiyan, we are badly missing our SLAs as the number of log files keeps growing, and the accumulated data size is more than 120 TB. Any help is much appreciated.

koochiswathiTR commented 1 year ago

09T11:48:01.308+0000 [INFO] [1673249876388qa_correlation_id] [org.apache.hudi.table.action.clean.CleanPlanner] [CleanPlanner]: Nothing to clean here. It is already clean

Looking at this log, CleanPlanner.getPartitionPathsForCleanByCommits is not returning any list back, so cleanup is not triggering.

@xushiyan @nsivabalan Please help here.

koochiswathiTR commented 1 year ago

Any update on this?

danny0405 commented 1 year ago

cc @nsivabalan, can you take a look?

koochiswathiTR commented 1 year ago

@nsivabalan Any update on this?

nsivabalan commented 1 year ago

Sorry, this slipped off my radar. Are you folks in the #general channel of the Hudi Slack workspace? Let's connect there; we might need to inspect the timeline and see what's going on.

umehrot2 commented 1 year ago

@SabyasachiDasTR @koochiswathiTR The issue here is similar to https://github.com/apache/hudi/issues/3739 . I believe what is happening is that you are setting CLEANER_HOURS_RETAINED to 2 days, but meanwhile archival is running more aggressively. By default, archival will keep a maximum of 30 commits in the active timeline - https://hudi.apache.org/docs/0.11.1/configurations#hoodiekeepmaxcommits. Hence, in your case, by the time the cleaner runs and tries to clean up commits older than 2 days, those commits have already been archived. So even though the cleaner is scheduled, it is not finding anything to clean, which matches the logs you have provided.

If you want to continue with your current cleaner config, you should set https://hudi.apache.org/docs/0.11.1/configurations#hoodiekeepmaxcommits to be higher than the number of commits you accumulate in a span of 2 days. Essentially, you want the cleaner to run at a higher frequency than archival.

As for cleaning up the old data, you should disable https://hudi.apache.org/docs/configurations/#hoodiecleanerincrementalmode while running the clean manually. This is needed because, in your case, you want the cleaner to go back in time and clean dangling files that are older than the last time the cleaner was run.
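For illustration, the archival and cleaner settings could be aligned roughly like this (the numbers are examples only, derived from your stated commit frequency of ~13 commits per hour; tune them to your actual workload):

    // Rough example only: ~13 commits/hour (12 delta + 1 compaction) * 48 hours ≈ 624 commits
    // in the 2-day retention window, so archival should keep more than that in the active timeline.
    val cleanerAndArchivalOptions = Map(
      "hoodie.cleaner.policy"           -> "KEEP_LATEST_BY_HOURS",
      "hoodie.cleaner.hours.retained"   -> "48",
      "hoodie.keep.min.commits"         -> "650",   // example: > commits accumulated in 48 hours
      "hoodie.keep.max.commits"         -> "700",   // example: must stay above hoodie.keep.min.commits
      // for the one-off manual clean of the old dangling files:
      "hoodie.cleaner.incremental.mode" -> "false"
    )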

koochiswathiTR commented 1 year ago

Hi @umehrot2,

Below are our cleanup config changes.
We process a batch every 5 minutes.
5 minute ingestion – 12 delta commits per hour and 288 (12*24) delta commits per day
Compaction runs every hour – 24 commits per day
Total commits per day = (delta commits + compaction commits) = 312 commits
We configured to retain 3 days of commits: 312 * 3 = 936 commits
Minimum commits to keep is set to 937 (936 + 1)
Maximum commits to keep is set to 960 (936 + 24)

    HoodieCompactionConfig.CLEANER_POLICY.key() -> HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name(),
    HoodieCompactionConfig.CLEANER_COMMITS_RETAINED.key() -> "936",
    HoodieCompactionConfig.MIN_COMMITS_TO_KEEP.key() -> "937",  // CLEANER_COMMITS_RETAINED + 1
    HoodieCompactionConfig.MAX_COMMITS_TO_KEEP.key() -> "960",  // CLEANER_COMMITS_RETAINED + 24

Please let us know your thoughts on this.

umehrot2 commented 1 year ago

@koochiswathiTR yes, the configs seem fine to me. Let us know if it helped.

ad1happy2go commented 1 year ago

@umehrot2 @koochiswathiTR were you able to get this resolved with those configs? Please let us know in case you need any other help on this.

victorxiang30 commented 1 year ago

(quoting @umehrot2's comment above)

Hi, what should I do if my cleaning runs out of memory (OOM) after disabling incremental mode?