apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Slow upsert performance for Flink upsert #12046

Open dheemanthgowda opened 1 month ago

dheemanthgowda commented 1 month ago

Describe the problem you faced

We were experiencing slow upsert performance when using Hudi with Flink SQL on AWS S3. We tried enabling the metadata table, which improved update speed, but the cleaner is not triggering even after 3 commits.

To Reproduce

Steps to reproduce the behavior:

Configure Hudi with the following settings for upserting data via Flink SQL (a sketch of how these attach to a table definition follows the list):

'connector' = 'hudi'
'write.operation' = 'upsert'
'write.tasks' = '800'
'table.type' = 'MERGE_ON_READ'
'index.type' = 'BUCKET'
'hoodie.bucket.index.num.buckets' = '10'
'hoodie.index.bucket.engine' = 'SIMPLE'
'hoodie.clean.automatic' = 'true'
'hoodie.cleaner.parallelism' = '200'
'clean.policy' = 'KEEP_LATEST_COMMITS'
'clean.async.enabled' = 'true'
'hoodie.keep.max.commits' = '20'
'hoodie.keep.min.commits' = '6'
'clean.retain_commits' = '3'
'hoodie.datasource.write.hive_style_partitioning' = 'true'
'hoodie.parquet.compression.codec' = 'snappy'
'compaction.max_memory' = '30000'
'hoodie.write.set.null.for.missing.columns' = 'true'
'hoodie.archive.automatic' = 'false'
'hoodie.archive.async' = 'false'
'hoodie.schema.on.read.enable' = 'true'
'hoodie.fs.atomic_creation.support' = 's3a'
'compaction.async.enabled' = 'false'
'compaction.delta_commits' = '1'
'compaction.schedule.enabled' = 'true'
'compaction.trigger.strategy' = 'num_commits'
'hoodie.cleaner.incremental.mode' = 'false'
'hoodie.compaction.logfile.size.threshold' = '1'
'metadata.enabled' = 'false'
'hoodie.compaction.strategy' = 'org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy'
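
For reference, a minimal sketch of how such options are typically attached to a Hudi table definition in Flink SQL. The table name, schema, primary key, and S3 path below are hypothetical placeholders, and only a subset of the options above is repeated:

CREATE TABLE hudi_events (
  id   BIGINT,
  name STRING,
  ts   TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED      -- record key used for upserts
) WITH (
  'connector' = 'hudi',
  'path' = 's3a://<bucket>/<table-path>',   -- placeholder path
  'table.type' = 'MERGE_ON_READ',
  'write.operation' = 'upsert',
  'index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '10',
  'hoodie.clean.automatic' = 'true',
  'clean.policy' = 'KEEP_LATEST_COMMITS',
  'clean.retain_commits' = '3'
);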

Run a batch job to perform upserts. Monitor the logs for cleaning operations.

Expected behavior

We expect the cleaner to trigger and remove older commits as per the defined configuration.

Environment Description

Hudi version: 1.14.1
Flink version: 1.17.1
Storage (HDFS/S3/GCS..): S3
Running on Docker? (yes/no): running on K8s

Additional context

After setting metadata.enabled to true, we observed a notable improvement in upsert speed. However, the cleaner does not seem to be functioning as expected. Are we missing any configs?

2024-09-15 14:48:38,399 WARN  org.apache.hudi.config.HoodieWriteConfig                     [] - Increase hoodie.keep.min.commits=6 to be greater than hoodie.cleaner.commits.retained=20 (there is risk of incremental pull missing data from few instants based on the current configuration). The Hudi archiver will automatically adjust the configuration regardless.
2024-09-15 14:48:38,909 INFO  org.apache.hudi.metadata.HoodieBackedTableMetadataWriter     [] - Latest deltacommit time found is 20240915143952010, running clean operations.
2024-09-15 14:48:39,153 INFO  org.apache.hudi.client.BaseHoodieWriteClient                 [] - Scheduling cleaning at instant time :20240915143952010002
2024-09-15 14:48:39,160 INFO  org.apache.hudi.table.action.clean.CleanPlanner              [] - No earliest commit to retain. No need to scan partitions !!
2024-09-15 14:48:39,160 INFO  org.apache.hudi.table.action.clean.CleanPlanActionExecutor   [] - Nothing to clean here. It is already clean
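
To illustrate the constraint the WARN line above is describing: the archival window is expected to start strictly after the cleaner retention window, i.e. hoodie.keep.min.commits should be greater than the retained-commit count (which clean.retain_commits maps to on the Flink side). A hedged sketch of mutually consistent values, placeholders rather than a recommendation:

'clean.retain_commits' = '3'        -- cleaner keeps the last 3 commits
'hoodie.keep.min.commits' = '6'     -- archiver retains at least 6, i.e. more than the cleaner keeps
'hoodie.keep.max.commits' = '20'

Note also that the "No earliest commit to retain" line typically just means the timeline does not yet contain more completed commits than the retention count, so the cleaner has nothing to remove yet.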
ad1happy2go commented 1 month ago

Thanks for raising this, @dheemanthgowda. Can you also update the subject, please?

There is another issue raised earlier that also explains your problem: https://github.com/apache/hudi/issues/11436

danny0405 commented 1 month ago

@dheemanthgowda Thanks for the feedback. It looks like your table does not have partitioning fields, so each compaction would trigger a rewrite of the whole table, which is indeed costly for streaming ingestion. Did you try moving the compaction out into a separate job?
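
For context, a hedged sketch of what moving compaction out could look like on the writer side, building on the table options in this issue; the delta-commit count below is a placeholder:

'compaction.async.enabled' = 'false'    -- do not execute compaction inside the streaming job
'compaction.schedule.enabled' = 'true'  -- still schedule compaction plans
'compaction.delta_commits' = '5'        -- placeholder trigger threshold

The scheduled plans would then be executed by a standalone compaction job, for example the org.apache.hudi.sink.compact.HoodieFlinkCompactor utility shipped with the Hudi Flink bundle, assuming the Hudi version in use supports offline compaction.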

ad1happy2go commented 1 week ago

@dheemanthgowda Were you able to look into this further by using async compaction?