apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/

[SUPPORT] AWS Glue 4.0 taking too long to write to S3 #10916

Open tvxer opened 6 months ago

tvxer commented 6 months ago

Describe the problem you faced

I ran a test reading 1 day's worth of data vs 30 days' worth of data from the external table to S3 in Hudi format, and the amount of time it takes is roughly the same. I have also changed the write operation from upsert to insert, and it made no difference.

The data is not that large; each day has about 1M records from the external table. In S3, each file averages about 25 MB, partitioned by date and hour. The whole process is just copying from the external table and writing to S3, so there is no transformation in between besides changing the data types.

Is there anything I can do to improve the speed?

hudi_options = {
    # Write configs
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'recordkey1, recordkey2',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.precombine.field': 'timestamp',
    'hoodie.datasource.write.partitionpath.field': 'date, hour',
    'hoodie.datasource.write.hive_style_partitioning': 'true',

    # Hive sync (Glue Data Catalog) configs
    'hoodie.datasource.hive_sync.partition_fields': 'date, hour',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',

    # Cleaning and clustering configs
    'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
    'hoodie.clustering.plan.strategy.max.bytes.per.group': '107374182400',
    'hoodie.clustering.plan.strategy.max.num.groups': '1'
}
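
For context, these options would typically be applied through the Spark DataFrame writer, roughly as in the short sketch below; df and base_path are placeholders for the source DataFrame and the S3 table path, not names taken from the post.

# Minimal write sketch (placeholder names: df, base_path)
df.write.format('hudi') \
    .options(**hudi_options) \
    .mode('append') \
    .save(base_path)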

Environment Description

Environment details were provided as a screenshot (Screenshot 2024-03-23 at 14 33 41).

CTTY commented 6 months ago

I feel like there are multiple things at play here. It could be Glue itself being slow, or the Hudi Hadoop relation needing to be refreshed after syncing. The easiest solution may be upgrading to a newer version of Hudi, which uses the AWS SDK v2 and includes other optimizations.

There is also a recent PR, #10460, that optimizes Glue sync, and we expect it to land in the upcoming 0.15.0 release.
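
A hedged sketch, not from the thread: before that release is available, one common way to try a newer Hudi version on Glue 4.0 is to skip the built-in Hudi enabled by the --datalake-formats job parameter and instead point the job at a newer bundle jar uploaded to S3. The jar path below is a placeholder.

# Hypothetical Glue job parameters (set on the job definition, not in PySpark code)
#   --extra-jars  s3://my-bucket/jars/hudi-spark3.3-bundle_2.12-0.14.1.jar   # placeholder path
#   --conf        spark.serializer=org.apache.spark.serializer.KryoSerializer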

tvxer commented 6 months ago

I had to include 'hoodie.clean.automatic': 'false', and it reduced the process from 50 minutes to 10 minutes. Each day when I am writing to a new partition, the cleaning process in Apache Hudi was taking a substantial amount of time, likely because it was working across all partition dates by default. How do I set the options to only clean the specific dates I am writing to?
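
For reference, the change described above as it would look against the hudi_options dict from the original post:

# Disable inline (automatic) cleaning after each write
hudi_options['hoodie.clean.automatic'] = 'false'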

CTTY commented 6 months ago

It sounds like cleaning based on file versions would work best for you. If these partitions don't have file versions that are too old, they will be skipped during cleaning: https://github.com/apache/hudi/blob/a5978cd2308f0f2e501e12040f1fafae8afb86e9/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java#L92

You'll need to set "hoodie.cleaner.policy" to KEEP_LATEST_FILE_VERSIONS to use this.
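
Putting that suggestion together, a hedged sketch of the cleaner-related options; re-enabling automatic cleaning and the retained-version count are illustrative assumptions, not values given in the thread.

# Clean by file version so partitions whose files are all recent are skipped
hudi_options.update({
    'hoodie.clean.automatic': 'true',                       # assumption: turn inline cleaning back on
    'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
    'hoodie.cleaner.fileversions.retained': '2',            # illustrative value (Hudi's default is 3)
})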