apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[HUDI-8511] Fix the bug where the Flink table config hoodie.populate.meta.fields is not effective and optimize write performance #12248

Open usberkeley opened 1 week ago

usberkeley commented 1 week ago

Change Logs

1. Fix the bug where the Flink table config hoodie.populate.meta.fields is not effective

2. Optimize write performance

Impact

Improves write performance. After the optimization, the write with hoodie.populate.meta.fields=false takes 42.9% less time than the write with hoodie.populate.meta.fields=true.

Testing method: consume from the earliest position in Kafka until all messages are consumed (Kafka lag = 0), and compare the time taken in the two cases.

1) With meta fields populated, time taken: 21 hours 25 minutes

2) Without meta fields, time taken: 12 hours 14 minutes
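
As a quick check on the figures above, the two durations can be compared directly. This is only a sketch using the two times reported in this PR; the variable names are illustrative, and the throughput ratio assumes both runs consumed the same number of messages.

```python
from datetime import timedelta

# Durations reported in the test above.
with_meta = timedelta(hours=21, minutes=25)
without_meta = timedelta(hours=12, minutes=14)

# Relative reduction in elapsed time when meta fields are disabled.
time_saved = (with_meta - without_meta) / with_meta

# Equivalent throughput ratio, assuming the same message count in both runs.
speedup = with_meta / without_meta

print(f"time reduction: {time_saved:.1%}")    # time reduction: 42.9%
print(f"throughput ratio: {speedup:.2f}x")    # throughput ratio: 1.75x
```

The 42.9% figure is therefore a reduction in elapsed time; expressed as throughput, the run without meta fields is roughly 1.75 times as fast.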

Risk level (write none, low, medium or high below)

medium

Documentation Update

none


usberkeley commented 1 week ago

Info Summary

1. In #11028, we already fixed the unnecessary rewrite when the schemas are exactly the same. Is your benchmark based on that fix?

Yes

2. Why is allowOperationMetadataField allowed only when populateMetaFields is enabled?

1) Disabling populateMetaFields reduces the performance overhead of decoding HoodieRecords. However, if allowOperationMetadataField is enabled, decoding performance is still affected even when populateMetaFields is disabled. The performance impact of the two settings is therefore interconnected.

2) Both are metadata fields. populateMetaFields is the main switch, while allowOperationMetadataField only controls the activation of a specific metadata field. When the main switch is off, the sub-switches should have no effect.
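
The main-switch and sub-switch relationship described above can be sketched as follows. This is a minimal illustration, not Hudi's actual code: `WriteConfig` and `operation_field_enabled` are hypothetical names standing in for the two settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteConfig:
    # Hypothetical stand-ins for populateMetaFields and
    # allowOperationMetadataField; defaults are chosen for illustration only.
    populate_meta_fields: bool = True
    allow_operation_metadata_field: bool = False


def operation_field_enabled(cfg: WriteConfig) -> bool:
    """The sub-switch takes effect only when the main switch is on."""
    return cfg.populate_meta_fields and cfg.allow_operation_metadata_field


# With the main switch off, the sub-switch is ignored.
cfg = WriteConfig(populate_meta_fields=False, allow_operation_metadata_field=True)
print(operation_field_enabled(cfg))  # False
```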

3. When populateMetaFields is enabled, why must the number of record key fields be equal to one?

The Log Scanner needs to regenerate the record key. Currently, it supports only a simple key generator, which means there can be only one primary key column.
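
To illustrate why a simple key generator implies a single key column, here is a rough sketch of regenerating a key from one field. It is an assumption-laden illustration, not the Log Scanner's implementation: `regenerate_record_key` and the sample row are hypothetical.

```python
def regenerate_record_key(row: dict, record_key_fields: list) -> str:
    """Rebuild a record key from a row using exactly one key column.

    Hypothetical helper: a simple key generator reads a single column,
    so any other number of key fields is rejected up front.
    """
    if len(record_key_fields) != 1:
        raise ValueError(
            f"expected exactly one record key field, got {len(record_key_fields)}"
        )
    return str(row[record_key_fields[0]])


row = {"id": 42, "name": "a", "ts": 1700000000}
print(regenerate_record_key(row, ["id"]))  # 42
```

Passing two fields, for example `["id", "name"]`, raises `ValueError` in this sketch, which mirrors the single-column restriction described above.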
