apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] High number of duplicated records for certain commits #11989

Open tped17 opened 1 week ago

tped17 commented 1 week ago


Describe the problem you faced

We noticed an issue with two of our datasets in which we have multiple rows with the same _hoodie_record_key, _hoodie_commit_time, and _hoodie_commit_seqno within the same file. Unfortunately, all of the problematic commits have been archived. Below is an example of the duplicate records (I've redacted the exact record key, but they are all the same); each sequence number is repeated 64 times.

+------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
|_hoodie_record_key|_hoodie_commit_time|_hoodie_file_name                                                              |_hoodie_commit_seqno        |count|
+------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360995|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360996|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360993|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360994|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360994|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360995|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360996|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360993|64   |
+------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
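
For reference, a grouping query of roughly this shape (a sketch for spark-shell; the base path below is a placeholder, not the reporter's actual location) produces output like the table above and can be used to surface such duplicates:

```scala
// Minimal sketch: count rows per (record key, commit time, file name, seqno).
// Any count > 1 is suspect, since a commit seqno is expected to identify a single row.
import org.apache.spark.sql.functions._

val df = spark.read.format("hudi").load("s3://my-bucket/my_table_name/") // placeholder path

df.groupBy("_hoodie_record_key", "_hoodie_commit_time", "_hoodie_file_name", "_hoodie_commit_seqno")
  .count()
  .filter(col("count") > 1)
  .orderBy(col("count").desc)
  .show(false)
```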

Here's the config we use:

hoodie.parquet.small.file.limit -> 104857600
hoodie.datasource.write.precombine.field -> eventVersion
hoodie.datasource.write.payload.class -> org.apache.hudi.common.model.EmptyHoodieRecordPayload
hoodie.bloom.index.filter.dynamic.max.entries -> 1106137
hoodie.cleaner.fileversions.retained -> 2
hoodie.parquet.max.file.size -> 134217728
hoodie.cleaner.parallelism -> 1500
hoodie.write.lock.client.num_retries -> 10
hoodie.delete.shuffle.parallelism -> 1500
hoodie.bloom.index.prune.by.ranges -> true
hoodie.metadata.enable -> false
hoodie.clean.automatic -> false
hoodie.datasource.write.operation -> upsert
hoodie.write.lock.wait_time_ms -> 600000
hoodie.metrics.reporter.type -> CLOUDWATCH
hoodie.datasource.write.recordkey.field -> timestamp,eventId,subType,trackedItem
hoodie.table.name -> my_table_name
hoodie.datasource.write.table.type -> COPY_ON_WRITE
hoodie.datasource.write.hive_style_partitioning -> true
hoodie.datasource.write.partitions.to.delete -> 
hoodie.write.lock.dynamodb.partition_key -> my_table_name_key
hoodie.cleaner.policy -> KEEP_LATEST_FILE_VERSIONS
hoodie.write.markers.type -> DIRECT
hoodie.metrics.on -> false
hoodie.datasource.write.reconcile.schema -> true
hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.cleaner.policy.failed.writes -> LAZY
hoodie.upsert.shuffle.parallelism -> 1500
hoodie.write.lock.dynamodb.table -> HoodieLockTable
hoodie.write.lock.provider -> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.datasource.write.partitionpath.field -> region,year,month,day,hour
hoodie.bloom.index.filter.type -> DYNAMIC_V0
hoodie.write.lock.wait_time_ms_between_retry -> 30000
hoodie.write.concurrency.mode -> optimistic_concurrency_control
hoodie.write.lock.dynamodb.region -> us-east-1
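
For context, options like the above are typically supplied to the Hudi Spark datasource writer as a map. A minimal sketch follows; the input DataFrame and S3 paths are placeholders, and only a subset of the options is repeated here:

```scala
// Minimal sketch of an upsert write using a subset of the options listed above.
// `inputDf` and the S3 paths are placeholders, not the reporter's actual job.
import org.apache.spark.sql.SaveMode

val hudiOptions = Map(
  "hoodie.table.name"                           -> "my_table_name",
  "hoodie.datasource.write.operation"           -> "upsert",
  "hoodie.datasource.write.table.type"          -> "COPY_ON_WRITE",
  "hoodie.datasource.write.recordkey.field"     -> "timestamp,eventId,subType,trackedItem",
  "hoodie.datasource.write.partitionpath.field" -> "region,year,month,day,hour",
  "hoodie.datasource.write.precombine.field"    -> "eventVersion",
  "hoodie.datasource.write.keygenerator.class"  -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.write.concurrency.mode"               -> "optimistic_concurrency_control",
  "hoodie.write.lock.provider"                  -> "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider"
  // ... remaining lock / cleaner / index options from the list above
)

val inputDf = spark.read.parquet("s3://my-bucket/staging/") // placeholder input

inputDf.write
  .format("hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save("s3://my-bucket/my_table_name/")
```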

To Reproduce

We have not been able to reproduce this intentionally. This only happens occasionally in our dataset and it does not seem to follow any pattern that we've been able to discern.

Expected behavior

It is my understanding that a commit sequence number should identify a single record, so we should not be seeing the same _hoodie_commit_seqno repeated 64 times.

Environment Description

Additional context

For the datasets in which we found the issue, we run cleaning and clustering manually, and I noticed that our lock keys were incorrectly configured on the cleaning/clustering jobs, so it is possible that we were running cleaning or clustering at the same time as data ingestion or deletion. Please let me know if you need any more info, thank you!
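
As a side note on that lock-key remark: under optimistic concurrency control, a separately launched cleaning or clustering job only coordinates with ingestion if it acquires the same lock, so the lock-related options have to match across every job touching the table. Below is a minimal sketch of that shared block, mirroring the values listed above; it illustrates the configuration requirement only and is not a confirmed root cause:

```scala
// Lock options that should be identical on the ingestion job and on the manually
// launched cleaning/clustering jobs, so they contend for the same DynamoDB lock row.
// Values mirror the config listed above; a mismatched partition_key would let the
// jobs run concurrently without actually guarding each other.
val sharedLockOptions = Map(
  "hoodie.write.concurrency.mode"            -> "optimistic_concurrency_control",
  "hoodie.write.lock.provider"               -> "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
  "hoodie.write.lock.dynamodb.table"         -> "HoodieLockTable",
  "hoodie.write.lock.dynamodb.partition_key" -> "my_table_name_key",
  "hoodie.write.lock.dynamodb.region"        -> "us-east-1",
  "hoodie.cleaner.policy.failed.writes"      -> "LAZY"
)
```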

danny0405 commented 6 days ago

Do these duplicates come from different partitions?

tped17 commented 6 days ago

No, these are all in the same partition

danny0405 commented 6 days ago

Do you have any bulk_insert operations on the table?

KnightChess commented 6 days ago

Are the _hoodie_file_name and _hoodie_commit_seqno values real production data? The partition id in the file name's write token looks different from the partition id in the seqno.
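
For readers following along, the observation refers to the tokens embedded in the two values. Assuming the usual Hudi naming (base file `<fileId>_<writeToken>_<instantTime>.parquet` with write token `<taskPartitionId>-<stageId>-<attemptId>`, and seqno `<instantTime>_<taskPartitionId>_<rowIndex>`), a rough parse of the values quoted above looks like this:

```scala
// Rough parse of the values quoted in the table above, under the naming assumptions
// stated in the lead-in. Plain string splitting; no Hudi classes involved.
val fileName = "7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet"
val seqno    = "20240515220256697_77_2360995"

val fileTokens       = fileName.stripSuffix(".parquet").split("_")
val writeToken       = fileTokens(fileTokens.length - 2)   // "1-7816-2840501"
val fileInstant      = fileTokens.last                     // "20240912033629311"
val tokenPartitionId = writeToken.split("-")(0)            // "1"

val seqTokens      = seqno.split("_")
val seqInstant     = seqTokens(0)                          // "20240515220256697"
val seqPartitionId = seqTokens(1)                          // "77"

println(s"file instant=$fileInstant, file task partition=$tokenPartitionId")
println(s"seqno instant=$seqInstant, seqno task partition=$seqPartitionId")
```

Under those assumptions, the file was written by task partition 1 at instant 20240912033629311, while the seqnos claim task partitions 77/78 at instant 20240515220256697, which is presumably the mismatch being pointed at here.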

tped17 commented 6 days ago

We do not use any bulk_insert operations; everything should be an upsert. Yes, these are actual file names and sequence numbers.

ad1happy2go commented 4 days ago

@tped17 Is it possible to zip the .hoodie directory (without the metadata partitions) and attach it to the ticket? If not, can you provide the Hudi timeline?
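
If sharing the full directory is not possible, one low-tech way to capture the timeline is to list the files under `.hoodie` (a sketch using the Hadoop FileSystem API from spark-shell; the base path is a placeholder):

```scala
// List the timeline files under .hoodie (commit/clean/replacecommit instants and so on),
// skipping subdirectories such as the metadata folder. The path is a placeholder.
import org.apache.hadoop.fs.Path

val metaPath = new Path("s3://my-bucket/my_table_name/.hoodie")
val fs       = metaPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

fs.listStatus(metaPath)
  .filter(_.isFile)
  .map(s => (s.getPath.getName, s.getModificationTime))
  .sortBy(_._1)
  .foreach { case (name, mtime) => println(f"$name%-60s $mtime") }
```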