apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

Duplicate Row in Same Partition using Global Bloom Index #9536

Closed Raghvendradubey closed 3 months ago

Raghvendradubey commented 10 months ago

Hi Team,

I am facing an issue of duplicate record keys while upserting data into Hudi on EMR.

Hudi Jar - hudi-spark3.1.2-bundle_2.12-0.10.1.jar

EMR Version - emr-6.5.0

Workflow - files on S3 -> EMR(hudi) -> Hudi Tables(S3)

Schedule - once in a day

Insert Data Size - 5 to 10 MB per batch

Hudi Configuration for Upsert -

```python
hudi_options = {
    'hoodie.table.name': "txn_table",
    'hoodie.datasource.write.recordkey.field': "transaction_id",
    'hoodie.datasource.write.partitionpath.field': 'billing_date',
    'hoodie.datasource.write.table.name': "txn_table",
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'transaction_id',
    'hoodie.index.type': "GLOBAL_BLOOM",
    'hoodie.bloom.index.update.partition.path': "true",
    'hoodie.upsert.shuffle.parallelism': 10,
    'hoodie.insert.shuffle.parallelism': 10,
    'hoodie.datasource.hive_sync.database': "dwh",
    'hoodie.datasource.hive_sync.table': "txn_table",
    'hoodie.datasource.hive_sync.partition_fields': "billing_date",
    'hoodie.datasource.write.hive_style_partitioning': "true",
    'hoodie.datasource.hive_sync.enable': "true",
    'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': "true",
    'hoodie.datasource.hive_sync.support_timestamp': "true",
    'hoodie.metadata.enable': "true"
}
```
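For context, a minimal sketch of the kind of PySpark upsert call such a configuration would feed into; the DataFrame `df` and the target S3 path are placeholders, not taken from the job above:

```python
# Hypothetical sketch of the daily upsert write using the options above.
# `df` and the S3 target path are assumptions, not from the original report.
(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://your-bucket/dwh/txn_table"))
```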

Issue Occurrence - The job has been running in production for around a month, but this issue has been seen for the first time. Even when I tried to reproduce the issue with the same dataset, it was not reproducible; the records updated successfully.

Issue Steps -

1 - There is a batch of data which we first insert into txn_table; transaction_id (defined as the record key) is unique throughout the partition.
2 - The next day, on an update of that record key, a new row is created with the same record key in the same partition, with the updated value.
3 - Both duplicate rows can be read, but when I try to update, only the latest row gets updated.
4 - On checking the parquet files, the duplicate record with the updated value was present in a different file within the same partition.
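For readers hitting something similar, a hedged snippet to surface duplicate record keys per partition via the standard Hudi metadata columns; the table path is a placeholder:

```python
# Sketch: count rows per record key and partition to spot duplicates in a Hudi table.
# The table path is a placeholder; the columns are the standard Hudi meta columns.
duplicates = (spark.read.format("hudi")
    .load("s3://your-bucket/dwh/txn_table")
    .groupBy("_hoodie_record_key", "_hoodie_partition_path")
    .count()
    .filter("count > 1"))
duplicates.show(truncate=False)
```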

Steps to Reproduce -

The issue is not reproducible; even when the same dataset was ingested again with the same configuration, the upsert was fine.

Please let me know if I am missing some configuration.

Thanks, Raghvendra

ad1happy2go commented 10 months ago

@Raghvendradubey Can you share the table properties? Is it COW or MOR? I noticed you turned on the flag 'hoodie.bloom.index.update.partition.path'. Did the partition value get updated for the duplicate record you are noticing? If yes, did you apply the same behaviour when you tried to reproduce?

ad1happy2go commented 10 months ago

@Raghvendradubey Also, I noticed the metadata table is enabled and you are using 0.10.1. You may also want to upgrade the Hudi version to 0.12.3 or 0.13.1.

Raghvendradubey commented 10 months ago

@ad1happy2go It's COW. The partition value was not updated because I was updating a record key within the same partition, and it resulted in 2 rows with the same record key in the same partition.

ad1happy2go commented 10 months ago

@Raghvendradubey So if I understood it correctly, you got this issue when it tried to update the partition path? That may be the root cause. Did you try the same thing when you tried to reproduce with the small dataset?

Raghvendradubey commented 10 months ago

@ad1happy2go "You got this issue when it tried to update partition path" - Yes, "Did you tried the similar thing when you tried to reproduce with small data" yes when the same thing tried to reproduce the issue with same source data then it worked fine. It's been around a month with this hudi configuration but this issue has been seen first time. rest of the day it worked fine.

voonhous commented 10 months ago

FWIU, this is a sporadic thing that OP is not able to reproduce anymore.

Might be related to this issue: https://github.com/apache/hudi/pull/9035

One way to determine if it is caused by this issue is:

  1. Identify the 2 parquet files that the 2 duplicate records are situated in.
  2. If it is caused by the issue linked above, the commit time should be the same (assuming a COW table); see the query sketch below.
  3. If it is this issue and you are still able to access your Spark tracking URL, you can probably look at the timing of the stages and see whether a zombie executor/task was not killed after reconcileAgainstMarker was called.
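A hedged way to pull the commit time and file name for both copies of a record key, so the check in step 2 can be done directly from Spark; the table path and record key value are placeholders:

```python
# Sketch: inspect the Hudi meta columns for one record key to compare the commit
# times and the parquet files holding each duplicate copy.
# The table path and the record key value are placeholders.
(spark.read.format("hudi")
    .load("s3://your-bucket/dwh/txn_table")
    .filter("_hoodie_record_key = '<duplicate-record-key>'")
    .select("_hoodie_commit_time", "_hoodie_file_name", "_hoodie_partition_path")
    .show(truncate=False))
```
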
Raghvendradubey commented 10 months ago

@voonhous I saw this issue again in another dataset, which uses the default Bloom index, and it is again the same issue. I verified your steps:

There are two files, but the _hoodie_commit_time is different for the duplicate record. Also, I did not find any issue in the error log file for that specific time, and all the tasks executed successfully when the duplicate records were written.

ad1happy2go commented 9 months ago

@Raghvendradubey Did you not see any task failures in the Spark UI either, as pointed out by @voonhous?

Raghvendradubey commented 9 months ago

@ad1happy2go No failed tasks. I verified all tasks for all the stages; nothing failed or was reattempted.

Raghvendradubey commented 9 months ago

These are the metadata fields of the duplicate record:

| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name |
|---|---|---|---|---|
| 20230905093840399 | 20230905093840399_288_11214 | nomupay_transaction_id:NP-b62e25f04e29205777612835243 | processor_name=PLANET/oas_stamp=2023-08-01 19:40:00.0 | 6a77386e-4a50-4648-916d-568d72f349e1-0_288-60-2990_20230905093840399.parquet |
| 20230801210728594 | 20230801210728594_0_5 | nomupay_transaction_id:NP-b62e25f04e29205777612835243 | processor_name=PLANET/oas_stamp=2023-08-01 19:40:00.0 | 55d3f136-4c9e-47c2-8797-5c7bc0d0163a-0_0-33-1641_20230801210728594.parquet |

Raghvendradubey commented 9 months ago

Can somebody help here to identify the issue?

ad1happy2go commented 9 months ago

@Raghvendradubey I worked on this but also couldn't reproduce it on my end. I am trying with a bigger dataset. It's difficult to identify, as the code is not failing and we are also not seeing any task failures/reattempts. Will update you soon. Thanks.

ad1happy2go commented 5 months ago

@Raghvendradubey This was a bug which got fixed; see HUDI-6946 - https://issues.apache.org/jira/browse/HUDI-6946

Please try with 0.14.1 and let us know in case you still face the issue. Thanks.
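For reference, a hedged sketch of pointing an existing PySpark job at a newer Hudi bundle instead of the one shipped with the EMR release; the jar path and bundle name are placeholders and should match your Spark version:

```python
from pyspark.sql import SparkSession

# Sketch: start the session against an explicitly supplied Hudi 0.14.1 bundle.
# The jar location is a placeholder; stage the bundle that matches your Spark version.
spark = (SparkSession.builder
    .appName("txn_table_upsert")
    .config("spark.jars", "s3://your-bucket/jars/hudi-spark3.1-bundle_2.12-0.14.1.jar")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())
```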

ad1happy2go commented 5 months ago

@Raghvendradubey Did you get a chance to try this one? Do you still see this issue?

nsivabalan commented 3 months ago

Hey @Raghvendradubey: any follow-ups on this?

Raghvendradubey commented 3 months ago

Hi @ad1happy2go @nsivabalan, after migrating to the new Hudi version 0.14.0 I didn't face this issue again. Thanks for your support.

ad1happy2go commented 3 months ago

Great! Thanks @Raghvendradubey. Closing this issue.