tped17 opened this issue 1 week ago (Open)
Do these duplicates come from different partitions?
No, these are all in the same partition
Do you have any bulk_insert operations on the table?
Are the _hoodie_file_name and _hoodie_commit_seqno values from real production data? The partitionId token in the file name looks different from the one in the seqno.
We do not use any bulk_insert operations, everything should be an upsert. Yes, these are actual file names and sequence numbers
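As background for the partitionId comparison above, the two metadata columns can be pulled apart like this. This is a sketch with hypothetical sample values; it assumes the usual Hudi 0.11.x conventions, roughly `<fileId>_<writeToken>_<instantTime>.parquet` for `_hoodie_file_name` (where the write token is `<taskPartitionId>-<stageId>-<taskAttemptId>`) and `<instantTime>_<partitionId>_<rowIndex>` for `_hoodie_commit_seqno` — verify against your own data before relying on it:

```python
def file_name_partition_id(file_name: str) -> str:
    """Extract the task partitionId embedded in the write token of a Hudi base file name."""
    write_token = file_name.split("_")[1]   # e.g. "4-10-2"
    return write_token.split("-")[0]        # taskPartitionId, e.g. "4"

def seqno_partition_id(seqno: str) -> str:
    """Extract the partitionId token from a _hoodie_commit_seqno value."""
    return seqno.split("_")[1]              # e.g. "20220101000000_7_42" -> "7"

# Hypothetical sample row illustrating a mismatch between the two tokens:
fname = "abc123-0000_4-10-2_20220101000000.parquet"
seqno = "20220101000000_7_42"
print(file_name_partition_id(fname), seqno_partition_id(seqno))  # 4 7
```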
@tped17 Is it possible to zip the .hoodie directory (without the metadata partitions) and attach it to the ticket? If not, can you provide the Hudi timeline?
Describe the problem you faced

We noticed an issue with two of our datasets: we have multiple rows with the same _hoodie_record_key, _hoodie_commit_time, and _hoodie_commit_seqno within the same file. Unfortunately, all of the problematic commits have been archived. Below is an example of the duplicate records (I've redacted the exact record key, but they are all the same); each sequence number is repeated 64 times.
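For reference, the duplicate check described above boils down to grouping on the three metadata columns and flagging counts greater than one (on a real table you would run the equivalent GROUP BY in Spark SQL over a snapshot read). A minimal sketch with hypothetical column values:

```python
from collections import Counter

# Hypothetical rows: (_hoodie_record_key, _hoodie_commit_time, _hoodie_commit_seqno)
rows = [
    ("key-1", "20220101000000", "20220101000000_7_42"),
    ("key-1", "20220101000000", "20220101000000_7_42"),  # duplicate triple
    ("key-2", "20220101000000", "20220101000000_7_43"),
]

# Any (key, commit_time, seqno) triple appearing more than once is a
# duplicate of the kind reported in this issue.
dupes = {k: n for k, n in Counter(rows).items() if n > 1}
print(dupes)
```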
Here's the config we use:
To Reproduce

We have not been able to reproduce this intentionally. It only happens occasionally in our dataset and does not seem to follow any pattern that we've been able to discern.
Expected behavior
It is my understanding that we shouldn't be seeing a large number of duplicates per sequence number.
Environment Description

- Hudi version : 0.11.1
- Spark version : 3.2.1
- Hive version : 3.1.3
- Hadoop version : 3.2.1
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no
Additional context

For the datasets in which we found the issue, we run cleaning and clustering manually. I noticed that our lock keys were incorrectly configured on the cleaning/clustering jobs, so it is possible that cleaning or clustering was running at the same time as data ingestion or deletion. Please let me know if you need any more info, thank you!
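For illustration, the shape of a correctly shared lock configuration looks roughly like the fragment below (assuming the ZooKeeper-based lock provider; property names are from Hudi's concurrency-control configs, values are placeholders). The key point is that every job writing the table — ingestion, cleaning, and clustering alike — must use the same lock key and base path, otherwise they never actually contend for the same lock:

```properties
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=zk-host
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.base_path=/hudi/locks
# Must be identical across the ingest, clean, and cluster jobs for this table.
hoodie.write.lock.zookeeper.lock_key=my_table
```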