Closed: chenbodeng719 closed this issue 4 months ago.
@chenbodeng719 Can you please let us know which Hudi/Flink/Spark versions you are using? Are you getting duplicate rows only when reading with Spark, or do you see the same behaviour when you read the table back with Flink too?
I didn't try it on Flink. The problem happens when I use Spark.
@chenbodeng719 Can you post a screenshot of the duplicate records? Do they belong to different file groups?
Is it possible that if I bulk insert a dataset with some duplicate keys, then any following upsert whose key matches one of those duplicated keys would update the item twice? Like in the photo below.
Did you use 0.14.1 only, or is this a table upgraded from a previous version? Can you also provide the values of the Hudi meta columns?
bulk_insert itself can ingest duplicates. Did you get duplicates after the bulk_insert itself? If that's the case, then yes, upsert is going to update both records. Did you confirm that you had these duplicates right after the bulk_insert?
Running bulk_insert twice on the same data can also cause this issue.
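To make the distinction concrete, here is a small pure-Python toy model of the two write paths (illustrative only, not Hudi's actual implementation; field names like `key` and `val` are assumptions): bulk_insert appends rows without deduplicating on the record key, while upsert updates every existing row that shares the record key.

```python
# Toy model of Hudi write semantics (illustrative sketch, not real Hudi code).

def bulk_insert(table, rows):
    # bulk_insert skips the index lookup/dedup step: rows are appended
    # as-is, so duplicate record keys can enter the table.
    table.extend(dict(r) for r in rows)

def upsert(table, rows):
    # upsert matches on the record key; if duplicates already exist in
    # the table, every row with that key gets updated.
    for new in rows:
        matched = False
        for row in table:
            if row["key"] == new["key"]:
                row.update(new)
                matched = True
        if not matched:
            table.append(dict(new))

table = []
bulk_insert(table, [{"key": "a", "val": 1}, {"key": "a", "val": 1}])  # dup key ingested
upsert(table, [{"key": "a", "val": 2}])
print(table)  # both rows with key "a" now have val == 2
```

This mirrors the behaviour described above: once bulk_insert has let two rows with the same key in, a later upsert on that key touches both of them.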
"If that's the case, upsert is going to update both records." I guess that's my case. First, bulk insert brought some duplicate keys into the table. Then, when an upsert with one of those duplicated keys came, it updated the duplicate rows sharing that key. In my case, both rows for one duplicated key were changed. I wonder: if there were five rows for one duplicated key, would it update all five rows?
Yes, that's correct. You should remove the duplicates after inserting with bulk_insert, or not use bulk_insert at all in this case.
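One way to do that dedup step is to keep a single row per record key, e.g. the one with the latest ordering (precombine) value. Here is a minimal pure-Python sketch of that idea (field names `key` and `ts` are assumptions; in Spark you would do the equivalent with a window over `_hoodie_record_key` ordered by your precombine field):

```python
def dedup_latest(rows, key_field="key", order_field="ts"):
    # Keep one row per key: the one with the greatest ordering value,
    # mirroring Hudi's precombine semantics.
    best = {}
    for row in rows:
        k = row[key_field]
        if k not in best or row[order_field] > best[k][order_field]:
            best[k] = row
    return list(best.values())

rows = [
    {"key": "a", "ts": 1, "val": "old"},
    {"key": "a", "ts": 2, "val": "new"},
    {"key": "b", "ts": 1, "val": "x"},
]
print(dedup_latest(rows))  # one row per key; key "a" keeps val "new"
```

After a dedup pass like this, subsequent upserts match exactly one row per key.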
Describe the problem you faced
To Reproduce
Steps to reproduce the behavior:
1. I have duplicate rows in my table. Below is my Flink Hudi config. Data is consumed from Kafka and upserted into the hudi_sink table. When I then read the table with PySpark, I get duplicate rows.
Expected behavior
Environment Description
Hudi version : 0.14.1
Spark version : 3.3.0
Flink version : 1.16.0
Hive version :
Hadoop version : 3.3.3
Storage (HDFS/S3/GCS..) : s3
Running on Docker? (yes/no) :