apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

duplicated records when use insert overwrite #11358

Open njalan opened 5 months ago

njalan commented 5 months ago

There are multiple commit times in the Hudi table, and duplicated records exist after using insert overwrite into the target table. The source query joins about 10 tables.

Environment Description

ad1happy2go commented 5 months ago

@njalan Are you using multiple writers? Can you come up with a reproducible script? Note that you are using a very old Hudi version.

ad1happy2go commented 5 months ago

@njalan Also, as I understand it, the data you are writing is the output of a join of 10 tables. So when you do insert_overwrite, does that source DataFrame contain duplicates?

njalan commented 4 months ago

@ad1happy2go I don't think I am using multiple writers. Is there a parameter that enables multi-writer mode? We checked, and there are duplicate records after the overwrite. In my understanding there should be only one commit time in the final table when I use insert_overwrite. Why do I see multiple commit times in the final table, including one commit time from the target table before this overwrite?

ad1happy2go commented 4 months ago

@njalan If the data you are inserting has duplicates, then insert overwrite will write those duplicates into the table.

Can you please share the timeline with us so we can look further?
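A common way to guard against this, assuming the duplicates come from the 10-table join in the source query, is to deduplicate the source by record key before the overwrite, keeping only the latest row per key (the same idea as Hudi's precombine field). Below is a minimal pure-Python sketch of that keep-latest-per-key logic; the field names `key` and `ts` are hypothetical stand-ins for your record key and precombine/ordering column:

```python
# Sketch: collapse duplicate record keys before insert_overwrite,
# keeping the row with the largest ordering value (here "ts").
# Field names "key", "ts", and "val" are illustrative only.

def dedup_latest(rows):
    """Return one row per key: the row with the highest ts wins."""
    latest = {}
    for row in rows:
        k = row["key"]
        if k not in latest or row["ts"] > latest[k]["ts"]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"key": "a", "ts": 1, "val": "old"},
    {"key": "a", "ts": 2, "val": "new"},   # duplicate key; newer ts wins
    {"key": "b", "ts": 1, "val": "only"},
]
deduped = dedup_latest(rows)
```

In Spark this roughly corresponds to filtering on `row_number() over (partition by key order by ts desc) = 1` before writing, so that the DataFrame handed to insert_overwrite is already duplicate-free.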