apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Spark3.2 encountered duplicate data while reading the hudi bucket MOR table #9244

Open fujianhua168 opened 1 year ago

fujianhua168 commented 1 year ago

Describe the problem you faced

A few days ago in our production environment, a datanode in the Hadoop cluster went down, which caused the Flink streaming write task (for a Hudi bucket MOR table) to fail. After restarting the Flink task, when we read the table with Spark 3.2 or Presto 333, we found duplicate records under the same primary key, and the duplicated records have identical values in the Hudi metadata fields (_hoodie_commit_time, _hoodie_commit_seqno, _hoodie_file_name).

Note: the Flink write task had been running normally for several days; there were no duplicate records before the datanode went down.

(screenshots: query output showing duplicate rows for the same primary key with identical Hudi metadata field values)
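For reference, a scan along the following lines is one way to confirm the symptom from spark-shell. This is a minimal sketch, not taken from the report: the table base path is a placeholder, and it relies only on the standard Hudi metadata columns.

```scala
// Minimal spark-shell sketch to surface duplicated record keys and inspect
// their Hudi metadata fields. The base path below is a placeholder.
import org.apache.spark.sql.functions._

val df = spark.read.format("hudi").load("hdfs:///path/to/hudi_table")

// Keys that appear more than once in the snapshot read.
val dupKeys = df.groupBy("_hoodie_record_key").count().filter(col("count") > 1)

// Show the metadata fields of the duplicated rows; in the reported case these
// values are identical across the duplicates.
df.join(dupKeys.select("_hoodie_record_key"), Seq("_hoodie_record_key"))
  .select("_hoodie_record_key", "_hoodie_commit_time",
          "_hoodie_commit_seqno", "_hoodie_file_name")
  .show(false)
```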

Environment Description

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.

ad1happy2go commented 1 year ago

@fujianhua168 Could you share the table's timeline from around the time the datanode went down? That would help us triage this better.
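For anyone gathering what is being asked for here: the timeline is the set of instant files under the table's `.hoodie` directory, and a listing of that directory (sorted by instant time) is usually enough to see which commits or deltacommits were in flight when the datanode went down. A minimal spark-shell sketch, with the base path as a placeholder:

```scala
// List the timeline instant files under <table base path>/.hoodie using the
// Hadoop FileSystem API. The path below is a placeholder, not the actual table.
import org.apache.hadoop.fs.Path

val timelinePath = new Path("hdfs:///path/to/hudi_table/.hoodie")
val fs = timelinePath.getFileSystem(spark.sparkContext.hadoopConfiguration)

fs.listStatus(timelinePath)
  .filter(_.isFile)
  .map(s => (s.getPath.getName, s.getModificationTime))
  .sortBy(_._1)                                   // instant files are named by instant time
  .foreach { case (name, mtime) => println(s"$name\t$mtime") }
```

The same information can also be pulled with hudi-cli if that is more convenient.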