Closed dongtingting closed 1 month ago
@danny0405 @beyond1920 can you help me to confirm?
@KnightChess maybe you can give some insights here, also cc @nsivabalan for visibility.
@dongtingting Good catch. Thanks for filing this issue.
It seems the writestatus RDD is not persisted.
I will look into this problem today and reply later.
@dongtingting Nice catch. We can persist writestatus to avoid this issue.
Thanks very much for your reply. I am glad to fix it; I will create a PR later.
@dongtingting Thanks a lot. Let us know when you have PR ready.
Creating tracking JIRA for the same - https://issues.apache.org/jira/browse/HUDI-8078
Sorry for the late reply. I have created a PR to fix it: https://github.com/apache/hudi/pull/11811/files. cc @KnightChess
Describe the problem you faced
There is a job that uses bulk insert to insert overwrite a COW table. We found that 4 stages run the bulk insert write, so the data is written 4 times. Only the data from the last stage remains; the data written by the other 3 stages is eventually removed during finalize write.
All four stages marked with the red line perform the bulk insert write.
More details about the four stages:
This is because all four stages use the writestatus RDD. The writeStatus RDD in DatasetBulkInsertOverwriteCommitActionExecutor is not persisted, which causes the upstream RDD (the bulk insert write) to run 4 times.
DatasetBulkInsertOverwriteCommitActionExecutor getPartitionToReplacedFileIds: use writestatus isEmpty
DatasetBulkInsertOverwriteCommitActionExecutor getPartitionToReplacedFileIds: use writestatus distinct
HoodieSparkSqlWriter.commitAndPerformPostOperations: use writestatus count
HoodieSparkSqlWriter.commitAndPerformPostOperations: use writestatus collect
Upsert (which does not use bulk insert) does not have this problem, because it persists writestatus. But BaseDatasetBulkInsertCommitActionExecutor does not persist writestatus. I think we should persist the RDD at the beginning of BaseDatasetBulkInsertCommitActionExecutor.buildHoodieWriteMetadata. Does anyone agree?
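To illustrate the recomputation described above, here is a minimal plain-Java model. It does not use the Spark or Hudi APIs. The names PersistSketch, LazyDataset and runFourActions are hypothetical and exist only for this sketch. It assumes, as a simplification, that every action on an unpersisted lazy dataset re-runs the whole upstream computation, and it counts how often the simulated write runs with and without persist.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical model only: not Spark's RDD and not Hudi code.
class PersistSketch {

    /** Stand-in for a lazy dataset: each action re-runs the upstream supplier unless persisted. */
    static class LazyDataset<T> {
        private final Supplier<List<T>> upstream;
        private boolean persistRequested;
        private List<T> cached; // filled by the first action after persist()

        LazyDataset(Supplier<List<T>> upstream) {
            this.upstream = upstream;
        }

        LazyDataset<T> persist() {
            persistRequested = true;
            return this;
        }

        private List<T> compute() {
            if (!persistRequested) {
                return upstream.get(); // recompute on every action
            }
            if (cached == null) {
                cached = upstream.get(); // compute once, then reuse
            }
            return cached;
        }

        boolean isEmpty() { return compute().isEmpty(); }
        long count() { return compute().size(); }
        List<T> distinct() { return new ArrayList<>(new LinkedHashSet<>(compute())); }
        List<T> collect() { return new ArrayList<>(compute()); }
    }

    /** Runs the four actions named in the issue and returns how many times the simulated write ran. */
    static int runFourActions(boolean persist) {
        AtomicInteger writeRuns = new AtomicInteger();
        LazyDataset<String> writeStatus = new LazyDataset<>(() -> {
            writeRuns.incrementAndGet(); // stands in for the bulk insert write stage
            return List.of("file-1", "file-2");
        });
        if (persist) {
            writeStatus.persist();
        }
        writeStatus.isEmpty();
        writeStatus.distinct();
        writeStatus.count();
        writeStatus.collect();
        return writeRuns.get();
    }

    public static void main(String[] args) {
        System.out.println("without persist: " + runFourActions(false) + " write runs");
        System.out.println("with persist: " + runFourActions(true) + " write runs");
    }
}
```

In this model the write runs 4 times without persist and once with it, which matches the four stages reported here. Spark's actual scheduling and storage levels are more involved than this sketch.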
To Reproduce
Steps to reproduce the behavior:
Create a COW table test_table, using the simple index.
Insert overwrite the table using bulk insert:
insert overwrite test_table partition (p_date = '20240806') select id, name, p_date from source table