Open dongtingting opened 1 week ago
@beyond1920 any insights here?
@danny0405 @dongtingting Good point. I think your analysis is reasonable. Generating the file ID in the driver could avoid different file group IDs for the same bucket ID, but it might cost too much memory in some cases.
Describe the problem you faced
I have one job that inserts into a new partition. Job attempt 1 failed due to a shuffle fetch failure (an internal environment problem). I reran this job (job attempt 2), but it failed and threw an exception:
Job env and parameters:
I checked that job attempt 1 left multiple files with different fileIdPrefix values on bucket 735 and other buckets, like this:
I analyzed this problem and tried to answer the following questions:
Why did job attempt 1 generate multiple file IDs for the same bucket? There were task retries and Spark speculative tasks, which caused multiple tasks to try to write the same bucket file. Because Hudi generates the bucket file ID in the task (SparkBucketIndexPartitioner.getBucketInfo) rather than in the driver, multiple writer tasks for the same bucket generate different UUIDs. As a result, the same bucket ends up with multiple file IDs.
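The race above can be sketched as follows. This is a hypothetical simplification, not actual Hudi code: the class, method name, and ID format are made up for illustration; the point is only that each task invocation draws a fresh random UUID, so a retried or speculative task writing the same bucket produces a different file ID prefix.

```java
import java.util.UUID;

public class BucketIdSketch {
    // Hypothetical simplification of task-side file ID creation:
    // every task invocation draws a fresh random UUID, so a retried or
    // speculative task writing the same bucket gets a different prefix.
    static String taskSideFileId(int bucketId) {
        return String.format("%08d-%s", bucketId, UUID.randomUUID());
    }

    public static void main(String[] args) {
        String original = taskSideFileId(735);    // first task attempt
        String speculative = taskSideFileId(735); // speculative/retried attempt
        // Same bucket ID, but the random suffixes differ, so the table
        // ends up with two file groups for bucket 735.
        System.out.println(original.equals(speculative)); // prints false
    }
}
```

Generating the ID in the driver (or deriving it deterministically from the bucket ID) would make both attempts produce the same file ID, at the memory cost mentioned above.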
Why did job attempt 2 find multiple files and throw an exception? When job attempt 2 runs tagLocation, it calls HoodieSimpleBucketIndex.loadBucketIdToFileIdMappingForPartition, which eventually calls AbstractTableFileSystemView.getLatestFileSlicesBeforeOrOn. That function scans the files in the partition and builds file slices, but it does not filter out files from failed commits. So in this case, job attempt 2 found the leftover files from failed job attempt 1, and since attempt 1 had produced multiple file IDs for the same bucket, the exception was thrown.
(HoodieSimpleBucketIndex.loadBucketIdToFileIdMappingForPartition -> HoodieIndexUtils.getLatestFileSlicesForPartition -> AbstractTableFileSystemView.getLatestFileSlicesBeforeOrOn)
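The missing filtering step could look roughly like the sketch below. This is a hypothetical standalone illustration, not the real AbstractTableFileSystemView API: the map shape and method names are assumptions. It only shows the idea of dropping files whose commit is not in the completed-commit set before building the bucket-to-file mapping, so leftovers from a failed attempt are ignored.

```java
import java.util.*;
import java.util.stream.*;

public class CommitFilterSketch {
    // Hypothetical sketch: given a map of file name -> commit time it was
    // written by, keep only files whose commit actually completed. Files
    // written by a failed attempt carry a commit time that never completed,
    // so they are excluded from the bucketId -> fileId mapping.
    static List<String> filterToCompleted(Map<String, String> fileToCommit,
                                          Set<String> completedCommits) {
        return fileToCommit.entrySet().stream()
                .filter(e -> completedCommits.contains(e.getValue()))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> fileToCommit = new HashMap<>();
        fileToCommit.put("fileA-bucket735", "001"); // written by failed attempt 1
        fileToCommit.put("fileB-bucket735", "002"); // written by a completed commit
        Set<String> completed = Set.of("002");      // "001" never completed

        // Only the file from the completed commit survives, so bucket 735
        // maps to a single file ID again.
        System.out.println(filterToCompleted(fileToCommit, completed));
        // prints [fileB-bucket735]
    }
}
```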
I think maybe we can optimize these two points:
Can anyone help check whether the above optimizations are reasonable?