Closed: beyond1920 closed this issue 5 months ago
@xuzifu666 @codope Please help me confirm whether my analysis of this issue is correct. Would it be better for FileIndex#sizeInBytes to return Long.MAX_VALUE instead of 0 when the FileIndex has not done partition pruning yet?
I have already found the root cause: the query job does not set the extensions config to HoodieSparkSessionExtension, so HoodiePruneFileSourcePartitions does not take effect.
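For reference, this is roughly how the extension is enabled in a Spark job (the class name below follows the usual Hudi setup docs; adjust to your deployment):

```shell
spark-sql \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```

Without this conf, Hudi's optimizer rules such as HoodiePruneFileSourcePartitions are never injected into the session.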
BTW, should we use an overestimated size rather than 0 in HoodieFileIndex#sizeInBytes for query jobs that forget to set HoodieSparkSessionExtension, to avoid broadcasting a very large Hudi table, as in this patch commit#be9cf?
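The idea above could look something like the following. This is a minimal, hypothetical sketch (not Hudi's actual code): the class name and the map of file-slice sizes are stand-ins for HoodieFileIndex and its cachedAllInputFileSlices field.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SizeEstimateSketch {
    // Stand-in for HoodieFileIndex#cachedAllInputFileSlices:
    // partition path -> sizes (in bytes) of its file slices.
    private final Map<String, List<Long>> cachedAllInputFileSlices;

    public SizeEstimateSketch(Map<String, List<Long>> slices) {
        this.cachedAllInputFileSlices = slices;
    }

    public long sizeInBytes() {
        if (cachedAllInputFileSlices.isEmpty()) {
            // No partitions listed yet (lazy listing mode): report a
            // conservative overestimate rather than 0, so the planner
            // never falsely qualifies the table for a broadcast join.
            return Long.MAX_VALUE;
        }
        // Otherwise sum the actual file slice sizes.
        return cachedAllInputFileSlices.values().stream()
            .flatMap(List::stream)
            .mapToLong(Long::longValue)
            .sum();
    }
}
```

With this change, a job that skipped partition pruning would fall back to a sort-merge or shuffled hash join instead of broadcasting the full table.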
After applying HUDI-6941 in our internal Hudi version (based on 0.14.0), the execution plan frequently selects a broadcast hash join that broadcasts a large Hudi data source.
I tried to investigate the cause of this issue. In those cases, a HadoopFsRelation is used to read the Hudi source, and Spark's JoinSelection calls HadoopFsRelation#sizeInBytes to estimate the relation size and decide whether to use a broadcast join. HadoopFsRelation#sizeInBytes in turn calls HoodieFileIndex#sizeInBytes. But at that moment no partitions are loaded, because Hudi's file-index implementation uses the lazy file-listing mode by default. So FileIndex#cachedAllInputFileSlices is an empty map, HadoopFsRelation#sizeInBytes returns 0, and this causes the suboptimal join plan. After applying HUDI-6941, more cases enable the lazy listing mode by default, so the issue has become more frequent.