apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Spark planner choose broadcast hash join for large HUDI data source #10343

Closed beyond1920 closed 5 months ago

beyond1920 commented 7 months ago

After applying HUDI-6941 to our internal HUDI version (based on 0.14.0), the execution plan frequently selects a "broadcast hash join" that broadcasts a large HUDI data source.


I investigated the cause of this issue. In these cases, the HUDI source is read through a HadoopFsRelation, and Spark's JoinSelection calls HadoopFsRelation#sizeInBytes to estimate the relation size when deciding whether to use a broadcast join. HadoopFsRelation#sizeInBytes in turn calls HoodieFileIndex#sizeInBytes. But at that moment no partitions have been loaded yet, because Hudi's file-index implementation uses the lazy file-listing mode by default. So HoodieFileIndex#cachedAllInputFileSlices is an empty map, HadoopFsRelation#sizeInBytes returns 0, and the planner produces a suboptimal join plan. Since HUDI-6941 enables the lazy listing mode by default in more cases, the issue has become more frequent.
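To illustrate why a 0-byte estimate is so dangerous, here is a minimal sketch (hypothetical class and method names, not Spark's actual code) of the decision JoinSelection effectively makes: a relation whose estimated size is at or below spark.sql.autoBroadcastJoinThreshold (default 10 MB) becomes a broadcast candidate, so an unlisted table reporting 0 bytes always qualifies.

```java
// Sketch of Spark's broadcast-candidate check (simplified, hypothetical names).
public class BroadcastDecision {
    // Default value of spark.sql.autoBroadcastJoinThreshold: 10 MB.
    static final long DEFAULT_THRESHOLD = 10L * 1024 * 1024;

    // A relation is broadcastable when its size estimate fits under the threshold.
    static boolean canBroadcast(long sizeInBytes) {
        return sizeInBytes >= 0 && sizeInBytes <= DEFAULT_THRESHOLD;
    }

    public static void main(String[] args) {
        // A lazy HoodieFileIndex that has listed no partitions reports 0 bytes,
        // so even a multi-terabyte table looks broadcastable.
        System.out.println(canBroadcast(0L));                       // true
        // A correctly estimated 5 GB table would never be broadcast.
        System.out.println(canBroadcast(5L * 1024 * 1024 * 1024));  // false
    }
}
```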

beyond1920 commented 7 months ago

@xuzifu666 @codope Please help me confirm whether my analysis of this issue is correct. Would it be better for HoodieFileIndex#sizeInBytes to return Long.MAX_VALUE instead of 0 if the FileIndex has not done partition pruning yet?
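A minimal sketch of what such a guard could look like (hypothetical method shape and field names, not Hudi's actual implementation): when lazy listing has not materialized any file slices, fall back to a pessimistic estimate so the planner never picks a broadcast plan by accident.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SizeEstimate {
    // Hypothetical guard: if lazy listing hasn't materialized any file slices,
    // report Long.MAX_VALUE so a large table is never mistaken for an empty one.
    static long sizeInBytes(Map<String, List<String>> cachedAllInputFileSlices,
                            long listedTotalBytes) {
        if (cachedAllInputFileSlices.isEmpty()) {
            return Long.MAX_VALUE; // pessimistic estimate before listing
        }
        return listedTotalBytes;   // real estimate once slices are listed
    }

    public static void main(String[] args) {
        // Before listing: pessimistic.
        System.out.println(sizeInBytes(Collections.emptyMap(), 0L) == Long.MAX_VALUE);
        // After listing: use the measured size.
        System.out.println(sizeInBytes(Map.of("p=1", List.of("slice1")), 4096L));
    }
}
```

The trade-off is that an overestimate can only push the planner toward a shuffle join, which is safe but possibly slower for genuinely small tables.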

beyond1920 commented 7 months ago

I found the root cause: the query job does not set spark.sql.extensions to HoodieSparkSessionExtension, so the HoodiePruneFileSourcePartitions rule does not take effect.
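For reference, this is how the extension is normally wired in (the standard config from the Hudi Spark setup; the other flags on your job may vary):

```shell
spark-submit \
  --conf spark.sql.extensions=org.apache.hudi.HoodieSparkSessionExtension \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  your-job.jar
```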

BTW, should we return an overestimated size rather than 0 from HoodieFileIndex#sizeInBytes for query jobs that forget to set HoodieSparkSessionExtension, to avoid broadcasting a very large HUDI table, as in this patch commit#be9cf?
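As a stopgap for jobs that cannot be changed to add the extension, automatic broadcast selection can also be disabled entirely with a standard Spark config (this is a general Spark knob, not part of the proposed patch):

```shell
# -1 disables automatic broadcast joins; explicit broadcast hints still work.
spark-submit \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  your-job.jar
```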