Since we rely on AccumulableIDs, it is known that we cannot bind execs that do not have metrics to stages.
The Q tool applies some heuristics to make a best-effort assignment of execs to stages during the speedup calculations. However, this is done as an intermediate step and is not reported anywhere.
For example, the latest unsupported-operators report, rapids_4_spark_qualification_output_unsupportedOperators.csv, lists Project and other execs as having stageID = -1.
This affects anyone trying to verify the speedup calculations or to aggregate unsupported_ops per stage.
Expected behavior
The stage IDs that the heuristics assign to execs should be part of the final generated report of the execs. If there is a concern about mixing facts with estimations, we can add another column stating which heuristic was used to assign an exec to a stage.
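To make the proposal concrete, here is a minimal sketch of a report row that carries both the stage ID and where it came from. It is only an illustration: `StageSource`, `ExecRow`, `write_unsupported_ops`, and the column names are hypothetical and do not reflect the tool's actual report schema.

```python
import csv
import io
from dataclasses import dataclass
from enum import Enum


class StageSource(Enum):
    # Hypothetical labels for how a stage ID was obtained.
    METRICS = "metrics"      # resolved from the exec's accumulable IDs
    HEURISTIC = "heuristic"  # estimated by a best-effort heuristic
    UNKNOWN = "unknown"      # no stage could be assigned


@dataclass
class ExecRow:
    exec_name: str
    stage_id: int            # -1 when no stage is known
    stage_source: StageSource


def write_unsupported_ops(rows):
    """Render rows as CSV text with an extra column for the stage ID source."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Assumed column names, chosen for this example only.
    writer.writerow(["Exec", "Stage ID", "Stage ID Source"])
    for row in rows:
        writer.writerow([row.exec_name, row.stage_id, row.stage_source.value])
    return buf.getvalue()


rows = [
    ExecRow("Filter", 3, StageSource.METRICS),
    ExecRow("Project", 3, StageSource.HEURISTIC),
    ExecRow("Project", -1, StageSource.UNKNOWN),
]
report = write_unsupported_ops(rows)
print(report)
```

With a column like this, a reader can aggregate per stage and still filter out estimated assignments when only metric-backed facts are wanted.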
I found one of the problems: getStageToExec does not update the stageSet written to the ExecInfo. That is why there is a gap between what the stageMap tells us and what PlanInfos.execInfo contains.
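The gap can be illustrated with a small sketch in Python rather than the tool's own code. It assumes a simplified `ExecInfo` stand-in and a `heuristic_stages` lookup, both hypothetical; the real getStageToExec and its heuristics may look quite different. The point is only the write-back step that keeps the map and the exec records in agreement.

```python
from dataclasses import dataclass, field


@dataclass
class ExecInfo:
    # Hypothetical stand-in for the tool's ExecInfo; only the fields needed here.
    node_id: int
    name: str
    stages: set = field(default_factory=set)


def get_stage_to_exec(execs, heuristic_stages):
    """Group execs by stage and write heuristic assignments back to each exec.

    heuristic_stages maps node_id -> set of stage IDs estimated for execs
    without metrics. It is an assumed input, not the tool's real heuristic.
    """
    stage_map = {}
    for info in execs:
        if not info.stages:
            # Write-back step: without it, the map and the exec record disagree.
            info.stages = set(heuristic_stages.get(info.node_id, set()))
        for stage_id in info.stages:
            stage_map.setdefault(stage_id, []).append(info)
    return stage_map


execs = [ExecInfo(1, "Filter", {3}), ExecInfo(2, "Project")]
stage_map = get_stage_to_exec(execs, {2: {3}})
```

After the call, the `Project` exec carries stage 3 in its own record, so a report built from the exec records would agree with the stage map instead of falling back to -1.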