Open amahussein opened 2 months ago
This issue is caused by the way we process the metadata and try to link it to a specific nodeId.
Inside checkMetadataForReadSchema
we always pick the first node in the list. This leads to multiple rows containing the same nodeId even though they come from different plans/nodes.
This affects the correctness of the IO metrics because we completely ignore the remaining scan nodes in the list.
It is tricky to match nodes back to PlanInfo.
After some thought, this is the best strategy to fix this issue:
Create a custom ToolsPlanGraph builder that accepts a callback function. While visiting the nodes recursively, the callback captures every ReadNode and pulls its metadata from the SparkPlanInfo, if any. This change will affect the flow of the tools: once the implementation is in place, we can use it for both V1 and V2 reads and get rid of the legacy implementation that relies only on sparkPlanGraph.
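A minimal sketch of the callback idea, to make the proposal concrete. Everything here is a hypothetical stand-in: `PlanInfo`, `GraphNode`, `ReadMeta`, `GraphBuilderSketch`, and the `"Location"`/`"Format"` metadata keys are assumptions for illustration, not the actual tools or Spark classes, and the id assignment (depth-first, parent before children) is simplified and may not match how the real graph numbers its nodes.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for SparkPlanInfo and the graph node (assumptions).
final case class PlanInfo(
    nodeName: String,
    metadata: Map[String, String],
    children: Seq[PlanInfo])
final case class GraphNode(id: Long, name: String)
final case class ReadMeta(nodeId: Long, location: String, format: String)

object GraphBuilderSketch {
  // Walks the plan tree depth-first, assigns ids in visit order, and calls
  // onNode for every node while its PlanInfo is still available.
  def build(root: PlanInfo)(onNode: (GraphNode, PlanInfo) => Unit): Seq[GraphNode] = {
    val nodes = ArrayBuffer.empty[GraphNode]
    var nextId = 0L
    def visit(info: PlanInfo): Unit = {
      val node = GraphNode(nextId, info.nodeName)
      nextId += 1
      nodes += node
      onNode(node, info)
      info.children.foreach(visit)
    }
    visit(root)
    nodes.toSeq
  }

  // Callback that records one entry per scan node, keyed by that node's own
  // id, instead of attributing all metadata to the first node in a list.
  def collectReads(root: PlanInfo): Seq[ReadMeta] = {
    val reads = ArrayBuffer.empty[ReadMeta]
    build(root) { (node, info) =>
      // Assumption: read nodes are recognized by a "Scan" name prefix.
      if (info.nodeName.startsWith("Scan")) {
        reads += ReadMeta(
          node.id,
          info.metadata.getOrElse("Location", "unknown"),
          info.metadata.getOrElse("Format", "unknown"))
      }
    }
    reads.toSeq
  }
}
```

Because the callback sees the node and its plan info together, a plan with two scans yields two entries with distinct node ids, and no lookup back into PlanInfo is needed afterwards.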
Describe the bug
Running both Q/P tools on eventlog
${QUAL_DATA}/dataproc-2.1/nds-h/powerrun-unoptimized-3k-parquet-train/eventlogs/cpu/
shows some duplicate entries in data_source_information.csv.
There are a couple of possibilities: