NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[BUG] Tools have some duplicate rows in data_source_information CSV file #1247

Open amahussein opened 2 months ago

amahussein commented 2 months ago

Describe the bug

Running both Q/P tools on the eventlog ${QUAL_DATA}/dataproc-2.1/nds-h/powerrun-unoptimized-3k-parquet-train/eventlogs/cpu/ shows some duplicate entries in data_source_information.csv.

There are a couple of possibilities:

amahussein commented 3 weeks ago

This issue is caused by the way we process the metadata and try to link it to a specific nodeId. Inside checkMetadataForReadSchema, we always pick the first node in the list. This leads to multiple rows containing the same nodeId even though the metadata comes from different plans/nodes. It also affects the correctness of the IO metrics because we completely ignore the remaining scan nodes in the list.

It is tricky to match nodes back to PlanInfo.
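
To make the failure mode concrete, here is a minimal Scala sketch. It is not the actual checkMetadataForReadSchema code: `ScanNode`, `DataSourceRow`, `FirstNodeLinking`, and the sample locations are hypothetical stand-ins, and the only behavior it is meant to reproduce is "every metadata entry gets linked to the first scan node in the list".

```scala
// Hypothetical, simplified types (assumptions, not the tools' real classes).
case class ScanNode(id: Long, name: String)
case class DataSourceRow(nodeId: Long, location: String)

object FirstNodeLinking {
  // Mimics the problematic behavior: each metadata entry is attached to
  // scans.head, so the remaining scan nodes are never referenced.
  def linkToFirst(scans: Seq[ScanNode], locations: Seq[String]): Seq[DataSourceRow] =
    locations.map(loc => DataSourceRow(scans.head.id, loc))
}

object FirstNodeLinkingDemo {
  def main(args: Array[String]): Unit = {
    val scans = Seq(ScanNode(1L, "Scan parquet"), ScanNode(5L, "Scan parquet"))
    val rows = FirstNodeLinking.linkToFirst(scans, Seq("/data/lineitem", "/data/orders"))
    // Both rows carry the nodeId of the first scan; node 5 is ignored entirely.
    rows.foreach(println)
  }
}
```

In this sketch the two rows come from different scans but share one nodeId, which matches the duplicate rows seen in the CSV.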

amahussein commented 2 weeks ago

After some thought, this is the best strategy to fix this issue:

Create a custom ToolsPlanGraph builder that accepts a callback function. While visiting the nodes recursively, the callback captures all ReadNodes and pulls the metadata from the SparkPlanInfo, if any. This will affect the flow of the tools: once this implementation is in place, we can use it for both V1/V2 reads and get rid of the legacy implementation that relies only on sparkPlanGraph.
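
A rough Scala sketch of the callback idea follows, under stated assumptions. `SimplePlanInfo` is a hypothetical stand-in for SparkPlanInfo, `CallbackGraphBuilder` is not the real ToolsPlanGraph builder, and the pre-order id assignment and name-prefix test for read nodes are simplifications that may not match how Spark or the tools number and classify nodes. The point is only that the callback sees the graph node and its plan info together, so metadata is captured per node instead of being matched back afterwards.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for SparkPlanInfo: a plan tree with per-node metadata.
case class SimplePlanInfo(
    nodeName: String,
    metadata: Map[String, String] = Map.empty,
    children: Seq[SimplePlanInfo] = Seq.empty)

case class GraphNode(id: Long, name: String)

// Metadata captured for one read node, keyed by the id of that exact node.
case class ReadMeta(nodeId: Long, location: String, readSchema: String)

object CallbackGraphBuilder {
  // Assumption: read nodes are recognized by a name prefix.
  private def isReadNode(name: String): Boolean =
    name.startsWith("Scan") || name.startsWith("BatchScan")

  // Builds graph nodes recursively and invokes onNode for every node,
  // passing the node together with the plan info it was created from.
  def build(root: SimplePlanInfo)(onNode: (GraphNode, SimplePlanInfo) => Unit): Seq[GraphNode] = {
    val nodes = ArrayBuffer.empty[GraphNode]
    var nextId = 0L
    def visit(info: SimplePlanInfo): Unit = {
      val node = GraphNode(nextId, info.nodeName)
      nextId += 1
      nodes += node
      onNode(node, info)
      info.children.foreach(visit)
    }
    visit(root)
    nodes.toSeq
  }

  // Example callback: collect metadata for each read node under its own id.
  def collectReads(root: SimplePlanInfo): Seq[ReadMeta] = {
    val reads = ArrayBuffer.empty[ReadMeta]
    build(root) { (node, info) =>
      if (isReadNode(node.name)) {
        reads += ReadMeta(
          node.id,
          info.metadata.getOrElse("Location", "unknown"),
          info.metadata.getOrElse("ReadSchema", "unknown"))
      }
    }
    reads.toSeq
  }
}
```

With a plan that has two scans under a join, `collectReads` returns one `ReadMeta` per scan, each with a distinct nodeId, rather than two rows pointing at the first scan.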