[BUG] Tools have some duplicate rows in data_source_information CSV file

amahussein commented 2 months ago

Describe the bug

Running both the Qualification and Profiling (Q/P) tools on the event logs in ${QUAL_DATA}/dataproc-2.1/nds-h/powerrun-unoptimized-3k-parquet-train/eventlogs/cpu/ produces some duplicate entries in data_source_information.csv.

There are a couple of possibilities:

Some nodes are processed twice, once as datasourceV1 and once as datasourceV2.
The entries differ in a column such as "PartitionFilters", but the output does not show the difference.

amahussein commented 3 weeks ago

This issue is caused by the way we process the metadata and try to link it to a specific nodeId. Inside checkMetadataForReadSchema we always pick the first node in the list. This leads to multiple rows containing the same nodeId even though they come from different plans/nodes. It also affects the correctness of the IO metrics, because we completely ignore the remaining scan nodes in the list.

It is tricky to match nodes back to PlanInfo.

Add a lookup table to avoid matching the same node multiple times. This alone is not enough, because it does not guarantee the best matches.
One possible way is to sort the lists by the length of the metadata fields, then walk them in order for each matching schema. This ensures that long schemas are matched to long nodes first where possible.
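
To make the two ideas above concrete, here is a minimal, language-agnostic sketch in Python. It is not the tools' code: `ScanNode`, `ScanMeta`, and `match_metadata_to_nodes` are hypothetical names, "length" is assumed to mean the number of metadata entries (for metadata) and the description length (for nodes), and a node is assumed to match when its description contains the read schema.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class ScanNode(NamedTuple):
    """Hypothetical stand-in for a scan node in the plan graph."""
    node_id: int
    desc: str


class ScanMeta(NamedTuple):
    """Hypothetical stand-in for scan metadata taken from the plan info."""
    read_schema: str
    metadata: Dict[str, str]


def match_metadata_to_nodes(
    metas: List[ScanMeta], nodes: List[ScanNode]
) -> List[Tuple[ScanMeta, Optional[int]]]:
    # Assumption: "longer" metadata means more metadata entries, and
    # "longer" nodes means a longer description string.
    ordered_metas = sorted(metas, key=lambda m: len(m.metadata), reverse=True)
    ordered_nodes = sorted(nodes, key=lambda n: len(n.desc), reverse=True)

    used = set()  # lookup table: node ids that are already matched
    result = []
    for meta in ordered_metas:
        chosen = None
        for node in ordered_nodes:
            # Skip nodes that were already claimed by another metadata entry.
            if node.node_id in used:
                continue
            # Assumption: a node matches if its description contains the schema.
            if meta.read_schema in node.desc:
                chosen = node.node_id
                used.add(node.node_id)
                break
        result.append((meta, chosen))
    return result


nodes = [
    ScanNode(1, "Scan parquet [a:int]"),
    ScanNode(2, "Scan parquet [a:int] PartitionFilters: [isnotnull(p)]"),
]
metas = [
    ScanMeta("a:int", {"Format": "Parquet"}),
    ScanMeta("a:int", {"Format": "Parquet", "PartitionFilters": "[isnotnull(p)]"}),
]
matched_ids = [node_id for _, node_id in match_metadata_to_nodes(metas, nodes)]
print(matched_ids)  # [2, 1]
```

Always picking the first matching node would assign node 1 to both metadata entries in this example. The lookup table removes the duplicate, and the sort sends the entry with partition filters to the node that actually carries them.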

amahussein commented 2 weeks ago

After some thought, this is the best strategy to fix this issue:

Create a custom ToolsPlanGraph builder that accepts a callback function. While visiting the nodes recursively, the callback captures all ReadNodes and pulls the metadata from the SparkPlanInfo, if any. This change will affect the flow of the tools: once the implementation is in place, we can use it for both V1/V2 reads and get rid of the legacy implementation that relies only on sparkPlanGraph.
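
The shape of that idea can be sketched as follows. This is a language-agnostic illustration in Python, not the actual ToolsPlanGraph or SparkPlanInfo API: `PlanInfo`, `GraphNode`, `build_graph`, and `collect_reads` are hypothetical names, node ids are assumed to be assigned in pre-order, and read nodes are assumed to be recognizable by a name prefix.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PlanInfo:
    """Hypothetical stand-in for a plan info tree node."""
    node_name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    children: List["PlanInfo"] = field(default_factory=list)


@dataclass
class GraphNode:
    """Hypothetical stand-in for a node in the built plan graph."""
    node_id: int
    name: str


def build_graph(
    root: PlanInfo, on_node: Callable[[GraphNode, PlanInfo], None]
) -> List[GraphNode]:
    """Build graph nodes recursively, invoking the callback for each one."""
    nodes: List[GraphNode] = []

    def visit(info: PlanInfo) -> None:
        # Assumption: ids are handed out in pre-order during the walk.
        node = GraphNode(len(nodes), info.node_name)
        nodes.append(node)
        # The callback sees the graph node and its originating plan info
        # together, so no matching step is needed afterwards.
        on_node(node, info)
        for child in info.children:
            visit(child)

    visit(root)
    return nodes


def collect_reads(sink: Dict[int, Dict[str, str]]):
    """Return a callback that records metadata for read nodes only."""
    def callback(node: GraphNode, info: PlanInfo) -> None:
        # Assumption: read nodes are identified by these name prefixes.
        if node.name.startswith(("Scan", "BatchScan")):
            sink[node.node_id] = dict(info.metadata)
    return callback


plan = PlanInfo("Project", children=[
    PlanInfo("SortMergeJoin", children=[
        PlanInfo("Scan parquet", {"ReadSchema": "struct<a:int>"}),
        PlanInfo("Scan parquet", {"ReadSchema": "struct<b:string>"}),
    ]),
])
reads: Dict[int, Dict[str, str]] = {}
graph = build_graph(plan, collect_reads(reads))
```

Because the metadata is captured at the moment each node is created, every read node keeps its own metadata, and two scans with similar schemas can no longer collapse onto the same nodeId.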

NVIDIA / spark-rapids-tools

[BUG] Tools have some duplicate rows in data_source_information CSV file #1247