NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

Profiler should identify the delta log ops and generate views for non-delta logs #1031

Closed amahussein closed 4 months ago

amahussein commented 4 months ago

Signed-off-by: Ahmed Hussein (amahussein) a@ahussein.me

Fixes #1023

Credits to @tgravescs

This adds a new option to the profiler tool, --output-sql-ids-aligned, which makes the tool output a new table, and optionally a CSV file, with the sqlIDs of all delta-log-related operations stripped out. The table simply lists the application (shown as appIndex in the output) and the remaining sqlIDs in sorted order.

The idea is to run the tool with this option on comparable CPU and GPU event logs, then join the two resulting tables to see which sqlIDs correspond between the CPU run and the GPU run.
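As a rough illustration of that join, here is a minimal Python sketch. It assumes (this is not stated in the PR) that once the delta log ops are stripped, the n-th remaining sqlID of the CPU run corresponds to the n-th remaining sqlID of the GPU run, so the tables are paired by position. The function name `align_sql_ids` and the sample rows are hypothetical and are not part of the tool.

```python
from itertools import zip_longest


def align_sql_ids(cpu_rows, gpu_rows):
    """Pair cleaned CPU and GPU sqlIDs by position.

    Each input is a list of (appIndex, sqlID) tuples, i.e. the two columns
    of the cleaned table. ASSUMPTION: rows align by position after sorting.
    If one run has more rows, the missing side is filled with None.
    """
    cpu_ids = sorted(sql_id for _app, sql_id in cpu_rows)
    gpu_ids = sorted(sql_id for _app, sql_id in gpu_rows)
    return list(zip_longest(cpu_ids, gpu_ids))


# Hypothetical sample data, not taken from real event logs.
cpu_rows = [(1, 0), (1, 1), (1, 2), (1, 5)]
gpu_rows = [(1, 0), (1, 1), (1, 3), (1, 7)]

pairs = align_sql_ids(cpu_rows, gpu_rows)
for cpu_id, gpu_id in pairs:
    print(f"cpu sqlID {cpu_id} <-> gpu sqlID {gpu_id}")
```

A `None` on either side of a pair would suggest the two runs are not directly comparable at that point and need a manual look.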

For testing

Output

SQL Ids Cleaned For Alignment:
+--------+-----+
|appIndex|sqlID|
+--------+-----+
|1       |0    |
|1       |1    |
|1       |2    |
|1       |3    |
|1       |4    |
|1       |5    |
|1       |9    |
|1       |10   |
|1       |11   |
|1       |12   |
|1       |14   |
|1       |27   |
|1       |35   |
|1       |36   |

sample output files

tgravescs commented 4 months ago

For tracking, the original PR is https://github.com/NVIDIA/spark-rapids-tools/pull/1009

amahussein commented 4 months ago

Thanks @tgravescs. The output matches the output of PR #1009.

Thanks @parthosa