NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Tools should Identify the delta log operations and generate views for non-delta logs #1023

Closed amahussein closed 1 month ago

amahussein commented 1 month ago

Is your feature request related to a problem? Please describe.

We have CPU and GPU event logs from Databricks where the SQL IDs do not line up, so the runs cannot be compared directly. After investigating, I found that most of the mismatches are due to delta log metadata reads. This includes reads of delta checkpoint files and the delta_log JSON files, as well as delta caching operations.

This adds a new option to the profiler tool, --output-sql-ids-aligned, that causes the tool to output a new table, and optionally a CSV file, with the SQL IDs of all delta-log-related operations stripped out. The table simply lists appId and sqlIds in sorted order.
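
To illustrate the idea, here is a minimal Python sketch of the filtering step. It is not the tool's actual implementation: the `SqlExecution` record, the `plan_text` field, and the substring markers used to detect delta log reads are all assumptions made for this example, and delta caching operations are not covered.

```python
from typing import Iterable, List, NamedTuple, Tuple


class SqlExecution(NamedTuple):
    """Hypothetical record for one SQL execution parsed from an event log."""
    app_id: str
    sql_id: int
    plan_text: str  # assumed: a plan description that includes scanned paths


# Assumed markers for delta log metadata reads. The real tool may use
# different or additional heuristics.
DELTA_LOG_MARKERS = ("_delta_log", ".checkpoint.parquet")


def is_delta_log_op(plan_text: str) -> bool:
    """Return True if the plan text looks like a delta log metadata read."""
    lowered = plan_text.lower()
    return any(marker in lowered for marker in DELTA_LOG_MARKERS)


def aligned_sql_ids(executions: Iterable[SqlExecution]) -> List[Tuple[str, int]]:
    """Drop delta-log-related executions and return (appId, sqlId) pairs sorted."""
    kept = [(e.app_id, e.sql_id) for e in executions
            if not is_delta_log_op(e.plan_text)]
    return sorted(kept)


if __name__ == "__main__":
    sample = [
        SqlExecution("app-1", 3, "Scan parquet dbfs:/t/part-0001.parquet"),
        SqlExecution("app-1", 1, "Scan json dbfs:/t/_delta_log/00000000000000000010.json"),
        SqlExecution("app-1", 0, "Scan parquet dbfs:/t/part-0000.parquet"),
        SqlExecution("app-1", 2, "Scan parquet dbfs:/t/_delta_log/00000000000000000010.checkpoint.parquet"),
    ]
    for app_id, sql_id in aligned_sql_ids(sample):
        print(app_id, sql_id)
```

Running the same filter over the CPU log and the GPU log would yield two sorted lists of remaining SQL IDs that can be compared position by position, which is the comparison the aligned table is meant to support.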