NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[BUG] Missing Metrics in Photon Event Logs Affecting QualX Predictions #1388

Closed parthosa closed 1 month ago

parthosa commented 1 month ago

Describe the bug

Photon event logs do not record certain metrics (scan time, shuffle write time, and peak execution memory) in the same format as CPU Spark event logs. QualX uses these metrics as input features for its predictions, so the corresponding features are missing when it processes Photon event logs.

Missing Metrics/Features

Feature                        Type
scan_time                      Spark Metric
sw_writeTime_mean              Spark Metric
peakExecutionMemory_max        Spark Metric
sqlOp_SubqueryBroadcast        Exec
sqlOp_RunningWindowFunction    Exec
sqlOp_Expand                   Exec

Solution

After investigation, we found alternative ways to calculate some of these metrics:

  1. PhotonScan nodes provide a cumulative time metric that can be used as a replacement for the scan time metric.
  2. Shuffle write time can be reconstructed using the following metrics:
    1. time taken waiting on file write IO (part of shuffle file write)
    2. time taken to sort rows by partition ID (part of shuffle file write)
    3. time taken to convert columns to rows (part of shuffle file write)
  3. Photon nodes provide a peak memory usage metric, which can be used for the peak execution memory metric.
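
The three substitutions above can be sketched in Python as follows. This is only an illustration, not the code used by the tools: the record layout (`node`, `metric`, `value` keys), the function name `derive_photon_metrics`, and the output keys are hypothetical, the metric name strings are copied from this issue and may not match the event logs character for character, and summing the three shuffle components is an assumption about what "reconstructed" means here. Per-task aggregation (for example the mean in `sw_writeTime_mean`) is not covered.

```python
# Metric names as quoted in this issue; the exact strings in real
# Photon event logs are an assumption.
SCAN_TIME_METRIC = "cumulative time"
PEAK_MEMORY_METRIC = "peak memory usage"
SHUFFLE_WRITE_PARTS = (
    "time taken waiting on file write IO (part of shuffle file write)",
    "time taken to sort rows by partition ID (part of shuffle file write)",
    "time taken to convert columns to rows (part of shuffle file write)",
)


def derive_photon_metrics(records):
    """Derive stand-ins for the missing CPU Spark metrics.

    `records` is an iterable of dicts with the hypothetical keys
    'node' (plan node name), 'metric' (metric name) and 'value' (number).
    """
    scan_time = 0
    shuffle_write_time = 0
    peak_memory = 0
    for rec in records:
        node, metric, value = rec["node"], rec["metric"], rec["value"]
        if node.startswith("PhotonScan") and metric == SCAN_TIME_METRIC:
            # 1. Cumulative time of PhotonScan nodes replaces scan time.
            scan_time += value
        elif metric in SHUFFLE_WRITE_PARTS:
            # 2. Assumed: shuffle write time is the sum of its three parts.
            shuffle_write_time += value
        elif node.startswith("Photon") and metric == PEAK_MEMORY_METRIC:
            # 3. Largest peak memory usage reported by any Photon node.
            peak_memory = max(peak_memory, value)
    return {
        "scan_time": scan_time,
        "shuffle_write_time": shuffle_write_time,
        "peak_execution_memory": peak_memory,
    }
```

Cumulative time reported by non-scan Photon nodes is deliberately ignored, since only PhotonScan nodes are named as a replacement for scan time.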

cc: @amahussein @leewyang