NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

Update QualX to return default speedups and fix App Duration for incomplete apps #1089

Closed parthosa closed 3 weeks ago

parthosa commented 4 weeks ago

Fixes #1058.

Issues

Issue 1:

QualX currently falls back to legacy speedups when the metrics are unavailable (the file is not found, or it is empty after preprocessing). This PR updates the prediction code to return a speedup of 1 for such apps and to log the reason the metrics are missing.

We also introduce a wasPredicted column in per_app.csv as a marker for apps that could not be predicted; it is False for those apps.

Affects:

predict()
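
As a rough illustration of the behavior described for Issue 1, here is a minimal sketch of how a default row could be built for an app whose metrics are missing. The function name `default_prediction`, the logger name, and the dict layout are hypothetical and not taken from the QualX code; the column names and the zeroed values mirror the per_app.csv sample in this PR.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("qualx_sketch")

DEFAULT_SPEEDUP = 1.0


def default_prediction(app_id, app_duration, reason):
    """Build a per-app row for an application that could not be predicted.

    The predicted duration is left equal to the original duration, so the
    speedup is 1.0 and wasPredicted marks the row as a default.
    """
    logger.warning(
        "Predicted speedup will be %s for %s. Reason: %s",
        DEFAULT_SPEEDUP, app_id, reason,
    )
    return {
        "appId": app_id,
        "appDuration": app_duration,
        "Duration": 0,
        "Duration_pred": 0,
        "Duration_supported": 0,
        "fraction_supported": 0.0,
        "appDuration_pred": app_duration,
        "speedup": DEFAULT_SPEEDUP,
        "wasPredicted": False,
    }


row = default_prediction(
    "application_1715312822xxxx", 46911, "No fully supported stages found."
)
```

Keeping the default in one helper means every fallback path produces the same row shape and the same log message format.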

Issue 2:

In QualX, the CSV metrics from the profiling tool do not include an app duration for incomplete applications. The qualification tool provides an estimated app duration for these.

This PR updates QualX to replace the incorrect app duration in the CSV metrics with the estimated duration from the qualification tool output.

Affects:

train(), compare() and predict()
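
The duration replacement for Issue 2 can be sketched as below. This is an assumption-laden illustration, not the PR's implementation: `patch_app_durations`, the list-of-dicts record layout, and the `appId`-keyed estimate mapping are all hypothetical. The sketch overrides the duration whenever an estimate exists for the app and leaves other records unchanged.

```python
def patch_app_durations(profiler_apps, qual_estimates):
    """Return copies of profiler app records with patched durations.

    profiler_apps: list of dicts with 'appId' and 'appDuration'
        (hypothetical layout for the profiling tool's CSV metrics).
    qual_estimates: dict mapping appId to the estimated app duration
        (hypothetical layout for the qualification tool output).
    """
    patched = []
    for app in profiler_apps:
        fixed = dict(app)  # copy so the caller's records are not mutated
        estimate = qual_estimates.get(app["appId"])
        if estimate is not None:
            fixed["appDuration"] = estimate
        patched.append(fixed)
    return patched


apps = [
    {"appId": "app_incomplete", "appDuration": 0},
    {"appId": "app_complete", "appDuration": 30507},
]
result = patch_app_durations(apps, {"app_incomplete": 46911})
```

Because train(), compare() and predict() are all listed as affected, applying one shared helper before each of them would keep the three paths consistent.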

Output:


CASE 1: No supported stages for any app in the dataset (in this case, a single eventlog)

WARNING spark_rapids_tools.tools.qualx.preprocess: Predicted speedup will be 1.0 for application_171615xxxx. Reason: No fully supported stages found.
WARNING spark_rapids_tools.tools.qualx.qualx_main: Predicted speedup will be 1.0 for dataset: qual_20240607xxxx. Check logs for details.


CASE 2: Metrics unavailable for all apps in the dataset (in this case, a single eventlog)

WARNING spark_rapids_tools.tools.qualx.preprocess: Predicted speedup will be 1.0 for application_1715312822xxx. Reason: Empty feature tables found after preprocessing: application_information, sql_plan_metrics_for_application, job_+_stage_level_aggregated_task_metrics.
WARNING spark_rapids_tools.tools.qualx.qualx_main: Predicted speedup will be 1.0 for dataset: qual_202406071648xxx. Check logs for details.


CASE 3: Metrics unavailable for some apps in the dataset (the exact reason cannot be determined, so a broad reason is shown):

WARNING spark_rapids_tools.tools.qualx.preprocess: Predicted speedup will be 1.0 for application_1715312822xxx, application_1715312822xxx. Reason: Missing features after preprocessing.

Predicted CSV File:

per_app.csv

|------------------------------|----------------------------|-------------|----------|---------------|--------------------|--------------------|------------------|-------------------|--------------|
| appName                      | appId                      | appDuration | Duration | Duration_pred | Duration_supported | fraction_supported | appDuration_pred | speedup           | wasPredicted |
|------------------------------|----------------------------|-------------|----------|---------------|--------------------|--------------------|------------------|-------------------|--------------|
| qual_20240607155643_e91fB6D3 | application_1686676198xxxx |      887621 |   820739 |        116175 |             820739 | 0.9246502730331977 |           183057 | 4.848855613929893 | True         |
|------------------------------|----------------------------|-------------|----------|---------------|--------------------|--------------------|------------------|-------------------|--------------|
| NDS - Power Run              | application_1715312822xxxx |       46911 |        0 |             0 |                  0 |                0.0 |            46911 |               1.0 | False        |
|------------------------------|----------------------------|-------------|----------|---------------|--------------------|--------------------|------------------|-------------------|--------------|
| NDS - Power Run              | application_1715312822xxxx |       30507 |        0 |             0 |                  0 |                0.0 |            30507 |               1.0 | False        |
|------------------------------|----------------------------|-------------|----------|---------------|--------------------|--------------------|------------------|-------------------|--------------|
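
A consumer of per_app.csv could use the wasPredicted column to separate real predictions from defaults. The sketch below assumes the file is plain comma-separated text with the boolean written as `True`/`False`; the sample is reduced to a few of the columns shown above, and `unpredicted_apps` is a hypothetical helper name.

```python
import csv
import io

# Reduced sample with a subset of the per_app.csv columns shown above.
SAMPLE = """appName,appId,appDuration,speedup,wasPredicted
qual_20240607155643_e91fB6D3,application_1686676198xxxx,887621,4.848855613929893,True
NDS - Power Run,application_1715312822xxxx,46911,1.0,False
"""


def unpredicted_apps(csv_file):
    """Return the appId of every row whose wasPredicted value is not true."""
    reader = csv.DictReader(csv_file)
    return [
        row["appId"]
        for row in reader
        if row["wasPredicted"].strip().lower() != "true"
    ]


missing = unpredicted_apps(io.StringIO(SAMPLE))
```

Filtering on wasPredicted rather than on `speedup == 1.0` avoids misclassifying an app whose model prediction happens to be exactly 1.0.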