NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[BUG] ERROR rapids.tools.qualification: Failed to execute the prediction model #1393

Open estebanmodmed opened 1 week ago

estebanmodmed commented 1 week ago

Describe the bug

I'm using the qualification tool over an eventlog generated by the execution of a Databricks Workflow Job.

I'm getting the following errors when using the qualification tool:

    2024-10-24 10:48:43,531 ERROR rapids.tools.qualification: Failed to execute the prediction model. Using default speed up of 1.0 for all apps. Reason - KeyError:'startTime'
    ERROR: Could not find elements [('rd-fleet.8xlarge',)]
    2024-10-24 10:48:43,542 ERROR rapids.tools.cluster_inference: Error while inferring cluster: Instance type rd-fleet.8xlarge is not found in catalog.
    2024-10-24 10:48:43,609 ERROR rapids.tools.AdditionalHeuristics: Cannot apply heuristics for qualification. Reason - FileNotFoundError:[Errno 2] No such file or directory: '/Users/username/repos/nvidia-rapids/qual_20241024134808_8B440b4b/rapids_4_spark_qualification_output/raw_metrics/app-20241022192347-0000/stage_level_aggregated_task_metrics.csv'

After the error is thrown, the tool generates the report but indicates there are no compatible apps.

Steps/Code to reproduce bug

Expected behavior

No errors and a recommendation about the cluster shape I should use to improve performance.

Environment details (please complete the following information)

Additional context

No additional context.

parthosa commented 1 week ago

Hi @estebanmodmed,

  1. It seems the path you provided may be incomplete, which causes the tool to read only part of the event logs. Databricks stores event logs in a rolling manner as:

    ls -l logs/<cluster-id>/eventlog/<cluster-id>_<some-id>/<some-id>
    eventlog
    eventlog-2024-02-20--04-50.gz
    eventlog-2024-02-20--05-00.gz
    eventlog-2024-02-20--05-10.gz
    eventlog-2024-02-20--05-20.gz

    To fix this, I would recommend using the parent directory instead of pointing directly to a specific eventlog file.

    Recommended CMD:

    spark_rapids qualification --platform databricks-aws --eventlogs logs/cluster_id/eventlog/cluster_id_10_69_238_61/some_id
  2. The application seems to have run on Databricks Fleet instances ('rd-fleet.8xlarge'). We don't currently support fleet instances, but we will update our catalog to include them. However, this is mostly a log message and is not related to the tool’s failure.

With the recommended CMD and path, you should be able to run the tool and get a speedup estimate and a recommendation for the cluster shape.
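If you script your runs, the check in point 1 can be sketched roughly as below. This is only an illustration: `resolve_eventlog_dir` and `rolled_parts` are hypothetical helper names, not part of spark-rapids-tools, and the sketch assumes every rolled part has a file name starting with `eventlog`, as in the listing above.

```python
from pathlib import Path
from typing import List


def resolve_eventlog_dir(path: str) -> Path:
    """Return the directory to pass to --eventlogs.

    Hypothetical helper: if `path` points at a single rolled part
    (a file whose name starts with "eventlog"), return its parent
    directory so that all parts are read together. Otherwise return
    the path unchanged.
    """
    p = Path(path)
    if p.is_file() and p.name.startswith("eventlog"):
        return p.parent
    return p


def rolled_parts(directory: Path) -> List[str]:
    """List the rolled event log parts found directly in `directory`."""
    return sorted(
        f.name
        for f in directory.iterdir()
        if f.is_file() and f.name.startswith("eventlog")
    )
```

The resolved directory is what would then go after `--eventlogs` in the recommended CMD. Listing the parts first is a quick way to confirm that the directory really contains the full rolling set and not just one file.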