Open estebanmodmed opened 1 week ago
Hi @estebanmodmed,
It seems the path you provided may be incomplete, which is causing the tool to read partial event logs. Databricks stores event logs in a rolling manner as:
ls -l logs/<cluster-id>/eventlog/<cluster-id>_<some-id>/<some-id>
eventlog
eventlog-2024-02-20--04-50.gz
eventlog-2024-02-20--05-00.gz
eventlog-2024-02-20--05-10.gz
eventlog-2024-02-20--05-20.gz
To fix this, I would recommend using the parent directory instead of pointing directly to a specific eventlog file.
Recommended CMD:
spark_rapids qualification --platform databricks-aws --eventlogs logs/cluster_id/eventlog/cluster_id_10_69_238_61/some_id
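As a quick sanity check before running the tool, the idea of pointing at the parent directory rather than at a single rolled file can be sketched in Python. This is only an illustration: `resolve_eventlog_dir` and `list_rolled_logs` are hypothetical helpers, not part of `spark_rapids`, and the sketch assumes the rolled files are named `eventlog` or `eventlog-<timestamp>.gz` as in the listing above.

```python
from pathlib import Path
from typing import List


def resolve_eventlog_dir(path: str) -> Path:
    """Return the directory that holds the rolled event log files.

    Assumption: rolled files start with the prefix "eventlog",
    as in the directory listing shown in this thread.
    """
    p = Path(path)
    # If the path points at a single rolled file, step up to its parent directory.
    if p.is_file() and p.name.startswith("eventlog"):
        p = p.parent
    if not p.is_dir():
        raise FileNotFoundError(f"{p} is not a directory")
    return p


def list_rolled_logs(directory: Path) -> List[str]:
    """List the names of the rolled event log files in a directory, sorted."""
    return sorted(
        f.name
        for f in directory.iterdir()
        if f.is_file() and f.name.startswith("eventlog")
    )
```

If `list_rolled_logs` returns more than one file, passing the directory returned by `resolve_eventlog_dir` to `--eventlogs` should let the tool see the complete set of logs rather than a single slice.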
The application seems to have run on Databricks Fleet instances ('rd-fleet.8xlarge'). Currently, we don't support fleet instances, but we will update our catalog to include them. However, this is mostly a log message and is not related to the tool’s failure.
With the recommended CMD and path, you should be able to run the tool and get a speedup estimate and a recommendation for the cluster shape.
Describe the bug
I'm running the qualification tool on an event log generated by a Databricks Workflow Job run.
I'm getting the following errors when using the qualification tool:
Processing...⣟
2024-10-24 10:48:43,531 ERROR rapids.tools.qualification: Failed to execute the prediction model. Using default speed up of 1.0 for all apps. Reason - KeyError:'startTime'
ERROR: Could not find elements [('rd-fleet.8xlarge',)]
2024-10-24 10:48:43,542 ERROR rapids.tools.cluster_inference: Error while inferring cluster: Instance type rd-fleet.8xlarge is not found in catalog.
Processing...⡿
2024-10-24 10:48:43,609 ERROR rapids.tools.AdditionalHeuristics: Cannot apply heuristics for qualification. Reason - FileNotFoundError:[Errno 2] No such file or directory: '/Users/username/repos/nvidia-rapids/qual_20241024134808_8B440b4b/rapids_4_spark_qualification_output/raw_metrics/app-20241022192347-0000/stage_level_aggregated_task_metrics.csv'
After these errors are logged, the tool still generates the report, but it indicates there are no compatible apps.
Steps/Code to reproduce bug
Expected behavior
No errors and a recommendation about the cluster shape I should use to improve performance.
Environment details (please complete the following information)
Additional context
No additional context.