NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

Skip Cluster Inference when CSP CLIs are missing or not configured #1035

Closed parthosa closed 1 month ago

parthosa commented 1 month ago

Fixes #1034. Currently, in the Python user tools, the cluster inference step always attempts to create a CPU cluster object based on the cluster shape, automatically providing cost savings even if the user has not specified a cluster. However, this requires the CSP CLIs to be installed and configured; otherwise, the tool crashes.

This PR addresses this issue by handling the error case, warning the user, and continuing to generate the speedup summary.

Changes

Python/User Tools

  1. Added error handling around cluster inference to log an appropriate message and skip the inference step instead of crashing.
  2. Since the cpuClusterProxy context will not be created in such cases, cost savings will be automatically disabled.
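The error-handling pattern described above can be sketched roughly as follows. This is a simplified illustration, not the actual PR code: `infer_cpu_cluster_safely`, `infer_fn`, and `cluster_info_df` are hypothetical names standing in for the real `ClusterInference.infer_cpu_cluster` call and its input dataframe.

```python
import logging

logger = logging.getLogger('rapids.tools.qualification')


def infer_cpu_cluster_safely(infer_fn, cluster_info_df):
    """Attempt cluster inference; on failure, log an error and return None
    so the tool skips cost savings instead of crashing.

    infer_fn and cluster_info_df are placeholders for the actual
    ClusterInference call and its input (hypothetical, for illustration).
    """
    try:
        return infer_fn(cluster_info_df)
    except Exception as e:  # e.g. RuntimeError when the CSP CLI is missing
        logger.error('Unable to process cluster information. Cost savings '
                     'will be disabled. Reason - %s:%s',
                     type(e).__name__, e)
        return None


def failing_inference(_df):
    # Simulates the CSP CLI being unavailable.
    raise RuntimeError('Error invoking CMD')


# When inference fails, we get None back and execution continues;
# downstream cost-savings logic is gated on the cluster object existing.
cpu_cluster = infer_cpu_cluster_safely(failing_inference, None)
if cpu_cluster is None:
    print('Cost savings disabled; continuing with speedup summary')
```

The key design point is that the failure is contained at the inference boundary: the speedup summary generation that follows never depends on the CPU cluster object being present.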

Output

CMD:

spark_rapids qualification --eventlogs <eventlog> --platform <platform> --tools_jar <jar> 

CLI not installed

New Output (Continues generating speedups)
2024-05-23 15:35:34,687 ERROR rapids.tools.qualification: Unable to process cluster information. Cost savings will be disabled. Reason - RuntimeError:Error invoking CMD :
    | /bin/bash: line 1: aws: command not found
2024-05-23 15:35:34,731 INFO rapids.tools.qualification: Generating GPU Estimated Speedup: as /home/ubuntu/tools_run/qual_20240523153451_6C7D6D6f/qualification_summary.csv
Previous Output (Tool Crashes)
2024-05-23 15:38:39,438 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 114, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 246, in _collect_result
    self._process_output()
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 867, in _process_output
    self.__infer_cluster_and_update_savings(cluster_info_df)
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 927, in __infer_cluster_and_update_savings
    cpu_cluster_obj = ClusterInference(platform=self.ctxt.platform).infer_cpu_cluster(cluster_info_df)
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/cluster_inference.py", line 83, in infer_cpu_cluster
    return self.platform.load_cluster_by_prop(cluster_props_new, is_inferred=True)
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 818, in load_cluster_by_prop
    return self._construct_cluster_from_props(cluster=cluster,
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/emr.py", line 74, in _construct_cluster_from_props
    return EMRCluster(self, is_inferred=is_inferred).set_connection(cluster_id=cluster, props=props)
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 992, in set_connection
    self._init_nodes()
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/emr.py", line 460, in _init_nodes
    self.nodes = self.__create_node_from_instances()
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/emr.py", line 429, in __create_node_from_instances
    c_node.fetch_and_set_hw_info(self.cli)
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 183, in fetch_and_set_hw_info
    self._pull_and_set_mc_props(cli)
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 162, in _pull_and_set_mc_props
    instances_description = cli.exec_platform_describe_node_instance(self) if cli else None
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 536, in exec_platform_describe_node_instance
    self.instance_descriptions_cache[key] = self._exec_platform_describe_node_instance(node)
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/emr.py", line 246, in _exec_platform_describe_node_instance
    raw_instance_descriptions = super()._exec_platform_describe_node_instance(node)
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 526, in _exec_platform_describe_node_instance
    return self.run_sys_cmd(cmd_params)
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 473, in run_sys_cmd
    return sys_cmd.exec()
  File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/utilities.py", line 333, in exec
    raise RuntimeError(f'{cmd_err_msg}')
RuntimeError: Error invoking CMD :
    | /bin/bash: line 1: aws: command not found

Processing Completed!

CLI not configured

New Output (Continues generating speedups)
2024-05-23 08:55:44,709 ERROR rapids.tools.qualification: Unable to process cluster information. Cost savings will be disabled. Reason - RuntimeError:Error invoking CMD :
    | Reauthentication required.
    | ERROR: (gcloud.compute.machine-types.describe) There was a problem refreshing your current auth tokens: Reauthentication failed. Please run `gcloud auth login` to complete reauthentication with SAML.
    | Please run:
    |
    |   $ gcloud auth login
    |
    | to obtain new credentials.
    |
    | If you have already logged in with a different account, run:
    |
    |   $ gcloud config set account ACCOUNT
    |
    | to select an already authenticated account to use.
2024-05-23 08:55:44,732 INFO rapids.tools.qualification: Generating GPU Estimated Speedup: as /Users/psarthi/Work/tools-run/qual_20240523155525_a1c049a5/qualification_summary.csv
Previous Output (Tool Crashes)
2024-05-23 08:56:38,766 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 114, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 246, in _collect_result
    self._process_output()
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 867, in _process_output
    self.__infer_cluster_and_update_savings(cluster_info_df)
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 927, in __infer_cluster_and_update_savings
    cpu_cluster_obj = ClusterInference(platform=self.ctxt.platform).infer_cpu_cluster(cluster_info_df)
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/cluster_inference.py", line 83, in infer_cpu_cluster
    return self.platform.load_cluster_by_prop(cluster_props_new, is_inferred=True)
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 818, in load_cluster_by_prop
    return self._construct_cluster_from_props(cluster=cluster,
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/dataproc.py", line 86, in _construct_cluster_from_props
    return DataprocCluster(self, is_inferred=is_inferred).set_connection(cluster_id=cluster, props=props)
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 992, in set_connection
    self._init_nodes()
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/dataproc.py", line 430, in _init_nodes
    worker.fetch_and_set_hw_info(self.cli)
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 183, in fetch_and_set_hw_info
    self._pull_and_set_mc_props(cli)
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 162, in _pull_and_set_mc_props
    instances_description = cli.exec_platform_describe_node_instance(self) if cli else None
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 536, in exec_platform_describe_node_instance
    self.instance_descriptions_cache[key] = self._exec_platform_describe_node_instance(node)
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 526, in _exec_platform_describe_node_instance
    return self.run_sys_cmd(cmd_params)
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 473, in run_sys_cmd
    return sys_cmd.exec()
  File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/utilities.py", line 333, in exec
    raise RuntimeError(f'{cmd_err_msg}')
RuntimeError: Error invoking CMD :
    | Reauthentication required.
    | ERROR: (gcloud.compute.machine-types.describe) There was a problem refreshing your current auth tokens: Reauthentication failed. Please run `gcloud auth login` to complete reauthentication with SAML.
    | Please run:
    |
    |   $ gcloud auth login
    |
    | to obtain new credentials.
    |
    | If you have already logged in with a different account, run:
    |
    |   $ gcloud config set account ACCOUNT
    |
    | to select an already authenticated account to use.

Processing Completed!

Behaviour

This PR does not change the behaviour when the user explicitly requests cost savings by providing a cluster. Example:

spark_rapids qualification --cluster <cluster name> --eventlogs <eventlogs> --filter_apps SAVINGS