Fixes #1034. Currently, in the Python user tools, the cluster inference step always attempts to create a CPU cluster object based on the cluster shape, thus automatically providing cost savings even if the user has not specified a cluster. However, this requires the CSP CLI to be installed and configured; otherwise, the tool crashes.
This PR addresses the issue by handling the error case, warning the user, and continuing to generate the speedup summary.
Changes
Python/User Tools
Added error handling around cluster inference to log an appropriate message and skip the inference step.
Since the cpuClusterProxy context will not be created in such cases, cost savings will be automatically disabled.
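The guard can be sketched as below. This is a hypothetical illustration of the pattern only: `infer_cpu_cluster_or_none`, `cli_missing`, and the callback shape are invented names, not the actual code in qualification.py.

```python
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger('rapids.tools.qualification')

def infer_cpu_cluster_or_none(infer_fn, cluster_info):
    """Run cluster inference but degrade gracefully when the CSP CLI
    is missing or unconfigured, instead of crashing the tool."""
    try:
        return infer_fn(cluster_info)
    except RuntimeError as e:
        # Same message shape as the logs below.
        logger.error('Unable to process cluster information. '
                     'Cost savings will be disabled. Reason - %s:%s',
                     type(e).__name__, e)
        return None  # cpuClusterProxy context is never created

def cli_missing(_info):
    # Simulates the failure seen when the AWS CLI is not installed.
    raise RuntimeError('Error invoking CMD :\n'
                       '| /bin/bash: line 1: aws: command not found')

print(infer_cpu_cluster_or_none(cli_missing, {}))              # None
print(infer_cpu_cluster_or_none(lambda i: 'cpu-cluster', {}))  # 'cpu-cluster'
```

With the inference result being `None`, the savings step is skipped and the speedup summary is still written, matching the new outputs below.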
Output
CLI not installed
New Output (Continues generating speedups)
2024-05-23 15:35:34,687 ERROR rapids.tools.qualification: Unable to process cluster information. Cost savings will be disabled. Reason - RuntimeError:Error invoking CMD :
| /bin/bash: line 1: aws: command not found
2024-05-23 15:35:34,731 INFO rapids.tools.qualification: Generating GPU Estimated Speedup: as /home/ubuntu/tools_run/qual_20240523153451_6C7D6D6f/qualification_summary.csv
Previous Output (Tools Crashes)
2024-05-23 15:38:39,438 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 114, in wrapper
func_cb(self, *args, **kwargs) # pylint: disable=not-callable
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 246, in _collect_result
self._process_output()
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 867, in _process_output
self.__infer_cluster_and_update_savings(cluster_info_df)
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 927, in __infer_cluster_and_update_savings
cpu_cluster_obj = ClusterInference(platform=self.ctxt.platform).infer_cpu_cluster(cluster_info_df)
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/cluster_inference.py", line 83, in infer_cpu_cluster
return self.platform.load_cluster_by_prop(cluster_props_new, is_inferred=True)
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 818, in load_cluster_by_prop
return self._construct_cluster_from_props(cluster=cluster,
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/emr.py", line 74, in _construct_cluster_from_props
return EMRCluster(self, is_inferred=is_inferred).set_connection(cluster_id=cluster, props=props)
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 992, in set_connection
self._init_nodes()
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/emr.py", line 460, in _init_nodes
self.nodes = self.__create_node_from_instances()
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/emr.py", line 429, in __create_node_from_instances
c_node.fetch_and_set_hw_info(self.cli)
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 183, in fetch_and_set_hw_info
self._pull_and_set_mc_props(cli)
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 162, in _pull_and_set_mc_props
instances_description = cli.exec_platform_describe_node_instance(self) if cli else None
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 536, in exec_platform_describe_node_instance
self.instance_descriptions_cache[key] = self._exec_platform_describe_node_instance(node)
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/emr.py", line 246, in _exec_platform_describe_node_instance
raw_instance_descriptions = super()._exec_platform_describe_node_instance(node)
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 526, in _exec_platform_describe_node_instance
return self.run_sys_cmd(cmd_params)
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 473, in run_sys_cmd
return sys_cmd.exec()
File "/home/ubuntu/spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/utilities.py", line 333, in exec
raise RuntimeError(f'{cmd_err_msg}')
RuntimeError: Error invoking CMD :
| /bin/bash: line 1: aws: command not found
Processing Completed!
CLI not configured
New Output (Continues generating speedups)
2024-05-23 08:55:44,709 ERROR rapids.tools.qualification: Unable to process cluster information. Cost savings will be disabled. Reason - RuntimeError:Error invoking CMD :
| Reauthentication required.
| ERROR: (gcloud.compute.machine-types.describe) There was a problem refreshing your current auth tokens: Reauthentication failed. Please run `gcloud auth login` to complete reauthentication with SAML.
| Please run:
|
| $ gcloud auth login
|
| to obtain new credentials.
|
| If you have already logged in with a different account, run:
|
| $ gcloud config set account ACCOUNT
|
| to select an already authenticated account to use.
2024-05-23 08:55:44,732 INFO rapids.tools.qualification: Generating GPU Estimated Speedup: as /Users/psarthi/Work/tools-run/qual_20240523155525_a1c049a5/qualification_summary.csv
Previous Output (Tools Crashes)
2024-05-23 08:56:38,766 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 114, in wrapper
func_cb(self, *args, **kwargs) # pylint: disable=not-callable
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 246, in _collect_result
self._process_output()
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 867, in _process_output
self.__infer_cluster_and_update_savings(cluster_info_df)
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 927, in __infer_cluster_and_update_savings
cpu_cluster_obj = ClusterInference(platform=self.ctxt.platform).infer_cpu_cluster(cluster_info_df)
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/cluster_inference.py", line 83, in infer_cpu_cluster
return self.platform.load_cluster_by_prop(cluster_props_new, is_inferred=True)
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 818, in load_cluster_by_prop
return self._construct_cluster_from_props(cluster=cluster,
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/dataproc.py", line 86, in _construct_cluster_from_props
return DataprocCluster(self, is_inferred=is_inferred).set_connection(cluster_id=cluster, props=props)
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 992, in set_connection
self._init_nodes()
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/dataproc.py", line 430, in _init_nodes
worker.fetch_and_set_hw_info(self.cli)
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 183, in fetch_and_set_hw_info
self._pull_and_set_mc_props(cli)
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 162, in _pull_and_set_mc_props
instances_description = cli.exec_platform_describe_node_instance(self) if cli else None
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 536, in exec_platform_describe_node_instance
self.instance_descriptions_cache[key] = self._exec_platform_describe_node_instance(node)
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 526, in _exec_platform_describe_node_instance
return self.run_sys_cmd(cmd_params)
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 473, in run_sys_cmd
return sys_cmd.exec()
File "/Users/psarthi/Work/spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/utilities.py", line 333, in exec
raise RuntimeError(f'{cmd_err_msg}')
RuntimeError: Error invoking CMD :
| Reauthentication required.
| ERROR: (gcloud.compute.machine-types.describe) There was a problem refreshing your current auth tokens: Reauthentication failed. Please run `gcloud auth login` to complete reauthentication with SAML.
| Please run:
|
| $ gcloud auth login
|
| to obtain new credentials.
|
| If you have already logged in with a different account, run:
|
| $ gcloud config set account ACCOUNT
|
| to select an already authenticated account to use.
Processing Completed!
Behaviour
This PR does not change the behaviour when a user explicitly requests cost savings. Example:
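An explicit cost-savings run might look like the following. The flag names are from the tool's CLI as I recall it and may vary by version; the bucket and cluster names are placeholders:

```
# Hypothetical example: explicitly supplying the CPU cluster so that
# cost savings are computed; this path is unchanged by this PR.
spark_rapids_user_tools emr qualification \
  --eventlogs s3://my-bucket/eventlogs/ \
  --cpu_cluster my-emr-cluster
```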