NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0
44 stars, 34 forks

[BUG] Improve handling of Java cmd exceptions in python user tools #1070

Open amahussein opened 1 month ago

amahussein commented 1 month ago

Describe the bug

When the Java command fails to complete, the CLI raises an unhandled `RuntimeError` and prints the full Python traceback instead of a concise error message:

ERROR rapids.tools.qualification: Failed to download dependencies Error invoking CMD <java -XX:+UseG1GC -Xmx8g -cp...

2024-06-04 22:31:23,970 ERROR root: Qualification. Raised an error in phase [Execution]
Traceback (most recent call last):
  File "~/rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 114, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
  File "~/rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 188, in _execute
    self._run_rapids_tool()
  File "~/rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 643, in _run_rapids_tool
    self._submit_jobs()
  File "~/rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 913, in _submit_jobs
    raise ex
  File "~/rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 909, in _submit_jobs
    result = future.result()
  File "~/.pyenv/versions/3.8-dev/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "~/.pyenv/versions/3.8-dev/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "~/.pyenv/versions/3.8-dev/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "~/rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_job.py", line 105, in run_job
    job_output = self._submit_job(cmd_args)
  File "~/rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_job.py", line 151, in _submit_job
    out_std = self.exec_ctxt.platform.cli.run_sys_cmd(cmd=cmd_args,
  File "~/rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py", line 473, in run_sys_cmd
    return sys_cmd.exec()
  File "~/rapids-tools/user_tools/src/spark_rapids_pytools/common/utilities.py", line 333, in exec
    raise RuntimeError(f'{cmd_err_msg}')
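
The traceback shows the `RuntimeError` travelling from the worker thread through `future.result()` and being re-raised unchanged. As a minimal sketch of one alternative (not the project's code), the block below catches the error where the futures are collected, logs a single concise line, and keeps the traceback at debug level. `run_job`, `submit_jobs`, `JobFailedError`, and the logger name are hypothetical names introduced for illustration only.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("example.tools")


class JobFailedError(Exception):
    """Hypothetical exception type describing a failed tool job."""


def run_job(cmd_label: str, should_fail: bool) -> str:
    """Stand-in for the job that invokes the Java command."""
    if should_fail:
        raise RuntimeError(f"Error invoking CMD <{cmd_label}>")
    return f"{cmd_label}: ok"


def submit_jobs(jobs):
    """Run jobs concurrently and return (results, failures).

    Failures are collected as JobFailedError objects instead of being
    re-raised with the raw traceback.
    """
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(run_job, label, fail): label for label, fail in jobs}
        for future in as_completed(futures):
            label = futures[future]
            try:
                results.append(future.result())
            except RuntimeError as ex:
                # One concise line for the user; traceback only at debug level.
                logger.error("Job %s failed: %s", label, ex)
                logger.debug("Traceback for job %s", label, exc_info=True)
                failures.append(JobFailedError(f"{label}: {ex}"))
    return results, failures


if __name__ == "__main__":
    ok, failed = submit_jobs([("job-a", False), ("job-b", True)])
    print(ok, [str(f) for f in failed])
```

Returning the failures rather than raising leaves the caller free to decide whether one failed job should abort the whole run. Whether that fits the tool's execution phases is an open design question.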

To reproduce, run the command, then kill the Java process by sending it a kill signal.
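
The reproduction can be approximated in isolation with a small sketch. It assumes a Python child process as a stand-in for the Java command, and `run_and_kill_child` and `describe_exit` are hypothetical helpers, not part of the tools. On POSIX systems, `subprocess` reports a child killed by a signal as a negative return code, which a wrapper could translate into a short message.

```python
import subprocess
import sys


def run_and_kill_child() -> int:
    """Start a long-running child process, kill it, and return its exit code."""
    # Stand-in for the Java command: a Python child that sleeps for a minute.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    # Popen.kill() sends SIGKILL on POSIX (TerminateProcess on Windows).
    proc.kill()
    # wait() reaps the child and returns its exit status.
    return proc.wait()


def describe_exit(returncode: int) -> str:
    """Turn an exit code into a short, user-facing message."""
    if returncode == 0:
        return "command completed successfully"
    if returncode < 0:
        # On POSIX, death by signal is reported as the negated signal number.
        return f"command was terminated by signal {-returncode}"
    return f"command failed with exit code {returncode}"


if __name__ == "__main__":
    print(describe_exit(run_and_kill_child()))
```

Distinguishing a signal from an ordinary non-zero exit lets the error message say that the process was killed, rather than only that the command failed.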

### Tasks
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/1088
- [ ] Improve the Python logging. There are issues already open for that purpose.