NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

Unused timeout argument in SysCmd Initialization #542

Open parthosa opened 1 year ago

parthosa commented 1 year ago

Description:

The timeout variable is not passed when constructing an instance of SysCmd. The timeout is needed for ssh and scp commands, such as when copying resources during the diagnostic tool run.

Files:

Initialization of timeout

https://github.com/NVIDIA/spark-rapids-tools/blob/24be442afcae256c695143ebcfae6b02e77c57d4/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py#L333-L338

Construction of SysCmd

https://github.com/NVIDIA/spark-rapids-tools/blob/24be442afcae256c695143ebcfae6b02e77c57d4/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py#L503-L511

amahussein commented 1 year ago

@parthosa what is the resolution of this issue? Are you suggesting that we revisit all call sites to make sure that the timeout is set? If so, how do we define a suitable timeout? I think that we should not set a timeout when copying results/eventlogs, because we cannot define a timeout that fits all job sizes.

parthosa commented 1 year ago

There are two issues here:

  1. Currently, all classes inheriting from CMDDriverBase have their timeout value set to 0 by default. We should pass this value when calling SysCmd() in run_sys_cmd().
  2. For specific cases like ssh and scp, we should add a timeout argument to the run_sys_cmd() function. This would override the class-scoped timeout variable for those operations.
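A minimal sketch of what the two points above could look like. The class and method names (SysCmd, CMDDriverBase, run_sys_cmd) mirror the real ones in sp_types.py, but the bodies here are simplified stand-ins, not the actual implementation; the mapping of a 0 timeout to "no timeout" is an assumption based on the class default described above.

```python
import subprocess


class SysCmd:
    """Simplified stand-in for the real SysCmd wrapper (hypothetical body)."""

    def __init__(self, cmd_args, timeout_secs=0):
        self.cmd_args = cmd_args
        self.timeout_secs = timeout_secs

    def exec(self) -> str:
        # subprocess.run treats timeout=None as "wait forever";
        # map the class convention of 0 (no timeout set) to None.
        timeout = self.timeout_secs if self.timeout_secs > 0 else None
        result = subprocess.run(self.cmd_args, capture_output=True,
                                text=True, timeout=timeout, check=True)
        return result.stdout


class CMDDriverBase:
    """Simplified stand-in for the real driver base class."""

    def __init__(self, timeout: int = 0):
        self.timeout = timeout  # class-scoped default (point 1)

    def run_sys_cmd(self, cmd_args, timeout: int = None) -> str:
        # Point 2: a per-call timeout (e.g. for ssh/scp) overrides
        # the class-scoped default; otherwise the default is forwarded
        # to SysCmd instead of being silently dropped (point 1).
        effective = timeout if timeout is not None else self.timeout
        return SysCmd(cmd_args, timeout_secs=effective).exec()
```

With this shape, a call like `driver.run_sys_cmd(scp_args, timeout=30)` bounds only the copy operation, while result/eventlog transfers can keep the unbounded default, which also addresses the concern about job sizes above.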