Open parthosa opened 1 year ago
@parthosa what is the resolution of this issue? Are you suggesting that we revisit all call sites to make sure that the timeout is set? If so, how do we define a suitable timeout? I think that we should not set timeout copying results/eventlogs because we cannot define a timeout that fits all job sizes.
There are two issues here:
CMDDriverBase
have their timeout value set to 0 by default. We should pass this value when calling SysCmd()
in run_sys_cmd()
.ssh
and scp
, we should add a timeout
variable in run_sys_cmd()
function. This would override the class-scoped timeout
variable for these operations.
Description:
The
timeout
variable is not being passed when constructing an instance ofSysCmd
. This should be required forssh
orscp
commands, such as when copying resources during the diagnostic tool.Files:
Initialization of
timeout
https://github.com/NVIDIA/spark-rapids-tools/blob/24be442afcae256c695143ebcfae6b02e77c57d4/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py#L333-L338
Construction of
SysCmd
https://github.com/NVIDIA/spark-rapids-tools/blob/24be442afcae256c695143ebcfae6b02e77c57d4/user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py#L503-L511