Background
Currently, the grid engine autoscaling process does not handle the following issues:
API token expiration
The grid engine autoscaler keeps using the API token it was launched with. If that token expires, the autoscaler halts until it is restarted with a valid API token. The autoscaler should refresh its API token automatically.
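One possible shape for the refresh logic is sketched below. It is only an illustration: the `RefreshingToken` class, the `fetch` callable, and the five-minute margin are assumptions and not part of the existing autoscaler. `fetch` stands in for whatever call issues a new token and is expected to return the token together with its expiration time as a Unix timestamp.

```python
import time
from typing import Callable, Optional, Tuple


class RefreshingToken:
    """Caches an API token and renews it shortly before it expires.

    Hypothetical helper: `fetch` must return (token, expires_at), where
    expires_at is a Unix timestamp in seconds.
    """

    def __init__(self,
                 fetch: Callable[[], Tuple[str, float]],
                 margin_seconds: float = 300.0,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a valid token, fetching a new one when the cached
        token is missing or within `margin_seconds` of expiring."""
        expiring = self._clock() >= self._expires_at - self._margin
        if self._token is None or expiring:
            self._token, self._expires_at = self._fetch()
        return self._token
```

With a helper like this, every API call would ask for `token.get()` instead of reading a value captured at launch, so an expired token is replaced without restarting the autoscaler.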
Logging failure
In specific cases the grid engine autoscaler may crash because a logging call fails. No logging failure should lead to a crash.
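A minimal sketch of crash-safe logging follows, assuming the autoscaler logs through Python's standard `logging` module. The `safe_log` function is a hypothetical name introduced here for illustration. It catches any exception raised while a record is being emitted, for example by a custom handler that sends logs over the network, and reports failure through its return value instead.

```python
import logging


def safe_log(logger: logging.Logger, level: int, message: str, *args) -> bool:
    """Log a message without ever raising.

    Returns True if the logging call completed and False if it raised.
    Hypothetical helper, not part of the existing autoscaler.
    """
    try:
        logger.log(level, message, *args)
        return True
    except Exception:
        # A logging failure must not stop the autoscaler, so the error
        # is swallowed and only signalled through the return value.
        return False
```

The return value lets the caller count or report failed log calls through another channel if that turns out to be useful.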
SSH connection timeout
When a worker node attaches to a cluster, it uses SSH to register itself within the cluster. SSH connections may sometimes time out due to DNS and other issues. We should always retry SSH connections.
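The retry behaviour could look roughly like the sketch below. The `retry` function, its parameter names, and the default attempt count and delays are assumptions chosen for illustration. The SSH call itself is passed in as `action`, so the sketch does not depend on how the worker actually opens the connection.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T],
          attempts: int = 5,
          delay_seconds: float = 2.0,
          backoff: float = 2.0,
          retry_on: Tuple[Type[BaseException], ...] = (Exception,),
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed.

    The wait between calls starts at `delay_seconds` and is multiplied
    by `backoff` after each failure. The last error is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay_seconds)
            delay_seconds *= backoff
    raise AssertionError("unreachable")
```

The number of attempts is bounded here so that a permanently unreachable host eventually produces an error instead of blocking the worker forever. Whether that limit is acceptable, and which exception types should trigger a retry, depends on how the SSH connection is made.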