epam / cloud-pipeline

Cloud agnostic genomics analysis, scientific computation and storage platform
https://cloud-pipeline.com
Apache License 2.0

Increase grid engine autoscaling process robustness #3583

Open tcibinan opened 4 days ago

tcibinan commented 4 days ago

Background

Currently, the grid engine autoscaling process does not handle the following issues:

API token expiration

The grid engine autoscaler keeps using the API token it was launched with. If that token expires, the autoscaler halts until it is restarted with a valid API token. The autoscaler should refresh its API token automatically.
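One possible shape for the refresh is sketched below. It is only an illustration, not the autoscaler's actual code: `RefreshingToken`, the `fetch` callable (assumed to return a new token together with its expiry time in epoch seconds), and the `margin_seconds` parameter are all hypothetical names. How a fresh token is actually obtained from the Cloud Pipeline API is left out.

```python
import time
from typing import Callable, Optional, Tuple


class RefreshingToken:
    """Caches an API token and renews it shortly before it expires."""

    def __init__(self,
                 fetch: Callable[[], Tuple[str, float]],
                 margin_seconds: float = 300.0,
                 clock: Callable[[], float] = time.time):
        # fetch is a hypothetical callable returning (token, expires_at_epoch_seconds).
        self._fetch = fetch
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Renew when there is no token yet or the cached one is about to expire.
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token, self._expires_at = self._fetch()
        return self._token
```

Each API call would then read the token through `get()` instead of holding on to the launch-time value, so an expired token is replaced on the next request rather than after a restart.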

Logging failure

In specific cases, the grid engine autoscaler may crash because a logging call fails. No logging failure should lead to a crash.
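A minimal sketch of one way to isolate logging failures, assuming the autoscaler logs through Python's standard `logging` module and that the crash comes from a handler whose `emit` raises. `SafeHandler` is a hypothetical wrapper, not an existing class in the project.

```python
import logging


class SafeHandler(logging.Handler):
    """Delegates to another handler and swallows any error it raises."""

    def __init__(self, inner: logging.Handler):
        super().__init__(level=inner.level)
        self._inner = inner

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self._inner.handle(record)
        except Exception:
            # A logging failure must never take the autoscaler down,
            # so the record is dropped instead of propagating the error.
            pass
```

Wrapping every handler at registration time, for example `logger.addHandler(SafeHandler(handler))`, keeps the protection in one place instead of adding a `try`/`except` around each logging call.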

SSH connection timeout

When a worker node attaches to a cluster, it uses SSH to register itself within the cluster. SSH connections may occasionally time out because of DNS or other issues. We should always retry SSH connections.
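The retry could look roughly like the sketch below. It assumes the registration command is run through the OpenSSH client via `subprocess`. `retry` and `run_over_ssh` are hypothetical helpers, and the attempt count, delays, and timeouts are placeholder values rather than settings taken from the project.

```python
import subprocess
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 5, delay: float = 2.0,
          backoff: float = 2.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Calls action until it succeeds or the attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= backoff  # wait longer before each subsequent attempt
    raise ValueError("attempts must be at least 1")


def run_over_ssh(host: str, command: Sequence[str], timeout: float = 30.0) -> str:
    """Runs a command on host with the OpenSSH client and returns its stdout."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, *command],
        check=True, capture_output=True, text=True, timeout=timeout)
    return result.stdout
```

A call such as `retry(lambda: run_over_ssh(host, command))` would then retry timeouts and non-zero exit codes alike. The sketch bounds the number of attempts so that a permanently unreachable node eventually surfaces an error. Whether retries should instead be unbounded is a design decision for the issue.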