epam / cloud-pipeline

Cloud agnostic genomics analysis, scientific computation and storage platform
https://cloud-pipeline.com
Apache License 2.0

Implement compute nodes' disks/swap autoscaling #837

Open sidoruka opened 4 years ago

sidoruka commented 4 years ago

Background

At the moment, Cloud Pipeline allows setting the compute nodes' local disk size at job startup time. For certain use cases, users may fill the disk and get out-of-disk errors, leading to job failure. The same issue is observed for out-of-memory errors: while we do have SWAP support, it is configured as a static value at job start time.

To address these issues, we shall introduce automatic scaling (up) of the filesystem/swap space according to workload demand.

Approach

  1. Implement API methods (see the first sketch after this list):
    • Increase the disk by a specified delta for the specified run (shall be supported for AWS/GCP; for Azure it shall throw an "operation not supported" error)
    • Attach a new disk of a specified size to the specified run (shall be supported for AWS/GCP/Azure)
  2. Add a number of preferences to control the scaling behavior:
    • cluster.instance.hdd.scale.threshold.ratio - float, default: 0.75
    • cluster.instance.hdd.scale.delta.ratio - float, default: 0.5
    • cluster.instance.swap.scale.threshold.ratio - float, default: 0.75
  3. Implement a daemonset that will:
    • Monitor compute node's disk/swap
    • If a threshold (cluster.instance.hdd.scale.threshold.ratio) is reached for the FS disk - increase its size and resize the FS (by cluster.instance.hdd.scale.delta.ratio times)
    • If a threshold is reached for the SWAP disk - attach a new disk (of the same size as the existing swap volume) and enable it for swapping (see the second sketch after this list)
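
To make item 1 more concrete, below is a minimal sketch of what the AWS side of these API methods could look like, assuming boto3 and assuming the run's EBS volume ID, instance ID and availability zone are already known to the API service. The function names (`increase_disk`, `attach_new_disk`) and the device name are hypothetical illustrations, not the actual Cloud Pipeline API:

```python
# Hypothetical sketch of the AWS calls behind the proposed API methods.
# Assumes boto3 credentials/region are configured and the run's volume/instance IDs are known.
import boto3

ec2 = boto3.client("ec2")

def increase_disk(volume_id, delta_gb):
    """Grow an existing EBS volume by delta_gb; the FS itself is resized on the node."""
    current_size = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]["Size"]
    ec2.modify_volume(VolumeId=volume_id, Size=current_size + delta_gb)

def attach_new_disk(instance_id, availability_zone, size_gb, device="/dev/xvdz"):
    """Create a fresh EBS volume and attach it to the run's instance."""
    volume = ec2.create_volume(AvailabilityZone=availability_zone,
                               Size=size_gb, VolumeType="gp2")
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
    ec2.attach_volume(VolumeId=volume["VolumeId"],
                      InstanceId=instance_id, Device=device)
    return volume["VolumeId"]
```

For Azure, `increase_disk` would raise the "operation not supported" error mentioned above, while `attach_new_disk` would go through the Azure disks API instead.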
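And a rough sketch of the monitoring loop the daemonset from item 3 could run on each compute node, assuming the threshold/delta preferences from item 2 are delivered to the node as plain values, and that the `scale_up_disk`/`attach_swap_disk` callbacks wrap the API methods from item 1 (both names are hypothetical here):

```python
# Hypothetical monitoring loop for the proposed daemonset.
# The constants mirror the preferences from item 2.
import shutil
import time

HDD_THRESHOLD = 0.75   # cluster.instance.hdd.scale.threshold.ratio
HDD_DELTA = 0.5        # cluster.instance.hdd.scale.delta.ratio
SWAP_THRESHOLD = 0.75  # cluster.instance.swap.scale.threshold.ratio

def swap_usage_ratio():
    """Read swap usage (used / total) from /proc/meminfo; 0.0 if no swap is configured."""
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])  # values are reported in kB
    total = meminfo.get("SwapTotal", 0)
    return 0.0 if total == 0 else (total - meminfo["SwapFree"]) / total

def monitor(scale_up_disk, attach_swap_disk, mount_point="/", interval_sec=60):
    """scale_up_disk/attach_swap_disk are callbacks wrapping the API methods from item 1."""
    while True:
        usage = shutil.disk_usage(mount_point)
        if usage.used / usage.total >= HDD_THRESHOLD:
            # Grow the volume by HDD_DELTA times its current size, then resize the FS.
            delta_gb = int(usage.total / 1024 ** 3 * HDD_DELTA)
            scale_up_disk(delta_gb)
        if swap_usage_ratio() >= SWAP_THRESHOLD:
            # Attach a new disk of the same size as the current swap volume and enable it.
            attach_swap_disk()
        time.sleep(interval_sec)
```

With the default ratios, a 100 GB filesystem that crosses 75% usage would be grown by 50 GB (0.5 × its current size) to 150 GB.
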
tcibinan commented 4 years ago

Pull request #855 was cherry-picked to release/0.15 via 6b27ba22b878027c37fc0679f4be7ef48a078f35.

tcibinan commented 4 years ago

Pull request #913 was cherry-picked to release/0.15 via 4453ee5841edbb5904481f81b824ae363d00271d.