Background
At the moment, Cloud Pipeline allows setting the local disk size of compute nodes at job startup time.
In certain use cases, a job may fill the disk and fail with out-of-disk errors.
The same issue is observed for out-of-memory errors. While we do have SWAP support, the swap size is configured as a static value at job start time.
To address these issues, we should consider automatic scaling (up) of the filesystem/swap space according to workload demand.
Approach
Implement API methods:
Increase the disk by a specified delta for the specified run (shall be supported for AWS/GCP; for Azure it shall throw an "operation not supported" error)
Attach a new disk of a specified size to the specified run (shall be supported for AWS/GCP/Azure)
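The provider support described for the two API methods could be checked along these lines. This is only a sketch: `CloudProvider`, `OperationNotSupportedError`, and `check_disk_operation` are hypothetical names chosen for illustration, not existing Cloud Pipeline classes or endpoints.

```python
from enum import Enum


class CloudProvider(Enum):
    AWS = "AWS"
    GCP = "GCP"
    AZURE = "AZURE"


class OperationNotSupportedError(Exception):
    """Hypothetical error for an operation a provider does not support."""


# Assumption taken from the proposal: resize works on AWS/GCP only,
# attach works on all three providers.
RESIZE_PROVIDERS = {CloudProvider.AWS, CloudProvider.GCP}
ATTACH_PROVIDERS = {CloudProvider.AWS, CloudProvider.GCP, CloudProvider.AZURE}


def check_disk_operation(provider: CloudProvider, operation: str) -> None:
    """Raise if `operation` ("resize" or "attach") is unsupported for `provider`."""
    if operation == "resize":
        supported = RESIZE_PROVIDERS
    elif operation == "attach":
        supported = ATTACH_PROVIDERS
    else:
        raise ValueError(f"Unknown disk operation: {operation}")
    if provider not in supported:
        raise OperationNotSupportedError(
            f"{operation} is not supported for {provider.value}"
        )
```

With this check, a resize request for an Azure run would fail fast with the "operation not supported" error, while an attach request would pass for any of the three providers.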
Add a number of preferences to control the scaling behavior:
If the threshold (cluster.instance.hdd.scale.threshold.ratio) is reached for the FS disk, increase its size and resize the FS (by cluster.instance.hdd.scale.delta.ratio times)
If the threshold (cluster.instance.swap.scale.threshold.ratio) is reached for the SWAP disk, attach a new disk (of the same size as the existing swap volume) and enable it for swapping
cluster.instance.hdd.scale.threshold.ratio - float, default: 0.75
cluster.instance.hdd.scale.delta.ratio - float, default: 0.5
cluster.instance.swap.scale.threshold.ratio - float, default: 0.75
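The scaling rules described above can be sketched as a pair of small decision functions. This is a minimal illustration, not the actual implementation: the function and class names are hypothetical, the constants mirror the proposed preference defaults, and it assumes that "by delta.ratio times" means growing the disk by `delta_ratio` times its current size.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Defaults mirroring the proposed preferences.
HDD_SCALE_THRESHOLD_RATIO = 0.75   # cluster.instance.hdd.scale.threshold.ratio
HDD_SCALE_DELTA_RATIO = 0.5        # cluster.instance.hdd.scale.delta.ratio
SWAP_SCALE_THRESHOLD_RATIO = 0.75  # cluster.instance.swap.scale.threshold.ratio


@dataclass
class ScaleAction:
    kind: str     # "resize_fs" or "attach_swap" (hypothetical labels)
    size_gb: int  # delta to add (FS) or size of the new disk (swap)


def plan_fs_scaling(used_gb: float, total_gb: float,
                    threshold: float = HDD_SCALE_THRESHOLD_RATIO,
                    delta_ratio: float = HDD_SCALE_DELTA_RATIO) -> Optional[ScaleAction]:
    """Return a resize action once FS usage reaches the threshold, else None."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    if used_gb / total_gb < threshold:
        return None
    # Assumption: the delta is a fraction of the current disk size, rounded up.
    delta_gb = max(1, math.ceil(total_gb * delta_ratio))
    return ScaleAction("resize_fs", delta_gb)


def plan_swap_scaling(used_gb: float, total_gb: float, existing_volume_gb: int,
                      threshold: float = SWAP_SCALE_THRESHOLD_RATIO) -> Optional[ScaleAction]:
    """Return an attach action once swap usage reaches the threshold, else None."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    if used_gb / total_gb < threshold:
        return None
    # The new swap disk has the same size as the existing swap volume.
    return ScaleAction("attach_swap", existing_volume_gb)
```

Under these assumptions and the default values, a 100 GB disk with 80 GB used would be grown by 50 GB, and a 4 GB swap volume with 3 GB in use would trigger attaching another 4 GB swap disk.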