PaddlePaddle / PaddleCloud

PaddlePaddle Docker images and K8s operators for PaddleOCR/Detection developers to use on public/private cloud.

Investigating autoscale trainer job on PaddleCloud #379

Closed Yancey1989 closed 7 years ago

Yancey1989 commented 7 years ago

Autoscaling Trainer job on PaddleCloud

Background

A Paddle training job contains several trainer instances (a Kubernetes Job), several parameter server instances (a Kubernetes ReplicaSet), and, in fault-tolerant mode only, a master process (a Kubernetes ReplicaSet). We hope PaddleCloud can automatically scale the number of trainer instances. This issue discusses how to implement this feature.
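
For illustration, the Kubernetes objects behind a single training job could be listed like this (the label paddle-job=demo is only an assumed naming convention, not something PaddleCloud necessarily sets):

# Hypothetical example: list the trainer Job, the pserver/master ReplicaSets,
# and their Pods for one training job, selected by an assumed label.
kubectl get jobs,replicasets,pods -l paddle-job=demo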

HPA on Kubernetes

With Horizontal Pod Autoscaling (HPA), Kubernetes can automatically scale the number of Pods. Users invoke it like this:

kubectl autoscale rc foo --min=2 --max=5 --cpu-percent=80
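
This keeps the ReplicationController foo between 2 and 5 replicas, targeting 80% average CPU utilization. As a sketch, the command is roughly equivalent to creating the following autoscaling/v1 object by hand:

# Sketch: the autoscaling/v1 object the command above roughly corresponds to.
cat <<EOF | kubectl create -f -
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: foo
spec:
  scaleTargetRef:
    apiVersion: v1
    kind: ReplicationController
    name: foo
  minReplicas: 2
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
EOF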

The HPA then works as follows:

Fetch metrics

From the Kubernetes documentation:

The Horizontal Pod Autoscaler is implemented as a control loop, with a period controlled by the controller manager’s --horizontal-pod-autoscaler-sync-period flag (with a default value of 30 seconds).
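
The period can be shortened if a faster reaction is needed; as a sketch (the exact way the controller manager is launched depends on how the cluster is deployed):

# Sketch: run the controller manager with a 15-second HPA sync period.
# All other required controller-manager flags are omitted here.
kube-controller-manager --horizontal-pod-autoscaler-sync-period=15s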

Problem

Currently, HPA only supports ReplicaSet and Deployment, but the trainer is a Kubernetes Job.

Possible solutions

typhoonzero commented 7 years ago

We need to let HPA scale Jobs.

putcn commented 7 years ago

Can we change the trainers to a Deployment or ReplicaSet instead?

helinwang commented 7 years ago

Scaling Metrics

Background

We can scale jobs (jobs in the general sense, not the Kubernetes Job) in two ways:

  1. with Horizontal Pod Autoscaling (HPA), or
  2. manually, with a custom server.

Both approaches require scaling metrics.

HPA supports per-Pod CPU and memory metrics out of the box. Custom metrics can be collected with some effort.
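
These built-in metrics can be inspected with kubectl top, assuming a metrics backend such as Heapster is running in the cluster (the label selector below is only an assumed convention):

# Show current CPU/memory usage of the trainer Pods.
kubectl top pod -l paddle-job=demo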

CPU and memory usage are probably not good indicators for scaling trainers: each trainer's CPU and memory usage is almost constant, because training is a periodic, iterative computation.

Inference Job Scaling Metrics

We need to scale inference jobs based on the load of each Pod that runs the inference server. Queries per second (QPS) is a good indicator.

Training Job Scaling Metrics

Auto-scaling for training jobs should be based on overall cluster resource usage (e.g., how many GPUs can be elastically allocated, and the resource pressure from other job types such as inference jobs).

Plan

Currently HPA does not support the Kubernetes Job that the trainer uses, and the scaling API has changed since Kubernetes 1.6. Perhaps the best thing to do for now is to scale manually with a custom server.
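
Such a custom server would, in effect, change the trainer Job's .spec.parallelism through the Kubernetes API. The net effect is the same as the following command (the job name mnist-trainer is only a placeholder):

# Sketch: grow the trainer Job to 8 parallel Pods by patching .spec.parallelism.
kubectl patch job mnist-trainer --type=merge -p '{"spec":{"parallelism":8}}'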

helinwang commented 7 years ago

After some research, maybe one good way of abstracting the training/inferencing job is to create a custom resource (Job and Deployment are resources) and a custom controller for that resource.

The custom resource specifies the training/inferencing configuration (e.g., the minimum/maximum number of GPU trainers and the minimum/maximum number of pservers). The minimum/maximum values are for auto-scaling.
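
As a sketch, such a resource might look like the following; the API group paddlepaddle.org, the kind TrainingJob, and every field name here are hypothetical, not an existing API:

# Sketch only: assumes a TrainingJob custom resource has already been
# registered with the cluster; all names and fields below are made up.
cat <<EOF | kubectl create -f -
apiVersion: paddlepaddle.org/v1
kind: TrainingJob
metadata:
  name: example-training-job
spec:
  trainer:
    minInstance: 2    # lower bound for auto-scaling
    maxInstance: 16   # upper bound for auto-scaling
  pserver:
    minInstance: 2
    maxInstance: 4
EOF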

The custom controller coordinates all training/inferencing jobs. It knows how many GPU nodes are available, so it can dynamically scale all training/inferencing jobs accordingly.

References:
https://resources.coreos.com/youtube-coreos-fest-2017/writing-a-custom-controller-extending-the-functionality-of-your-cluster
https://coreos.com/blog/introducing-operators.html
https://github.com/kubernetes/community/blob/master/contributors/devel/controllers.md
https://coreos.com/blog/custom-resource-kubernetes-v17

helinwang commented 7 years ago

Design Doc: Horizontal Autoscaling: https://github.com/PaddlePaddle/cloud/pull/380