Support for TaskManager Autoscaling

TsvetanKonstantinov commented 4 years ago

TaskManager Autoscaling is an important feature for production Flink workloads. Currently, I can't create a Horizontal Pod Autoscaler for the FlinkCluster custom resource. K8s provides support for autoscaling custom resources via the /scale subresource.

TsvetanKonstantinov commented 4 years ago

Below are some notes on how to support the k8s Horizontal Pod Autoscaler by adding the /scale subresource to the FlinkCluster CRD:

SpecReplicasPath should point to the spec/taskManager/replicas property of the CRD. For Job Clusters, updating spec/taskManager/replicas should result in an update of the spec/job/parallelism property because both are needed for successful down/up scaling of a Flink job. For session clusters, the autoscaler will scale the cluster, not the jobs, so only the TM replicas field can be adjusted. The documentation should mention that provided values for spec/taskManager/replicas (and spec/job/parallelism in the case of JobCluster) will be ignored if an autoscaler is configured.

StatusReplicasPath should point to a new status/taskManager/replicas property of the CRD that contains the current (actual) number of replicas of the TM deployment. Currently, this field is not present in the CRD status section.
LabelSelectorPath should point to the pods of the TM deployment. This way, the autoscaler will watch the CPU utilization of the TM pods and update the spec/taskManager/replicas field of the CRD accordingly. This will trigger an update of the cluster/job with the new TM replicas count (and possibly the new job parallelism that is equal to the number of replicas).

A guide on adding the scale endpoint to custom resources.

functicons commented 4 years ago

Sounds like a great feature to add. IIUC, it leverages the update cluster feature we added recently when the autoscaler decides TM should be scaled.

But I have a concern about the compatibility with Flink's native k8s integration. Eventually this operator will migrate to that model. Does HPA still make sense with that?

TsvetanKonstantinov commented 4 years ago

Sounds like a great feature to add. IIUC, it leverages the update cluster feature we added recently when the autoscaler decides TM should be scaled.

Yes, it leverages the update cluster feature.

But I have a concern about the compatibility with Flink's native k8s integration. Eventually, this operator will migrate to that model. Does HPA still make sense with that?

The concern is valid.

Do you mean this (Flink reactive mode)? When implemented, Flink will be able to react to newly started TaskManagers automatically. Theoretically, once Flink delivers this feature, you should be able to add an HPA to the TaskManager deployment.

It would still make more sense to add the HPA to the FlinkCluster custom resource instead. The only change of the logic would be that the operator should not trigger an update if "reactive" mode is enabled but simply start/stop TM replicas and let Flink do the work.

Having the scale sub-resource on the CRD will open the door for additional features. E.g. it is not clear how "reactive mode" will behave for Session Clusters. The operator can support auto-scaling only some jobs that are running on a session cluster based on their resource utilization.

functicons commented 4 years ago

Sounds good. Go ahead adding this feature, I will review your PR once it is ready.

jmorcar commented 3 years ago

Is there any update for this improvement?

GoogleCloudPlatform / flink-on-k8s-operator

Support for TaskManager Autoscaling #284