GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0
658 stars 266 forks source link

Support for TaskManager Autoscaling #284

Open TsvetanKonstantinov opened 4 years ago

TsvetanKonstantinov commented 4 years ago

TaskManager Autoscaling is an important feature for production Flink workloads. Currently, I can't create a Horizontal Pod Autoscaler for the FlinkCluster custom resource. K8s provides support for autoscaling custom resources via the /scale subresource.

TsvetanKonstantinov commented 4 years ago

Below are some notes on how to support the k8s Horizontal Pod Autoscaler by adding the /scale subresource to the FlinkCluster CRD:

A guide on adding the scale endpoint to custom resources.

functicons commented 4 years ago

Sounds like a great feature to add. IIUC, it leverages the update cluster feature we added recently when the autoscaler decides TM should be scaled.

But I have a concern about the compatibility with Flink's native k8s integration. Eventually this operator will migrate to that model. Does HPA still make sense with that?

TsvetanKonstantinov commented 4 years ago

Sounds like a great feature to add. IIUC, it leverages the update cluster feature we added recently when the autoscaler decides TM should be scaled.

Yes, it leverages the update cluster feature.

But I have a concern about the compatibility with Flink's native k8s integration. Eventually, this operator will migrate to that model. Does HPA still make sense with that?

The concern is valid.

Do you mean this (Flink reactive mode)? When implemented, Flink will be able to react to newly started TaskManagers automatically. Theoretically, once Flink delivers this feature, you should be able to add an HPA to the TaskManager deployment.

It would still make more sense to add the HPA to the FlinkCluster custom resource instead. The only change of the logic would be that the operator should not trigger an update if "reactive" mode is enabled but simply start/stop TM replicas and let Flink do the work.

Having the scale sub-resource on the CRD will open the door for additional features. E.g. it is not clear how "reactive mode" will behave for Session Clusters. The operator can support auto-scaling only some jobs that are running on a session cluster based on their resource utilization.

functicons commented 4 years ago

Sounds good. Go ahead adding this feature, I will review your PR once it is ready.

jmorcar commented 3 years ago

Is there any update for this improvement?