Open Padarn opened 3 years ago
Hey @Padarn, regarding points 2 and 3: I'm preparing a small PR that lets you tell the operator to take a savepoint before the cluster restarts on upgrade (this is good regardless of the automatic part). Regarding the autoscaling capability: this might be a nice idea, but it could also be a separate component that communicates with the operator. Flink scaling is a little complicated, and an approach that scales a cluster up based on CPU metrics alone can have no impact, or even a negative impact, on some clusters.
Hey @shashken, thanks for your response. You make a good point: it can certainly be a separate component, and that would be much cleaner.
Let me know when you have a PR ready. I'd be keen to review it to get more exposure to the operator layout.
@Padarn Done - https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/pull/392
`ShouldTakeSavepointOnUpgrade` is the flag I added. That was actually the smallest change in the PR; I also fixed another savepoint issue to make the savepoint feature more stable.
@Padarn, thanks for the proposal. I like the idea of adding autoscaling as a feature of the operator. It should be purely declarative for end users; a separate component would be more complex for them to use. It would be nice if you could contribute a PR, thanks!
thanks @shashken
@functicons would be happy to work on a PR. Perhaps as there are some differing opinions on this, I will first create a POC version and ask for a review. If everyone is aligned I can clean it up to be merged.
I'll have a go at this over the weekend.
Hi guys. I looked into this a bit and I see there as being two options.
The second option seems easy enough, but it does mean reimplementing a lot of functionality that already exists in the HPA. So if it were possible to use the HPA I think that might be better.
Open to any thoughts.
I haven't reviewed your PR, but I prefer option 1: if we can reuse the HPA, it will be easier to maintain.
Yeah, I tend to agree. I will try adding an HPA to the resources run by the operator. I need to look at some other examples of operators to see how they handle optional resources.
Hi @functicons, I've updated the PR to create an HPA along with the operator itself. The PR is not fully tested yet, but it serves as an example of how this would look given our discussion above.
To give some detail on how the process would work:
`scaleTargetRef` as this is the TaskManager stateful set of our operator.
`FlinkCluster` from which it should use the new `/scale` subresource to set the cluster spec (note that this means scaling the CRD spec, so the same reconcile loop for `FlinkCluster` updates is followed).
Notes:
`autoscaling/v1`, not the newer `autoscaling/v2beta1`. Can update this if the overall approach is agreed on.
Hi all,
I'm interested in autoscaling support for the operator. Ververica Platform supports this now; the mechanism, as I gather from their post, is:
With perhaps 2 and 3 swapped to ensure smaller downtime.
I'd be interested in adding this if it is something others would use. Any thoughts on the idea? It seems pretty straightforward to attach an HPA and use its 'target replicas' to modify the desired number of TaskManagers.
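To make the 'target replicas' idea concrete, here is a minimal sketch of the replica calculation the Kubernetes HPA documentation describes: `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured minimum and maximum. The function names and sample numbers are made up for illustration, and the real controller also applies a tolerance band and stabilization behaviour that are left out here.

```go
package main

import (
	"fmt"
	"math"
)

// clamp restricts v to the inclusive range [lo, hi].
func clamp(v, lo, hi int32) int32 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// desiredReplicas applies the documented HPA rule
// ceil(current * currentMetric / targetMetric) and clamps the result.
// If the inputs cannot produce a meaningful ratio, the current count is kept.
func desiredReplicas(current int32, currentMetric, targetMetric float64, minReplicas, maxReplicas int32) int32 {
	if current <= 0 || targetMetric <= 0 {
		return clamp(current, minReplicas, maxReplicas)
	}
	want := int32(math.Ceil(float64(current) * currentMetric / targetMetric))
	return clamp(want, minReplicas, maxReplicas)
}

func main() {
	// 4 TaskManagers at 90% average CPU against a 60% target.
	fmt.Println(desiredReplicas(4, 90, 60, 1, 10))
}
```

An operator could copy a value like this into the desired TaskManager count in the cluster spec. As noted elsewhere in this thread, though, CPU alone may be a poor scaling signal for some Flink jobs.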