CA-MCM overhaul

himanshu-kun commented 1 year ago

Reason Discussion:

Currently there are some CA-MCM interaction issues that we want to fix. One solution is to change the way CA and MCM currently work together entirely. This issue is to discuss the feasibility of such approaches.

Terms for the discussion (to avoid confusion):

- k/CA = Kubernetes CA
- g/CA = Gardener CA (fork of k/CA)
- new-CA = new CA code we'll implement, which could be a component or a library

Dimensions of discussion:

1) Possible Goals
  1) Use new-CA as a library inside MCM: the new-CA library only recommends and MCM decides. Currently g/CA makes a binding recommendation.
    - Ditch g/CA entirely; design and implement from scratch. Basically, leverage more kube-scheduler predicates directly
    - Get rid of node-groups
    - Benefit:
      - Can support more than 1000 nodes, which CA currently does not
      - Can fit more pods on the nodes
  2) Leverage current k/CA
    - Combine MCM into g/CA, so that CA runs the MCM controller, and ditch the current MCM controller completely
    - We still maintain the fork, but the aim is to leverage the current features and the community support that upstream offers
    - Benefit:
      - Solves "MCM is down and CA is up" kinds of issues
      - Targeted removal of a machine can be easier
2) High Demand stories (which use current design)
  - https://github.com/gardener/autoscaler/issues/227
  - https://github.com/gardener/autoscaler/issues/154
  - other, relatively smaller bugfixes listed in the CA-MCM board
3) Impact of overhaul to deal with current problems
  - Current CA behaviours that are unpleasant (need to be verified):
    - Kube-scheduler config can differ from the scheduler code imported by CA
    - Limitation of 1 machine type per node group
    - Many CLI flags in k/CA, which could confuse customers
    - Can't handle waitForFirstConsumer PVs
    - We want to increase utilisation of seeds, but this doesn't seem to happen with the current CA
    - Scale-down is treated as secondary, scale-up as the primary goal
    - Scale-down is not supported in the same RunOnce() flow if a scale-up happened (or until it happens), or if scale-down is in cool-down
4) Time required to be invested (excluding any time spent on current design and other dev tasks)
  - 1 yr min.
5) Maintenance effort, Support
  - We need to deal with all the issues ourselves (verifying them) and implement features even if k/CA already provides them
  - Community support will be lost
6) Rollout strategy (if implementing)
  - Keep the current design running, deploy MCM with the recommending new-CA (Goal 1) alongside it, and compare the recommendations
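
To make the rollout idea in point 6 more concrete, here is a minimal Go sketch of how the recommendations of the running g/CA and of a shadow new-CA could be compared per worker pool. Everything in it is an assumption made for illustration: the `Recommendation` and `Diff` types, the `CompareRecommendations` function and the pool names are hypothetical and do not exist in g/CA or MCM, and treating a pool that is missing from one side as 0 replicas is just one possible convention.

```go
package main

import (
	"fmt"
	"sort"
)

// Recommendation is a hypothetical, non-binding scaling suggestion:
// the desired replica count per worker pool.
type Recommendation map[string]int

// Diff describes one pool on which the two recommendations disagree.
type Diff struct {
	Pool    string
	Current int // replicas requested by the existing g/CA
	Shadow  int // replicas suggested by the shadow new-CA
}

// CompareRecommendations returns the pools on which the shadow recommender
// disagrees with the current one, sorted by pool name.
// Assumption: a pool missing from one side counts as 0 replicas there.
func CompareRecommendations(current, shadow Recommendation) []Diff {
	pools := map[string]struct{}{}
	for p := range current {
		pools[p] = struct{}{}
	}
	for p := range shadow {
		pools[p] = struct{}{}
	}

	var diffs []Diff
	for p := range pools {
		if current[p] != shadow[p] {
			diffs = append(diffs, Diff{Pool: p, Current: current[p], Shadow: shadow[p]})
		}
	}
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].Pool < diffs[j].Pool })
	return diffs
}

func main() {
	current := Recommendation{"worker-a": 3, "worker-b": 5}
	shadow := Recommendation{"worker-a": 3, "worker-b": 4, "worker-c": 1}

	for _, d := range CompareRecommendations(current, shadow) {
		fmt.Printf("%s: current=%d shadow=%d\n", d.Pool, d.Current, d.Shadow)
	}
}
```

In such a setup only the current design would act on its result; the shadow output would merely be logged or exported as a metric, so disagreements can be reviewed before new-CA is trusted with decisions.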

himanshu-kun commented 1 year ago

/assign @elankath @unmarshall @rishabh-11

himanshu-kun commented 1 year ago

/assign

vlerenc commented 1 year ago

I was wondering whether we can leverage the "ground truth" a.k.a. the kube-scheduler more directly (no CA at all), e.g.:

- Have "simulated" (non-existing) nodes and provide the machines after the kube-scheduler has scheduled non-daemonset pods to them, then move the pods over once the machine has joined the cluster. But that's probably too ugly/visible to the end users.
- Run a second kube-scheduler that accesses a restricted KAPI proxy, which proxies the real KAPI and adds these fake nodes to it. The rest is similar to the above: once the machine comes up and joins the cluster, the pods are moved to the real nodes and the fake one remains "open"; if a fake node is full, more fake nodes are added in case more capacity is required.

The point is, every simulation that is not based on the real kube-scheduler will be flawed, so why not find a way to trick it into doing what we need instead of an approximation (like the CA tries)?
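
The bookkeeping behind the fake-node idea can be sketched in Go as follows. This is only an illustration under loud assumptions: the placement is a naive first-fit on CPU and memory that stands in for the real kube-scheduler (which is exactly the part the proposal would not approximate), and the `Resources`, `Pod`, `FakeNode` types and the `PlaceOnFakeNodes` function are hypothetical names, not existing Gardener or Kubernetes APIs. What the sketch shows is the surrounding loop: pending pods land on fake nodes, a new fake node is added when none has room, and the number of fake nodes used is the number of machines to provision.

```go
package main

import "fmt"

// Resources is a simplified resource vector (milli-CPU and MiB of memory).
type Resources struct {
	MilliCPU int
	MemMiB   int
}

// Pod is a pending pod reduced to its resource requests.
type Pod struct {
	Name     string
	Requests Resources
}

// FakeNode is a placeholder for a machine that does not exist yet.
type FakeNode struct {
	Name string
	Free Resources
	Pods []string
}

// fits reports whether a request fits into the remaining free resources.
func fits(free, req Resources) bool {
	return req.MilliCPU <= free.MilliCPU && req.MemMiB <= free.MemMiB
}

// PlaceOnFakeNodes assigns pods to fake nodes using first-fit and adds a new
// fake node of the given capacity whenever no existing one has room.
// Pods that are larger than a whole node are reported as unschedulable.
// Assumption: first-fit replaces the real kube-scheduler in this sketch.
func PlaceOnFakeNodes(pods []Pod, capacity Resources) (nodes []*FakeNode, unschedulable []string) {
	for _, p := range pods {
		if !fits(capacity, p.Requests) {
			unschedulable = append(unschedulable, p.Name)
			continue
		}
		var target *FakeNode
		for _, n := range nodes {
			if fits(n.Free, p.Requests) {
				target = n
				break
			}
		}
		if target == nil {
			target = &FakeNode{Name: fmt.Sprintf("fake-%d", len(nodes)), Free: capacity}
			nodes = append(nodes, target)
		}
		target.Free.MilliCPU -= p.Requests.MilliCPU
		target.Free.MemMiB -= p.Requests.MemMiB
		target.Pods = append(target.Pods, p.Name)
	}
	return nodes, unschedulable
}

func main() {
	pods := []Pod{
		{Name: "a", Requests: Resources{MilliCPU: 1500, MemMiB: 2048}},
		{Name: "b", Requests: Resources{MilliCPU: 1000, MemMiB: 1024}},
		{Name: "c", Requests: Resources{MilliCPU: 1000, MemMiB: 1024}},
		{Name: "huge", Requests: Resources{MilliCPU: 8000, MemMiB: 1024}},
	}
	nodes, unschedulable := PlaceOnFakeNodes(pods, Resources{MilliCPU: 2000, MemMiB: 4096})

	fmt.Printf("machines to provision: %d, unschedulable: %v\n", len(nodes), unschedulable)
	for _, n := range nodes {
		fmt.Println(n.Name, n.Pods)
	}
}
```

In the variant with a second kube-scheduler, the inner first-fit loop would disappear: the scheduler itself would bind pods to the fake nodes, and only the "add another fake node when full" and "provision one machine per used fake node" steps would remain.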

elankath commented 1 year ago

Running a second kube-scheduler (it can run in-process) that operates on a simulated model is a pretty nice idea. It would also reduce the implementation effort for such a large task. Though, do we have any Gardener customers that run custom schedulers?

vlerenc commented 1 year ago

@elankath I don't know, but the Kubernetes or Gardener CA or our own simulation would not match such a custom scheduler either. The reason we introduced kube-scheduler configurability (e.g. bin-packing) was that even large teams didn't want to run control plane components by themselves. I don't think we need to consider those cases, and if we did, we probably shouldn't consider an automated solution but rather provide them with an API so that they can provision and deprovision nodes themselves. They would then have to build the bridge between their scheduler and our API themselves, but unless somebody asks, I wouldn't even consider that. I have difficulty imagining many/some/anybody running their own kube-scheduler.

gardener / autoscaler

CA-MCM overhaul #251