gardener / dependency-watchdog

This controller checks the status of etcd and restarts control plane components which are in a state of crashloop-backoff over an extensive period of time.
Apache License 2.0

Do not create probe for control-plane-as-a-service shoots #80

Closed unmarshall closed 1 year ago

unmarshall commented 1 year ago

How to categorize this issue?

/area control-plane /kind enhancement /priority 3

What would you like to be added:

Gardener Issue#7635 introduces the control-plane-as-a-service concept, where the number of workers will be 0 and the number of control plane components will also be reduced. MCM and CA will not be deployed, and there will be no need to scale KCM up or down because no workload is scheduled (there are no nodes).

This enhancement optimises DWD by not creating a probe for any Cluster whose number of workers is 0.
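A minimal sketch of such a check in Go. The `Worker` type and the `shouldCreateProbe` helper are hypothetical names used only for illustration; they are not taken from the DWD code base, and the sketch assumes that a shoot without worker pools is a control-plane-as-a-service shoot.

```go
package main

import "fmt"

// Worker is a hypothetical stand-in for one worker pool of a shoot.
type Worker struct {
	Name string
}

// shouldCreateProbe reports whether a probe is worth creating for a shoot.
// Assumption: zero worker pools means a control-plane-as-a-service shoot,
// for which the prober has nothing to do.
func shouldCreateProbe(workers []Worker) bool {
	return len(workers) > 0
}

func main() {
	fmt.Println(shouldCreateProbe(nil))                       // false
	fmt.Println(shouldCreateProbe([]Worker{{Name: "pool-a"}})) // true
}
```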

Why is this needed:

Currently, DWD has a single prober configuration of dependent services, which applies to all shoot control namespaces in the seed. This configuration is contained in a ConfigMap deployed in the garden namespace of the seed. For CPAAS (control plane as a service), no deployments will be created for MCM and CA, and there is no need to scale down KCM. The prober therefore has nothing to do for these CPAAS namespaces.

ashwani2k commented 1 year ago

I was wondering if this would be handled implicitly by having shoot-specific configs for DWD: a missing config for a shoot would then exclude that shoot from DWD probing. That would be an enhancement to the usage story for https://github.com/gardener/dependency-watchdog/issues/79

unmarshall commented 1 year ago

> I was wondering if this would be handled implicitly by having shoot-specific configs for DWD: a missing config for a shoot would then exclude that shoot from DWD probing. That would be an enhancement to the usage story for https://github.com/gardener/dependency-watchdog/issues/79

There are 2 ways to do this:

  1. As discussed with Vedran and also captured in #79, one way is to have shoot-specific configuration. There are two ways to do this:
    1. Continue with a default DWD probe configuration created in the garden namespace, which is the case today. In the individual shoot control namespaces, consumers/operators can overwrite one or more configuration parameters by explicitly creating a configuration-override. The resulting configuration for that shoot is a merge of the default with the configuration-override, which avoids repeating the same centrally created configuration again and again. The benefit is that a change to a few defaults is applied uniformly to all shoot control namespaces (leaving the overridden values intact) without also changing the configmap in each shoot control namespace individually.
    2. A simplified implementation approach where per shoot namespace one defines a configmap for the prober component of DWD.
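The default-plus-override merge from the first option could look roughly like the following Go sketch. `ProbeConfig`, its two fields, and `mergeConfig` are hypothetical and simplified; they do not mirror the actual DWD configuration schema. Pointer fields are used so that "not set in the override" can be told apart from an explicit value.

```go
package main

import "fmt"

// ProbeConfig is a hypothetical, simplified probe configuration.
// A nil field means "not set".
type ProbeConfig struct {
	ProbeIntervalSeconds *int
	FailureThreshold     *int
}

// mergeConfig returns the defaults with every field that is set in the
// override replaced by the override's value.
func mergeConfig(defaults, override ProbeConfig) ProbeConfig {
	merged := defaults
	if override.ProbeIntervalSeconds != nil {
		merged.ProbeIntervalSeconds = override.ProbeIntervalSeconds
	}
	if override.FailureThreshold != nil {
		merged.FailureThreshold = override.FailureThreshold
	}
	return merged
}

// intPtr is a small helper for building configs with literal values.
func intPtr(v int) *int { return &v }

func main() {
	defaults := ProbeConfig{ProbeIntervalSeconds: intPtr(10), FailureThreshold: intPtr(3)}
	override := ProbeConfig{FailureThreshold: intPtr(5)} // shoot-specific override
	merged := mergeConfig(defaults, override)
	fmt.Println(*merged.ProbeIntervalSeconds, *merged.FailureThreshold) // 10 5
}
```

With this shape, changing a default in the garden namespace reaches every shoot that has not overridden that particular field.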

Few points to consider/ponder:

Therefore it is not final/clear if #79 and this issue will have the same solution.

vlerenc commented 1 year ago

Considering the "points to ponder", my two cents are:

But the above aside, there is no real need to suppress DWD at all as long as it doesn't try to scale up MCM/CA in a nodeless cluster or fail because their deployments are missing. Why shouldn't it also be watching these control planes, or what's the harm for the rest of the functionality (KCM, ETCD<-KAPI)? It just shouldn't fail, but whatever it can do, it can continue to do for these clusters too, no?
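As a rough illustration of "it just shouldn't fail", the following Go sketch scales only the dependent deployments that actually exist and reports the ones it skipped. The `scaleExisting` function and the in-memory `existing` map are hypothetical; a real implementation would look up and scale deployments through the Kubernetes API.

```go
package main

import "fmt"

// scaleExisting sets the replica count for every target deployment that
// exists and returns the names of targets it skipped because they are missing.
// existing is a hypothetical in-memory view: deployment name -> replicas.
func scaleExisting(existing map[string]int, targets []string, replicas int) (skipped []string) {
	for _, name := range targets {
		if _, ok := existing[name]; !ok {
			// Missing deployment (e.g. no MCM/CA in a nodeless cluster): skip, don't fail.
			skipped = append(skipped, name)
			continue
		}
		existing[name] = replicas
	}
	return skipped
}

func main() {
	existing := map[string]int{"kube-controller-manager": 1}
	targets := []string{"kube-controller-manager", "machine-controller-manager", "cluster-autoscaler"}
	skipped := scaleExisting(existing, targets, 0)
	fmt.Println(existing["kube-controller-manager"], skipped)
	// prints: 0 [machine-controller-manager cluster-autoscaler]
}
```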

unmarshall commented 1 year ago

> Why shouldn't it also be watching these control planes, or what's the harm for the rest of the functionality (KCM, ETCD<-KAPI)? It just shouldn't fail, but whatever it can do, it can continue to do for these clusters too, no?

Bringing down KCM when KAPI is unavailable is not really required: there are no nodes, so no meltdown prevention needs to be done. If tomorrow we see a LOT of such control planes in a seed, we will unnecessarily create long-running goroutines (one per shoot namespace) that do nothing meaningful or helpful for the end user.