Do not create probe for control-plane-as-a-service shoots

unmarshall commented 1 year ago

How to categorize this issue?

/area control-plane /kind enhancement /priority 3

What would you like to be added:

Gardener issue #7635 introduces the control-plane-as-a-service concept, where the number of workers will be 0 and the number of control plane components will also be reduced. MCM and CA will not be deployed, and there will be no need to scale KCM up or down because no workload is scheduled (there are no nodes).

This enhancement optimises DWD by preventing the creation of a probe for any Cluster whose number of workers is 0.

Why is this needed:

Currently DWD has a single configuration of dependent services for the prober, which applies to all shoot control namespaces in the seed. This configuration is contained in a ConfigMap deployed in the garden namespace of the seed. For CPAAS (control plane as a service), no deployments will be created for MCM and CA, and there is no need to scale down KCM. The prober therefore has nothing to do in these CPAAS namespaces.

ashwani2k commented 1 year ago

I was wondering if this will get implicitly handled by having shoot specific configs for DWD. As then a missing config for a shoot will ignore that shoot from DWD probing. Enhancement to usage story for -- https://github.com/gardener/dependency-watchdog/issues/79

unmarshall commented 1 year ago

> I was wondering if this will get implicitly handled by having shoot specific configs for DWD. As then a missing config for a shoot will ignore that shoot from DWD probing. Enhancement to usage story for -- https://github.com/gardener/dependency-watchdog/issues/79

There are 2 ways to do this:

As discussed with Vedran and also captured in #79, one way is to have shoot-specific configuration. This can be implemented in two ways:
1. Continue with a default DWD probe configuration created in the garden namespace, which is the case today. In the individual shoot control namespaces, consumers/operators can overwrite one or more configuration parameters by explicitly creating a configuration-override. The resulting configuration for that shoot is a merge of the default with the configuration-override, which avoids repeating the same centrally created configuration again and again. The benefit is that a change to a few defaults is applied uniformly to all shoot control namespaces (leaving the overridden values in place) without also changing the ConfigMap in every shoot control namespace.
2. A simplified implementation approach where per shoot namespace one defines a configmap for the prober component of DWD.

A few points to consider:

Until now we have not felt the need to duplicate the prober ConfigMap for each shoot control namespace. It is also not clear whether different shoots managed by a seed will need different node-monitor-grace-period durations, which would in turn require different probe timeouts.
Would it be easier to simply use a len(workers) == 0 predicate to decide whether a probe should be created for a shoot, or to base the decision on the absence of a prober configuration in the shoot control namespace?
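
The worker-count predicate could look roughly like the sketch below. The `Shoot` and `Worker` types are hypothetical, trimmed-down stand-ins for whatever part of the Cluster resource DWD would actually read, and `ShouldCreateProbe` is an illustrative name, not an existing DWD function.

```go
package main

import "fmt"

// Worker is a hypothetical stand-in for a shoot worker pool.
type Worker struct {
	Name string
}

// Shoot is a hypothetical, trimmed-down view of a shoot: only the
// worker pools matter for the decision.
type Shoot struct {
	Name    string
	Workers []Worker
}

// ShouldCreateProbe reports whether a probe is worth creating.
// A shoot without workers (control plane as a service) has no nodes,
// so there is nothing for the prober to protect.
func ShouldCreateProbe(s Shoot) bool {
	return len(s.Workers) > 0
}

func main() {
	shoots := []Shoot{
		{Name: "regular", Workers: []Worker{{Name: "pool-a"}}},
		{Name: "cpaas"}, // no workers
	}
	for _, s := range shoots {
		fmt.Printf("%s: create probe = %t\n", s.Name, ShouldCreateProbe(s))
	}
}
```

Compared with checking for a missing per-shoot prober configuration, this check needs no extra object in the shoot control namespace.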

Therefore it is not yet clear whether #79 and this issue will have the same solution.

vlerenc commented 1 year ago

Considering the "points to ponder", my two cents are:

If we change the default of the node-monitor-grace-period back to 40s, probably nobody will lower it further. Even if they do, DWD is probably no longer useful then, because undercutting an even shorter node-monitor-grace-period is too dangerous: it might scale down the components too aggressively or too often. So I think we do not have to care about clusters with an even lower node-monitor-grace-period, because we cannot make DWD react even more aggressively without risking detrimental effects. Ergo, there is no need to do anything here anymore and we can stick with the central configuration.
I would indeed say, let's just go with something very simple like len(workers) == 0. That's the mechanism that also suppresses the deployment of all these other components like MCM, CA, KSCH, etc., so why shouldn't it be the same trigger/condition that suppresses DWD from acting? I would think it should, which would make this task much simpler.

But the above aside, there is no real need to suppress DWD at all as long as it doesn't try to scale up MCM/CA in a nodeless cluster or fail because their deployments are missing. Why shouldn't it be watching also these control planes or what's the harm for the rest of the functionality (KCM, ETCD<-KAPI)? It just shouldn't fail, but whatever it can do, it can continue to do also for these clusters, no?

unmarshall commented 1 year ago

> Why shouldn't it be watching also these control planes or what's the harm for the rest of the functionality (KCM, ETCD<-KAPI)? It just shouldn't fail, but whatever it can do, it can continue to do also for these clusters, no?

Bringing down KCM when KAPI is unavailable is not really required: there are no nodes, so no meltdown prevention is needed. So if tomorrow we see a LOT of such control planes in a seed, we will unnecessarily create long-running goroutines (one per shoot namespace) that will not do anything meaningful or helpful for the end user.

gardener/dependency-watchdog #80