digitalocean / clusterlint

A best practices checker for Kubernetes clusters. 🤠
Apache License 2.0
547 stars 45 forks source link

Admission control webhook check in the "basic" group #92

Open adamwg opened 4 years ago

adamwg commented 4 years ago

We currently have a variety of webhook checks in the doks group, since various webhook configurations can be problematic for DOKS upgrades. However, there are also some generic best practices around admission control webhooks, documented in the upstream docs. For example, it's a generic best practice to set timeouts to small values (definitely less than 30 seconds, since that's the default apiserver request timeout).

We should build some generic webhook best practice checks that can be included in the basic group as well as the doks group.

dlemel8 commented 4 years ago

Hey @adamwg , I want to do this. Can you guide me a bit? Thanks.

adamwg commented 4 years ago

@dlemel8 Absolutely!

You can start by taking a look at the admission-controller-webhook-timeout check that we currently have in the doks group. I think it already implements the basic logic we would want for a timeout check in the basic group, so a first step would be to move it to the basic/ directory, and add basic to the list of groups it's registered in. (It should also stay in the doks group, since it is a DOKS pre-upgrade check.) The wording in the check diagnostics will need an update to be more generic.

I think there are at least two other checks that could be implemented in the basic group. These are both currently covered by the admission-controller-webhook-replacement check in the doks group, but should be broken out into their own checks:

  1. Checking that a webhook won't block its own service from running.
  2. Checking that a webhook doesn't apply to the kube-system namespace.

If those conditions are broken out into their own checks, they could be added to the doks group as well, and we could make the larger existing check smaller.

dlemel8 commented 4 years ago

@adamwg great, thanks! I will start working on it soon :)

dlemel8 commented 4 years ago

@adamwg just to make sure I understand correctly (regarding to admission-controller-webhook-replacement refactoring):

  1. currently the code will generate a diagnose only if the webhook is applied to system namespace and one of two is true: this is also the namespace of the webhook service this is not the namespace of the webhook service but the number of nodes is 1 do we want to just extract this logic to a new check or do we want to create stricter checks (for example create a diagnose if the webhook is applied to system namespace without any condition)?
  2. currently the code contains 5 checks that must all fail to generate a diagnose. if we remove some of them (for example the system namespace one) we will weaken the overall check and will generate a diagnose on a service that today is ok. is that what you meant?
adamwg commented 4 years ago

@dlemel8 Sorry for the delay getting back to you on this.

I think for the new check(s) in the basic group. we want to omit some of the existing conditions. The checks I'm proposing are based on the upstream guidance around webhook configurations, so they should be quite strict and would generate warnings for some configurations that are currently considered OK.

dlemel8 commented 4 years ago

@adamwg OK, that's answer my question about the new checks on basic group. What about the question about the current admission-controller-webhook-replacement ? if we remove some of its conditions we will warn about a service that now is OK