digitalocean / clusterlint

A best practices checker for Kubernetes clusters. 🤠
Apache License 2.0
542 stars 45 forks source link

Webhook with a timeoutSeconds greater than 29 seconds will block upgrades? #167

Open kyrofa opened 5 months ago

kyrofa commented 5 months ago

I'm trying to upgrade from dok8s v1.28 to v1.29, but clusterlint is telling me that cert-manager's webhook, which uses a timeoutSeconds of 30, will block the upgrade. Would someone mind filling me in on why that's the case suddenly? I've been using cert-manager for years, across several k8s versions, without issue. I see in the upstream docs where it says:

The timeout value must be between 1 and 30 seconds.

But it's unclear whether that's inclusive or not. Clusterlint has seemingly decided that it's 1-inclusive and 30-exclusive. Is that actually based in reality? Is there another document I've missed? 29 seems so random, haha!

For now I will customize my deployment and set this to 29 to make clusterlint happy, but I'd like to get to the bottom of this. I've logged an issue against cert-manager as well. If clusterlint is correct, cert-manager should probably make the default value 29. On the other hand, if 30 is indeed a valid value, clusterlint probably shouldn't complain about it.

kyrofa commented 5 months ago

I just tried changing timeoutSeconds to 31 and got this:

Error: cannot patch "cert-manager-webhook" with kind MutatingWebhookConfiguration: MutatingWebhookConfiguration.admissionregistration.k8s.io "cert-manager-webhook" is invalid: webhooks[0].timeoutSeconds: Invalid value: 31: the timeout value must be between 1 and 30 seconds && cannot patch "cert-manager-webhook" with kind ValidatingWebhookConfiguration: ValidatingWebhookConfiguration.admissionregistration.k8s.io "cert-manager-webhook" is invalid: webhooks[0].timeoutSeconds: Invalid value: 31: the timeout value must be between 1 and 30 seconds

This is leading me to believe 30 should really be a valid value. Happy to make a PR if folks agree.

kyrofa commented 5 months ago

@maelvls provided some additional details on the issue I logged against cert-manager:

There is this comment in kubelint's code:

Webhooks with TimeoutSeconds set: less than 1 or greater than or equal to 30 is bad.

and

TimeoutSeconds value should be set to a non-nil value (greater than or equal to 1 and less than 30). If the TimeoutSeconds value is set to nil and the cluster version is 1.13.*, users are unable to configure the TimeoutSeconds value and this value will stay at nil, breaking upgrades. It's only for versions >= 1.14 that the value will default to 30 seconds.

That combined with my experimentation above makes it pretty clear that 30 seconds is actually a valid value, and clusterlint should allow it.

timoreimann commented 5 months ago

I vaguely remember there was a "conflict" between 30 seconds used by a webhook and the default 30 seconds timeout of kubectl: if a webhook used a 30 seconds timeout, the default timeout from kubectl could "override" the webhook one, rendering failure policies of Ignore ineffective.

@varshavaradarajan does that make sense, or am I completely off here?

elcfd commented 2 months ago

Hi - could you please let us know the reason why the upper limit of this value has been set to 29?

Seeing the above mentioned issue on both our Digital Ocean k8s clusters which run the cert manager helm chart.

Thanks,

Charlie

kyrofa commented 2 months ago

It sounds like the reason is almost lost to the dust of time. If I may, I recommend following shellcheck's model: not only point out problems, but write quick blurbs describing WHY it's a problem, and ways to fix it. That will be helpful not only to clusterlint users, but clusterlint maintainers over time.

timoreimann commented 2 months ago

There should be documentation on every check, but as usual it can probably be improved.

I am reasonably sure that what I wrote at https://github.com/digitalocean/clusterlint/issues/167#issuecomment-2065821917 about the upper limit is true. Should be fairly easy to verify and afterwards improve / adjust the documentation one way or another.

elcfd commented 2 months ago

@kyrofa I contacted Digital Ocean support as I am a customer and this is their response/reasoning:


I completely understand your concern here. I will be happy to assist you. We recommend ensuring the timeout is less than 30s. This is because 30s is the default timeout for Kubernetes objects and it is possible to create a scenario that lines up perfectly with other timeouts which would allow the timeout to fail indefinitely. For more details refer below link:

https://docs.digitalocean.com/support/clusterlint-error-fixes/#admission-controller-webhook-timeout

For example, if a node hosting the webhook goes down, then webhook will not respond to requests. This is problematic when nodes try and register to the master API and the master API cannot reach the webhook. The node registration fails (because the failurePolicy is set to 'Fail') while waiting on the webhook to become available. But the webhook often can't reschedule till a new node has been created. Thus creating a bit of a circular dependency scenario.


elcfd commented 2 months ago

There should be documentation on every check, but as usual it can probably be improved.

I am reasonably sure that what I wrote at #167 (comment) about the upper limit is true. Should be fairly easy to verify and afterwards improve / adjust the documentation one way or another.

This matches what Digital Ocean support have told me 👍