Open relu opened 3 years ago
Kyverno can be used for this, example here: https://github.com/fluxcd/flux2-multi-tenancy/blob/main/infrastructure/kyverno-policies/flux-multi-tenancy.yaml
Nice, I'm not familiar with Kyverno, but looks really good in the context of multi-tenancy configurations.
I would argue for doing this validation at the level of the controller codebase, in an actual admission webhook, since it's something I'd consider "core" logic, not sure if that makes sense... What I mean to say is that maybe we shouldn't rely on a 3rd party solution to enforce a validation policy for a resource that would render itself useless if an invalid spec is provided. I see Kyverno's policies as a layer above that "core" logic, meaning the controllers would run according to design just fine without it, lacking only the protections imposed at the organization management level.
I'm for reusing a CNCF project than writing something from scratch. Dealing with admission webhooks is hard, we need to setup TLS certs signed by Kubernetes API or depend on cert-manager to install Flux. Having cert-manager as an install dependency is really messy as it creates a chicken-egg problem.
I see your point, indeed it would probably be less of a hassle. I was more inclined in the direction of admission webhooks because of them being native to k8s, however, considering the drawbacks of having do deal with the certificate bundle, be it either through an init container or through cert-manager (which is trivial if you have it, but complicated otherwise), I suppose going the Kyverno route would make more sense, especially if there's already a use case for it in the context of supporting multi-tenancy configurations. Would this make Kyverno a hard dependency for Flux? If so, how do we manage it?
I don't think we should rely on cert-manager for Flux certs, as this means users have to install cert-manager by hand before Flux. The problem with Kyverno is that it doesn't support CRD conversion hooks and we'll need that when we'll release a new API version. I think we should investigate how can we generate certs at Flux install time and how can users rotate those certs before they expire.
Does that mean admission webhooks are still on the table?
For reference, I've found this article to answer the question about how certificate management for admission webhooks can be achieved. See the last part regarding certificate rotation (obviously it's a bit of a hurdle).
I agree a hard requirement for cert-manager would not be a good solution, but I think we should at least let users make use of it if they actually have it and maybe even recommend it, or otherwise, by default, rely on our potential built-in solution. ECK takes this approach, see their documentation on its configuration.
but I think we should at least let users make use of it if they actually have it and maybe even recommend it
How could they have cert-manager running if we tell people to do GitOps?
Does that mean admission webhooks are still on the table?
Yes if we can figure out how to generate the certs at Flux install time without requiring users to install things with kubectl beforehand. Both Kyverno and OPA Gatekeeper are generating certs at startup, so there is a way to do this without forcing people into deploying cert-manager.
How could they have cert-manager running if we tell people to do GitOps?
Hybrid configurations, such as using the terraform helm provider to install parts of the kubernetes extension components, cert-manager would be a good candidate for that. I'm not saying it's necessarily the right way, neither arguing around the conflicting instructions for users, but it can be a valid use case.
I looked a bit into how Kyverno does it and it does seem to come with some overhead. I see they actually implemented means to support certificate expiration validation and re-issuing but it does not seem to be used actively anywhere, so I assume the responsibility lies on the user to rotate the certificate at the moment (delete the secret and restart kyverno?).
I also checked on how ECK is doing it and it seems their method is a bit more simplistic and they actually have covered active renewal using a webhook certificates controller which checks if the certificate needs to be renewed and handles it.
If we plan to make use of the same strategy, I think the right place for this implementation would be under pkg
as a generic package that could be integrated and used in all other places.
This discussion is becoming more relevant to other controllers, now that I have discovered Artifactory is actively guiding people to use invalid Helm chart names. Something that could be blocked in a UX friendly manner using a validation webhook API in the controller, as it would require rule based validation on the HelmChart
object for the Chart
field based on the SourceRef.Kind
that is configured.
OPA has made available a Go module for generating and rotating webhook certs that works with controller-runtime https://github.com/open-policy-agent/cert-controller
Quick update on this: I managed to get to it and wanted to see if cert-controller
can be integrated. It looks like it's behind with controller-runtime
(v0.6.3) and is incompatible with the current v0.8 version in use by our controllers. Not sure what's the right approach, in this case. Probably developing and maintaining our own shared module in pkg
is the way to go. This would mean we have full control over any future controller-runtime
updates, essentially preventing potential dep blockers from upstream.
Knative went down the path of having a custom controller to reconcile webhook certificates: https://github.com/knative/pkg/tree/5708c4c442327025bc26eed03813dc10946d5a50/webhook/certificates
I'm running Tekton on my clusters and see that webhook crashing quite often....
We should create an admission webhook for validating the policies. This makes sense in the context of
ImagePolicyChoice
which should configure one and only one policy type. An image policy that lacks a validImagePolicyChoice
should be considered invalid and fail to create in this situation.