aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Cluster guardrails and conformance packs #1949

Open bryantbiggs opened 1 year ago

bryantbiggs commented 1 year ago


Tell us about your request Kubernetes clusters tend to span multiple personas with varying degrees of knowledge of Kubernetes recommended practices. This leads to scenarios where cluster administrators create a cluster provisioning pipeline, a cluster-as-a-service or namespace-as-a-service offering, or an internal "golden template" for how clusters should be created and configured. However, this requires a significant upfront investment from administrators to work out the automations, tooling, and configurations before they can start vending clusters to teams.

Instead, I would like to propose the idea of cluster guardrails and conformance packs, similar to that of AWS Control Tower. With cluster guardrails, administrators are able to quickly and easily prescribe a number of common recommended practices across their accounts or organizations (similar to Control Tower - users can scope guardrails to specific accounts or organizational units).

Some example guardrails could include:

  - Pod disruption budgets are required for workloads
  - A minimum number of replicas is required
  - Readiness and liveness probes are required
  - StatefulSets are not allowed

In addition, with conformance packs similar to those of AWS Config, users can enable various packs (reliability, security, observability, etc.) that report any findings back to AWS Config. This matters because users will need to find and report specific findings, and remediate them, before enabling guardrails; alternatively, users may wish to report on certain aspects without making them a hard requirement. Using AWS Config allows for centralized reporting across a number of accounts and organizations, along with the historical changes of those resources over time (compliance - reporting on when changes were made and by whom).
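For illustration, AWS Config conformance packs are plain YAML templates of Config rules, so an EKS-focused pack could plausibly build on the managed rules that already exist today. The grouping in the sketch below is hypothetical, but `EKS_ENDPOINT_NO_PUBLIC_ACCESS` and `EKS_SECRETS_ENCRYPTED` are real AWS managed rule identifiers:

```yaml
# Sketch of a conformance pack template, deployable with
# `aws configservice put-conformance-pack`. The pack composition is
# hypothetical, but the managed rule identifiers exist today.
Resources:
  EksEndpointNoPublicAccess:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: eks-endpoint-no-public-access
      Source:
        Owner: AWS
        SourceIdentifier: EKS_ENDPOINT_NO_PUBLIC_ACCESS
  EksSecretsEncrypted:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: eks-secrets-encrypted
      Source:
        Owner: AWS
        SourceIdentifier: EKS_SECRETS_ENCRYPTED
```

Note these existing rules only evaluate control-plane settings via the AWS API; the in-cluster findings described above (probes, PDBs, replica counts) are what this proposal would add.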

The guardrail implementation would likely come in the form of admission webhooks: if a team tried to deploy a workload that did not meet the guardrails, the request would be denied and the workload would not be created on the cluster.
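As a rough sketch of the mechanics, an AWS-managed guardrail service would presumably register something like the following (the endpoint and names here are entirely hypothetical, shown only to illustrate the webhook plumbing):

```yaml
# Hypothetical registration for an AWS-managed guardrail webhook;
# the service URL and names are illustrative, not a real AWS endpoint.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: guardrails.eks.example.com
webhooks:
  - name: workloads.guardrails.eks.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail   # deny the request if the guardrail service is unreachable
    clientConfig:
      url: https://guardrails.eks.example.com/validate
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments", "statefulsets"]
```

Requests matching the rules are sent to the guardrail endpoint for a verdict before the object is persisted, which is the same interception point OPA Gatekeeper and Kyverno use today.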

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Cluster administrators constantly face challenges such as knowing what is running on the cluster, determining whether those workloads are configured for high availability to minimize or avoid disruptions, and determining the state of their cluster compliance (i.e., regulated industries that need to meet PII, PHI, PCI DSS, and similar compliance requirements, as well as internal business compliance requirements). In addition, determining the current state of their clusters is only half of the battle - they still have to remediate the findings, or work with a number of teams to get them remediated. To help mitigate this, administrators often turn to popular open source tools such as OPA Gatekeeper or Kyverno. However, this requires them to install these tools on each cluster, ensure users are not able to modify the services or rules, consistently update the services, and so on.

By moving this reporting and enforcement over to the AWS side, the heavy lifting of installing, maintaining, and updating these tools is removed, and administrators can simply configure account- or organization-wide rules that will prevent a number of misconfigurations and poor practices from ever entering their fleet of clusters.

Why is this useful besides removing the burden of installing and maintaining the rules enforcement service?

  1. Administrators can now safely delegate more responsibility to the teams creating and using clusters across a number of accounts, without worrying about misconfigurations or the removal of the services providing the enforcement. They simply set the requirements at the account/organization level, and all downstream users must comply or their associated resources will not be provisioned on the cluster.
  2. The process of performing cluster upgrades is made easier by knowing that certain configurations are required for workloads to be created on the clusters (i.e., pod disruption budgets are required, a minimum number of replicas is required, StatefulSets are not allowed, probes are required, etc.). This eliminates a whole host of issues prior to performing cluster upgrades.
  3. Ensuring clusters are in compliance is simplified and more strongly enforced now that the enforcement is outside of the users' reach.
  4. With conformance packs, administrators are able to create aggregated reports or even build custom automations based on any conformance findings.

Are you currently working around this issue? Customers today generally use OPA Gatekeeper, Kyverno, or similar tooling to enforce rules and standards.
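For reference, the Kyverno flavour of this workaround looks roughly like the policy below (a sketch adapted from the pattern-style probe policies in Kyverno's sample library, not a drop-in):

```yaml
# Sketch of a Kyverno guardrail requiring probes on all Pods
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-pod-probes
spec:
  validationFailureAction: Enforce   # reject non-compliant workloads
  background: false
  rules:
    - name: validate-probes
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Liveness and readiness probes are required."
        pattern:
          spec:
            containers:
              - livenessProbe:
                  periodSeconds: ">0"
                readinessProbe:
                  periodSeconds: ">0"
```

The burden described above is that this policy (and the Kyverno deployment itself) must be installed, protected from tampering, and kept up to date on every cluster in the fleet.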


stevehipwell commented 1 year ago

I don't see how this request fits into the EKS service, a managed Kubernetes control plane, in a way that fits both the architecture and the shared responsibility model. I can see some value in the aims, but I can't see how it would be practical to implement without making significant compromises on the AWS side (more potentially unbounded compute in the control plane), on the user's side (having to follow a common cluster architecture), or probably on both. It might be a valuable ancillary service, but for this use case I'd assume most competent cluster operators would be rolling their own; platforms with this level of coupling to the business generally need to be highly bespoke, or they end up as a thin abstraction that hides little complexity.

If AWS are interested in making it easier to consume EKS clusters, I'd suggest that getting Karpenter into the control plane should be the first step, as it'd give you a native KRM method for provisioning compute.
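For context, Karpenter's KRM surface is a NodePool (plus a cloud-specific EC2NodeClass); the abbreviated sketch below is illustrative, and the schema varies across Karpenter versions:

```yaml
# Abbreviated NodePool sketch; names and limits are illustrative
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:             # points at an EC2NodeClass holding AMI/subnet config
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"                 # cap total provisioned CPU for this pool
```

Running this controller inside the managed control plane would mean a cluster could scale from zero worker nodes with no customer-managed compute.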

bryantbiggs commented 1 year ago

I'm not sure I fully follow but happy to clarify any parts to better iron out the details of the proposal.

Are you saying that organization-wide controls that let you reject workloads that do not set readiness probes, PDBs, etc. would not be worthwhile to you? What about organizations that have a self-serve operating model and/or hundreds or thousands of clusters - wouldn't guardrails provide some level of usefulness without having to implement them yourself on each cluster?

joebowbeer commented 1 year ago

Given that Pod Security Admission (PSA) is the standard in this area, enabled by default in Kubernetes 1.25, and that when more control is needed, security best practices generally recommend installing a more advanced validating admission controller such as Kyverno or OPA Gatekeeper, I'm wondering how much of this EKS should address.

Many of the guardrails listed in the description will require a more advanced controller, such as Kyverno or OPA Gatekeeper.

I think a reasonable first step would be for EKS to provide some PSA configurability during cluster creation.
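For anyone following along, PSA is driven by standard namespace labels, so "PSA configurability during cluster creation" would essentially mean letting EKS set these (or cluster-wide defaults for them) at provisioning time. A minimal example, with an arbitrary namespace name:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                  # arbitrary example namespace
  labels:
    # reject Pods that violate the "restricted" profile
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.25
    # additionally warn on violations of the same profile
    pod-security.kubernetes.io/warn: restricted
```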

stevehipwell commented 1 year ago

> Are you saying that organization-wide controls that let you reject workloads that do not set readiness probes, PDBs, etc. would not be worthwhile to you? What about organizations that have a self-serve operating model and/or hundreds or thousands of clusters - wouldn't guardrails provide some level of usefulness without having to implement them yourself on each cluster?

@bryantbiggs I'm saying there are two major issues with this approach. Firstly, you're going to need compute to run this; EKS clusters don't have any compute by default, so this would need to either be in the control plane or handled another way, and it could get expensive even without intentional abuse (there is a reason Kubernetes is adopting CEL for native validation). Secondly, you're either going to need to be very constrained and limit your audience, or you're going to need to create a complex DSL to allow the level of customisation that everyone is bound to need in the fullness of time; end users would be better off picking a policy engine and learning its DSL instead of either getting locked into an oversimplified service or learning a vendor DSL.
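To make the CEL point concrete: native validation runs in-process in the API server, so a guardrail like "at least two replicas" is a single expression with no webhook backend to host or pay for. A minimal sketch with illustrative names:

```yaml
# Minimal in-process CEL guardrail; no external compute required
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: min-replicas
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: "object.spec.replicas >= 2"
      message: "Deployments must run at least 2 replicas."
---
# A binding is required to put the policy into effect
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: min-replicas
spec:
  policyName: min-replicas
  validationActions: ["Deny"]
```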

I do think that this kind of functionality could be supported in a sustainable way with the following items.

bryantbiggs commented 1 year ago

[Additional] Add support for ignoring a guardrail/check at the namespace level. For example, resources deployed in temp/test/dev/preview namespaces may intentionally not be configured for HA or "best practices" in order to reduce resource usage (and cost). This is the desired configuration, but it would violate the guardrail/check unless the ability to ignore those namespace(s) is provided.
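With the CEL-style policy sketched earlier in the thread, this opt-out maps naturally onto the binding's namespaceSelector; the exemption label below is hypothetical:

```yaml
# Rebinding the illustrative min-replicas policy so that namespaces
# carrying a (hypothetical) opt-out label are skipped
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: min-replicas
spec:
  policyName: min-replicas
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchExpressions:
        - key: guardrails.example.com/exempt
          operator: NotIn
          values: ["true"]
```

Namespaces without the label (or with any value other than "true") still get the guardrail; labelled temp/test/dev/preview namespaces are skipped.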