aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Cluster guardrails and conformance packs #1949

Open bryantbiggs opened 1 year ago

bryantbiggs commented 1 year ago


Tell us about your request Kubernetes clusters tend to span multiple personas with varying degrees of knowledge of Kubernetes recommended practices. This leads to scenarios where cluster administrators create a cluster provisioning pipeline, a cluster-as-a-service or namespace-as-a-service offering, or an internal "golden template" for how clusters should be created and configured. However, this requires a significant upfront investment from administrators to work out the automations, tooling, and configurations before they can start vending clusters to teams.

Instead, I would like to propose the idea of cluster guardrails and conformance packs, similar to that of AWS Control Tower. With cluster guardrails, administrators are able to quickly and easily prescribe a number of common recommended practices across their accounts or organizations (similar to Control Tower - users can scope guardrails to specific accounts or organizational units).

Some example guardrails could include:

  - Pod disruption budgets are required for workloads
  - A minimum number of replicas is required
  - Readiness and liveness probes are required
  - StatefulSets are not allowed

In addition, with conformance packs similar to those of AWS Config, users can enable various packs (reliability, security, observability, etc.) that report any findings back to AWS Config. This matters because users will need to find and report specific findings, and remediate them, before enabling guardrails; alternatively, users may wish to report on certain aspects without making them a hard requirement. Using AWS Config allows for centralized reporting across a number of accounts and organizations, along with the historical changes of those resources over time (compliance - reporting on when changes were made and by whom).
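For illustration, AWS Config conformance packs are plain YAML templates of Config rules, so an EKS-focused pack could plausibly build on the managed rules that already exist today. The grouping in the sketch below is hypothetical, but `EKS_ENDPOINT_NO_PUBLIC_ACCESS` and `EKS_SECRETS_ENCRYPTED` are real AWS managed rule identifiers:

```yaml
# Sketch of a conformance pack template, deployable with
# `aws configservice put-conformance-pack`. The pack composition is
# hypothetical, but the managed rule identifiers exist today.
Resources:
  EksEndpointNoPublicAccess:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: eks-endpoint-no-public-access
      Source:
        Owner: AWS
        SourceIdentifier: EKS_ENDPOINT_NO_PUBLIC_ACCESS
  EksSecretsEncrypted:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: eks-secrets-encrypted
      Source:
        Owner: AWS
        SourceIdentifier: EKS_SECRETS_ENCRYPTED
```

Note these existing rules only evaluate control-plane settings via the AWS API; the in-cluster findings described above (probes, PDBs, replica counts) are what this proposal would add.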

The guardrail implementation would likely come in the form of admission webhooks: if a team tried to deploy a workload that did not meet the guardrails, the request would be denied and the workload would not be created on the cluster.
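As a rough sketch of the mechanics, an AWS-managed guardrail service would presumably register something like the following (the endpoint and names here are entirely hypothetical, shown only to illustrate the webhook plumbing):

```yaml
# Hypothetical registration for an AWS-managed guardrail webhook;
# the service URL and names are illustrative, not a real AWS endpoint.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: guardrails.eks.example.com
webhooks:
  - name: workloads.guardrails.eks.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail   # deny the request if the guardrail service is unreachable
    clientConfig:
      url: https://guardrails.eks.example.com/validate
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments", "statefulsets"]
```

Requests matching the rules are sent to the guardrail endpoint for a verdict before the object is persisted, which is the same interception point OPA Gatekeeper and Kyverno use today.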

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Cluster administrators constantly face challenges such as knowing what is running on the cluster, determining whether those workloads are configured for high availability to minimize or avoid disruptions, and determining the state of their cluster compliance (i.e., regulated industries that need to meet PII, PHI, PCI DSS, and similar compliance requirements, as well as internal business compliance requirements). In addition, determining the current state of their clusters is only half of the battle - they still have to remediate the findings, or work with a number of teams to get them remediated. To help mitigate this, administrators often turn to popular open source tools such as OPA Gatekeeper or Kyverno. However, this requires them to install these tools on each cluster, ensure users are not able to modify the services or rules, consistently update the services, and so on.

By moving this reporting and enforcement over to the AWS side, the heavy lifting of installing, maintaining, and updating these tools is removed, and administrators can simply configure account- or organization-wide rules that will prevent a number of misconfigurations and poor practices from ever entering their fleet of clusters.

Why is this useful besides removing the burden of installing and maintaining the rules enforcement service?

  1. Administrators can now safely delegate more responsibility to the teams creating and using clusters across a number of accounts, without worrying about misconfigurations or the removal of the services providing the enforcement. They simply set the requirements at the account/organization level, and all downstream users must comply or their associated resources will not be provisioned on the cluster.
  2. The process of performing cluster upgrades is made easier by knowing that certain configurations are required for workloads to be created on the clusters (i.e., pod disruption budgets are required, a minimum number of replicas is required, StatefulSets are not allowed, probes are required, etc.). This eliminates a whole host of issues prior to performing cluster upgrades.
  3. Ensuring clusters are in compliance is simplified and more strongly enforced now that the enforcement is outside of the users' reach.
  4. With conformance packs, administrators are able to create aggregated reports or even build custom automations based on any conformance findings.

Are you currently working around this issue? Customers today generally use OPA Gatekeeper, Kyverno, or similar tooling to enforce rules and standards.
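For reference, the Kyverno flavour of this workaround looks roughly like the policy below (a sketch adapted from the pattern-style probe policies in Kyverno's sample library, not a drop-in):

```yaml
# Sketch of a Kyverno guardrail requiring probes on all Pods
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-pod-probes
spec:
  validationFailureAction: Enforce   # reject non-compliant workloads
  background: false
  rules:
    - name: validate-probes
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Liveness and readiness probes are required."
        pattern:
          spec:
            containers:
              - livenessProbe:
                  periodSeconds: ">0"
                readinessProbe:
                  periodSeconds: ">0"
```

The burden described above is that this policy (and the Kyverno deployment itself) must be installed, protected from tampering, and kept up to date on every cluster in the fleet.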


stevehipwell commented 1 year ago

I don't see how this request fits into the EKS service, a managed Kubernetes control plane, in a way that fits both the architecture and the shared responsibility model. I can see some value in the aims, but I can't see how it would be practical to implement without making significant compromises on the AWS side (more potentially unbounded compute in the control plane), on the user's side (having to follow a common cluster architecture), or probably on both. It might be a valuable ancillary service, but for this use case I'd assume most competent cluster operators would be rolling their own; platforms with this level of coupling to the business generally need to be highly bespoke, or they end up as a thin abstraction that hides little complexity.

If AWS are interested in making it easier to consume EKS clusters, I'd suggest that getting Karpenter into the control plane should be the first step, as it'd give you a native KRM method for provisioning compute.
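For context, Karpenter's KRM surface is a NodePool (plus a cloud-specific EC2NodeClass); the abbreviated sketch below is illustrative, and the schema varies across Karpenter versions:

```yaml
# Abbreviated NodePool sketch; names and limits are illustrative
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:             # points at an EC2NodeClass holding AMI/subnet config
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"                 # cap total provisioned CPU for this pool
```

Running this controller inside the managed control plane would mean a cluster could scale from zero worker nodes with no customer-managed compute.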

bryantbiggs commented 1 year ago

I'm not sure I fully follow but happy to clarify any parts to better iron out the details of the proposal.

Are you saying that organization-wide controls that let you reject workloads that do not set readiness probes, PDBs, etc. would not be worthwhile to you? What about organizations that have a self-serve operating model and/or hundreds or thousands of clusters - wouldn't guardrails provide some level of usefulness without having to implement them yourself on each cluster?

joebowbeer commented 1 year ago

Given that Pod Security Admission (PSA) is the standard in this area, enabled by default in Kubernetes 1.25, and that when more control is needed, security best practices generally recommend installing a more advanced validating admission controller such as Kyverno or OPA Gatekeeper, I'm wondering how much of this EKS should address.

Many of the guardrails listed in the description will require a more advanced controller, such as Kyverno or OPA Gatekeeper.

I think a reasonable first step would be for EKS to provide some PSA configurability during cluster creation.
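For anyone following along, PSA is driven by standard namespace labels, so "PSA configurability during cluster creation" would essentially mean letting EKS set these (or cluster-wide defaults for them) at provisioning time. A minimal example, with an arbitrary namespace name:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                  # arbitrary example namespace
  labels:
    # reject Pods that violate the "restricted" profile
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.25
    # additionally warn on violations of the same profile
    pod-security.kubernetes.io/warn: restricted
```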

stevehipwell commented 1 year ago

> Are you saying that organization-wide controls that let you reject workloads that do not set readiness probes, PDBs, etc. would not be worthwhile to you? What about organizations that have a self-serve operating model and/or hundreds or thousands of clusters - wouldn't guardrails provide some level of usefulness without having to implement them yourself on each cluster?

@bryantbiggs I'm saying there are two major issues with this approach. Firstly, you're going to need compute to run this; EKS clusters don't have any compute by default, so this would need to either be in the control plane or handled another way, and it could get expensive even without intentional abuse (there is a reason Kubernetes is adopting CEL for native validation). Secondly, you're either going to need to be very constrained and limit your audience, or you're going to need to create a complex DSL to allow the level of customisation that everyone is bound to need in the fullness of time; end users would be better off picking a policy engine and learning its DSL instead of either getting locked into an oversimplified service or learning a vendor DSL.
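To make the CEL point concrete: native validation runs in-process in the API server, so a guardrail like "at least two replicas" is a single expression with no webhook backend to host or pay for. A minimal sketch with illustrative names:

```yaml
# Minimal in-process CEL guardrail; no external compute required
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: min-replicas
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: "object.spec.replicas >= 2"
      message: "Deployments must run at least 2 replicas."
---
# A binding is required to put the policy into effect
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: min-replicas
spec:
  policyName: min-replicas
  validationActions: ["Deny"]
```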

I do think that this kind of functionality could be supported in a sustainable way with the following items.

bryantbiggs commented 1 year ago

[Additional] Add support for ignoring a guardrail/check at the namespace level. For example, resources deployed in temp/test/dev/preview namespaces may intentionally not be configured for HA or "best practices" in order to reduce resource usage (and cost). This is the desired configuration, but it would violate the guardrail/check unless the ability to ignore those namespace(s) is provided.
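With the CEL-style policy sketched earlier in the thread, this opt-out maps naturally onto the binding's namespaceSelector; the exemption label below is hypothetical:

```yaml
# Rebinding the illustrative min-replicas policy so that namespaces
# carrying a (hypothetical) opt-out label are skipped
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: min-replicas
spec:
  policyName: min-replicas
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchExpressions:
        - key: guardrails.example.com/exempt
          operator: NotIn
          values: ["true"]
```

Namespaces without the label (or with any value other than "true") still get the guardrail; labelled temp/test/dev/preview namespaces are skipped.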