aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Cluster init job to bootstrap new clusters #1666

Open · bryantbiggs opened 2 years ago

bryantbiggs commented 2 years ago

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? This is somewhat related to #923 (reliant on it, actually), but I didn't want to muck up that feature request with additional noise since on its own it's quite good and would be a great addition.

When provisioning new clusters, there are a number of repetitious steps that unfortunately have to be performed manually, especially when treating clusters as temporary/ephemeral (i.e. blue/green cluster upgrades, sandbox clusters, etc.). An example would be:

  1. Provision the cluster and a managed node group (at least one node, so that the core service pods come up healthy and the provisioning succeeds). There is an alternate route where you don't deploy a node group yet, make the necessary changes, and then deploy your node groups; it's six of one, half a dozen of the other in the end, whichever route you take.

  2. Once the cluster and node group are deployed, the first step we'll look at is pod networking:

    • One route is to remove the VPC CNI permissions from the node group role and rely on IRSA to give the VPC CNI (as a managed addon) the permissions it needs. I cannot launch a node group without the VPC CNI permissions and rely only on the IRSA role for the addon, because it's a chicken-and-egg problem: I have to provision first with the node holding the permissions, then come back, remove them, and rely on IRSA alone (see the CLI sketch after this list).
    • Another route is where I don't want to use the VPC CNI at all, and I need to delete it and provision the desired CNI/configuration. Again, this can only be done manually, since EKS assumes the VPC CNI by default (presumably for a good "out of the box" experience, which makes sense).

    • Note: the other managed addons provided by default fall into a similar pattern. In my experience, folks want to codify what goes into their cluster; trying to codify something that was provisioned outside of your control is one of the core issues here, and why #923 is tied directly to this topic.
  3. Lastly, at this point we want to switch to our preferred cluster management solution, which usually introduces another chicken-and-egg situation. This primarily concerns folks who are following GitOps practices, as well as folks who are trying to follow recommended practices by disabling the cluster control plane's public endpoint.

    • GitOps is great for clusters that do not have a publicly accessible endpoint; their operators sit within the cluster and pull changes in from external manifest repositories, image registries, etc. (pull method vs push)
    • However, how do we first provision our GitOps operator to start this pull-based process? I've seen hacks where clusters are provisioned with public endpoints enabled, the GitOps operator is push-deployed with a script, and then public endpoint access is removed. It's a cumbersome process, to say the least. If instead there were a job sitting within the cluster, or at least within the same network (access boundary) as the cluster, that bootstrapping could move into a job that executes once when a new cluster is created.
    • Once the GitOps operator is provisioned, all changes can be handled via the codified manifests that the operator monitors for changes
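
As a concrete illustration of the chicken-and-egg shuffle in step 2, a rough sketch of the manual cleanup with the AWS CLI might look like the following (cluster, role, and account values are placeholders; the IRSA role is assumed to already exist with the AmazonEKS_CNI_Policy attached and a trust policy for the cluster's OIDC provider):

# Point the managed VPC CNI addon at its own IRSA role...
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --service-account-role-arn arn:aws:iam::111122223333:role/vpc-cni-irsa

# ...then detach the CNI policy from the node role - something that could
# not be done before the first nodes had joined the cluster
aws iam detach-role-policy \
  --role-name my-node-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy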

✨ The Request ✨

The ability to provide a cluster init job that is run once upon cluster creation (think user data, but for clusters).

The changes in #923 are required to make this work without race conditions. If default components are installed, it's a race to remove the default components from the cluster before nodes come up using them, and to install the desired components (i.e. CoreDNS, CNI, etc.). If they are not on the cluster to begin with, users simply provide their desired configurations via the init job. The init job needs to run in a loop for those tools that are not designed as operators, but an operator (which is itself its own control loop) is the preferred approach. The longer form for ArgoCD is shown below; the preferred alternative would be to use the ArgoCD operator instead.

This init job also alleviates the shortcomings of running a Fargate-only cluster, where users have to manually intervene to configure CoreDNS for Fargate.
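
For reference, that manual intervention for Fargate-only clusters is typically a one-off patch along these lines, removing the annotation that pins CoreDNS to EC2 compute:

# Remove the compute-type annotation so CoreDNS pods can be scheduled
# onto Fargate instead of waiting for EC2 nodes that will never arrive
kubectl patch deployment coredns \
  -n kube-system \
  --type json \
  -p='[{"op": "remove", "path": "/spec/template/metadata/annotations/eks.amazonaws.com~1compute-type"}]'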

This lifecycle roughly looks like the following:

stateDiagram-v2
[*] --> ControlPlane
ControlPlane --> InitJob
InitJob --> InitJobRetry
InitJobRetry --> InitJob
ControlPlane --> NodeGroup
ControlPlane --> FargateProfile
InitJob --> [*]
NodeGroup --> [*]
FargateProfile --> [*]

  1. Users provision a bare cluster. When they submit the cluster creation, they have the ability to provide an init job spec; see a rough example below.
  2. The control plane is provisioned as usual, VPC endpoints are attached into the client account, etc.
  3. Following creation of the control plane, the init job is kicked off and runs asynchronously to node group/profile creation. Ideally, users employ an operator pattern for their bootstrapping mechanism of choice, which has its own self-contained control loop; but the init job itself can also be configured to back off and retry x times up to some timeout y (standard job specs within the manifest). The key will be how to properly surface failures to users when the init job does not succeed, because nodes will most likely fail to join the cluster as well in that scenario.
  4. Upon successful execution of the init job, nodes should join the cluster successfully and the cluster bootstrap process is complete. I am positive that there are other use cases people employ when bootstrapping a fresh cluster; this is just the most common set of steps I have encountered when creating clusters or helping clients with theirs (e.g. what steps are folks taking to restore persistent volumes or handle stateful sets when bootstrapping a new cluster?).

apiVersion: batch/v1
kind: Job
metadata:
  name: cluster-init # Job names must be valid DNS-1123 subdomains, so no camelCase
spec:
  template:
    spec:
      containers:
        - name: al2
          # most likely will need a new image which has kubectl installed by default to match cluster version
          image: public.ecr.aws/amazonlinux/amazonlinux:2
          command:
            - /bin/sh
            - -c
            - |
              # we will assume kubectl is installed either via the image or script
              # and the pod will most likely require IRSA
              kubectl create namespace argocd
              kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
              # the argocd CLI is likewise assumed to be installed via the image or script
              export ARGOCD_OPTS='--port-forward-namespace argocd'
              ARGOCD_INIT_PWD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
              ARGOCD_NEW_PWD=$(aws secretsmanager get-secret-value --secret-id prod/mysecret --query SecretString --output text)
              argocd account update-password --current-password "$ARGOCD_INIT_PWD" --new-password "$ARGOCD_NEW_PWD"
              # Bootstrap cluster using app of apps pattern https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/
              # One suite of apps could be management apps - CoreDNS, Cilium CNI, etc. - another for monitoring/observability,
              # another for applications, etc. Up to the cluster administrator(s) to decide
              argocd app create apps \
                --dest-namespace argocd \
                --dest-server https://kubernetes.default.svc \
                --repo https://github.com/argoproj/argocd-example-apps.git \
                --path apps
              # --async is important since the sync won't succeed until nodes are launched;
              # the backoff and retries give the nodes time to come up
              argocd app sync apps \
                --async \
                --retry-backoff-duration 2m \
                --retry-limit 5 \
                --timeout 630
      restartPolicy: Never
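
Note that the example above glosses over identity: the job's pod would need in-cluster RBAC, and the Secrets Manager call implies an IRSA role. A minimal sketch of that wiring, with every name hypothetical, might be:

# Hypothetical wiring for the init job's service account; the IAM role in
# the annotation is a placeholder and would need to exist already.
kubectl create serviceaccount cluster-init -n kube-system
kubectl create clusterrolebinding cluster-init \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:cluster-init
# IRSA annotation so the pod can call AWS APIs such as Secrets Manager
kubectl annotate serviceaccount cluster-init -n kube-system \
  eks.amazonaws.com/role-arn=arn:aws:iam::111122223333:role/cluster-init

The Job spec would then reference this via serviceAccountName (with the namespaces aligned).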

Which service(s) is this request for? EKS

Are you currently working around this issue? A mix of hacky scripts and manual intervention - "playbooks", if you will.

stevehipwell commented 2 years ago

@bryantbiggs I think https://github.com/aws/containers-roadmap/issues/1559 would be an alternative to this, assuming the behaviours required were provided for. If I ever get some free time it's near the top of my project list, although I might create a more generic operator suitable for all kubeadm (style) clusters.

For EKS my use case is managing the default kubeadm services (CoreDNS & kube-proxy) and the CNI (aws-vpc-cni). I'd also like to see it optionally support the "required" services (aws-load-balancer-controller, cluster-autoscaler, aws-node-termination-handler, metrics-server, etc).

bryantbiggs commented 2 years ago

> @bryantbiggs I think #1559 would be an alternative to this, assuming the behaviours required were provided for. If I ever get some free time it's near the top of my project list, although I might create a more generic operator suitable for all kubeadm (style) clusters.
>
> For EKS my use case is managing the default kubeadm services (CoreDNS & kube-proxy) and the CNI (aws-vpc-cni). I'd also like to see it optionally support the "required" services (aws-load-balancer-controller, cluster-autoscaler, aws-node-termination-handler, metrics-server, etc).

Not quite - I am looking to avoid this statement from #1559: "Initially this operator could be manually installed into a https://github.com/aws/containers-roadmap/issues/923 [bare cluster] and used to manage add-ons."

The overall idea I am pitching here is two parts, mostly centered on the 2nd:

  1. Bare cluster free of any defaults - this is covered nicely by #923
  2. A way to programmatically bootstrap a cluster where (ideally) I can get my GitOps operator installed, and that operator takes over everything else from there. This is where the idea of a "user data" type flow came from - just light bootstrapping so that cluster configuration is handed off to another tool running within the cluster. The "user data" flow is key because I want to codify it and avoid manual intervention between provisioning the EKS resources and my GitOps operator provisioning cluster resources; a purely illustrative API shape is sketched below.
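
To make the "user data, but for clusters" idea concrete, a hypothetical CLI shape - note that --bootstrap-job does not exist today and is only meant to illustrate the request; the rest is standard create-cluster syntax:

# Hypothetical: --bootstrap-job is NOT a real flag today; the name, role,
# and subnet values are placeholders
aws eks create-cluster \
  --name sandbox \
  --role-arn arn:aws:iam::111122223333:role/eks-cluster \
  --resources-vpc-config subnetIds=subnet-aaa,subnet-bbb \
  --bootstrap-job file://cluster-init-job.yaml
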
stevehipwell commented 2 years ago

@bryantbiggs I suspect that any solution to this is going to require compute somewhere; this issue would need something to run the bootstrap script which IMHO makes it equivalent to #1559.

I'm pretty sure you could do this already with a Lambda function - for TF, an aws_lambda_invocation resource. You could then run further config once this function has completed. I don't think you'd have a dependency on #923, as your bootstrap could clean all of this up before any nodes are connected. This is something that could be added to the OSS EKS TF module: expose a variable to pass in a function to execute on create, and maybe a peer module to create the Lambda and execute some shell script with the correct dependencies.
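
Outside of Terraform, the same workaround could be driven from a plain script; a minimal sketch, assuming a pre-existing bootstrap function (the function name and payload are placeholders):

# Wait for the control plane, then hand off to a bootstrap Lambda that has
# network access to the cluster endpoint. The function name is hypothetical.
aws eks wait cluster-active --name my-cluster
aws lambda invoke \
  --function-name eks-cluster-bootstrap \
  --cli-binary-format raw-in-base64-out \
  --payload '{"cluster": "my-cluster"}' \
  response.json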

For #1559, the operator could be run on Fargate (like Karpenter can be, for the same reason) and would be expected to use host networking so it can precede CNI configuration. Obviously, with an operator it wouldn't matter if it were run on a cluster with incorrect config and attached nodes, since it's running a desired-state control loop with eventual consistency.

jwenz723 commented 1 year ago

Couldn't you use Terraform to execute the commands which you want to occur at cluster creation time?

For example:

In this example they are creating a Fargate profile to host the Karpenter app.

bryantbiggs commented 1 year ago

yes, I am familiar with those since I created them 😬 - this request was more about how best to transition from infrastructure provisioning over to cluster provisioning smoothly and seamlessly. The somewhat ideal, high-level flow being:

  1. Create cluster
  2. Once the control plane is ready, a GitOps controller is installed (this was sort of the crux of the "bootstrap" idea) and configured to point at the appropriate manifest repositories
  3. The GitOps controller provisions/reconciles the resources defined at the target manifest location(s)

There are of course other, more complex scenarios such as standing up a cluster using a different CNI without chaining, managing core addons such as CoreDNS through the GitOps controller, etc.

But that's the gist of it - provision the cluster, somehow get a controller installed and pointed at manifests, and said controller reconciles the intended cluster state. No hacks, no manual intervention steps, no multi-step applies required (ideally).
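
For step 2, the "somehow" today usually looks like the following (a sketch assuming the Argo CD Helm chart; any GitOps controller would do) - this is the piece the request would have EKS run on the user's behalf:

# Install a GitOps controller once the control plane is reachable;
# the chart and repo shown are Argo CD's public ones
helm repo add argo https://argoproj.github.io/argo-helm
helm install argocd argo/argo-cd --namespace argocd --create-namespace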

gmolaire commented 1 year ago

Any progress on this issue? Or is everyone using the workaround?