Investigate GitOps with Argo CD as deployment strategy

After working with Terraform as a deployment management tool for some months now, I can say that it kinda stinks. The state of the AWS/TF deployment in this repo has arrived at Durian stage: it's good, but it smells real bad. Terraform just isn't well-suited to this kind of work, and it seems to struggle to keep up. It's come to my attention that how we're doing things just doesn't resemble the best practices used by others, and we should seriously investigate alternatives.

The issue title gives away the game, but the use of GitOps is where we should be headed. This is an IaC approach, but it focuses on storing manifests describing a system to deploy in a git repo, and those manifests are the definitive source of truth for what the cluster should look like. Using a continuous deployment system like ArgoCD means that pushes to the git repo are picked up by the CD system and rolled out. Manual changes to the cluster can be configured to get rolled back so that the cluster never diverges from ground truth.

In this way, we configure a base cluster any way that makes sense, which should provide only the necessities, and then we use manifests and helm to configure everything else. That includes most of the core add-ons like Karpenter and EBS CSI as well as any of the other packages that we need for our individual use cases, like Jupyter, Dask, Argo workflows, Franklin, etc. We can maintain a collection of basic configurations in the repo, and copy in the parts we need. We can maintain multiple directories or multiple repos that describe the different deployments that we have running, and point ArgoCD to the correct location in the individual cluster instances. Different environments can be maintained in this same way, or we can use Kustomize to adjust base templates for different environments without worrying as much about diverging descriptions of what should ostensibly be the same deployment, with only minor differences.

What gives me pause from this approach is that we often will need supporting infrastructure that is not provisioned by Kubernetes—think RDS instances, route 53 endpoints/aliases, cognito user pools, etc. In the current state of affairs, the 1-services stage is where we provision those materials, but a solution-in-waiting is to use the AWS Controller for Kubernetes (ACK), which allows AWS resources to be specified via Kubernetes manifests using custom resources. These can then be managed in the same fashion using GitOps as all other cluster resources, making it unnecessary to have a bunch of custom Terraform code in different application stages as we are doing now.

The sad news is that ACK is still only ramping up. Each service you want to deploy is going to need a separate controller (which might also contribute to blowing up our base pod capacity on the core nodes), and not every service we want to use is going to have an ACK controller. The current list of available controllers and their release status is here. When this list meets our basic needs, this path becomes viable, and we can dramatically simplify our approach to deployment.

The desired future is this:

A simplified base cluster description Since the infrastructure demands for this are simple, we can feel free to not be shackled to Terraform; we can use eksctl or any other tool that makes cluster creation more accessible.
A running instance of ArgoCD or any other similar GitOps tool; this would likely be provisioned simply using Helm.
A git repo containing a menu of app configurations. Every app we deploy can be represented here, and that will give us a nice path for sharing/saving knowledge about how to deploy the things we need.
A repo describing each cluster deployment (possibly just a subtree in the repo above) that we can point ArgoCD to; these descriptions include ACK manifests for configuring AWS resources.

This vision strikes me as maintainable and simple. We came somewhat close to this, but our strategy is very heavily-dependent on Terraform, which has revealed itself to be less-than-ideal for managing k8s resources. Using the GitOps approach makes cluster management accessible to non-ops users, because it's just a matter of checking in YAML into a git repo. No special access is required, and it can be subject to all the same quality controls as a normal collaborative coding exercise.

For now, because of the lack of broad support by ACK of some services that are important to our use cases, we can't make this vision reality just yet. However, we can get somewhat close by simply moving away from including application/* stages in this deployment and lean on ArgoCD/GitOps to make those operations easier. We'll still need to provide a more-complicated-than-desired Terraform description that yet includes the 1-services stage, but we can lighten the load a bit, and take away from Terraform the responsibilities that it just isn't as adept at handling.

That's a lot of words, and it may not feel like an "issue", exactly, but here are the actionable parts:

[ ] Extract the important information from this issue into documentation in this repo
[ ] Add ArgoCD to the 1-services stage
[ ] Extract k8s resource descriptions from existing application stages into manifests
[ ] Move AWS resources from application stages into 1-services as conditionally-available resources
[ ] Figure out how to communicate AWS resource parameters (i.e., the information currently passed down the line using the output.tf files in stages 0-hardware and 1-services) to the k8s resources being configured in manifests
[ ] Allow ArgoCD to manage the deployment of resources

azavea / kubernetes-deployment

Investigate GitOps with Argo CD as deployment strategy #25