department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

[SPIKE] K8s & Argo CD learning #6355

Closed: acrollet closed this issue 3 years ago

acrollet commented 3 years ago

Description

As a CMS devops engineer, I wish to gain familiarity with Argo CD, so that I can deploy the CMS on EKS.

Acceptance Criteria

Notes

This is a 1-2 week learning hackathon to get ready for the migration, so that we build some baseline knowledge up front rather than having to learn everything by doing, and can start the work with some foundations in place.

olivereri commented 3 years ago

@acrollet This shouldn't be in-progress right? Just refined?

acrollet commented 3 years ago

Re-estimated 9/13

olivereri commented 3 years ago

K8s and Argo Slack learning thread with OPs: https://dsva.slack.com/archives/CJYRZK2HH/p1631294685349500

ElijahLynn commented 3 years ago

K8s and Argo Slack learning thread with OPs: https://dsva.slack.com/archives/CJYRZK2HH/p1631294685349500

From the above Slack thread, here are some of the suggestions:

olivereri commented 3 years ago

Gotta keep an eye on Argo Rollouts, which gives ArgoCD the ability to do BlueGreen and Canary deployments. Saw a mention in the Ops Standup notes for September 9th; it's been deployed to Ops' utility apps.

ndouglas commented 3 years ago

I think I'm gonna:

My dotfiles contain/double as my homelab IaC and Ansible's self-documenting enough to serve as an excellent reference for me, so this should be really helpful in the long run.

I made this (very loose) plan last Wednesday afternoon, and Thursday I set about actually creating a cluster locally.

olivereri commented 3 years ago

I'm starting to think that ArgoCD is a smaller piece of getting the CMS onto K8s than we think. It seems like a lot more of the work is going to be spent crafting pipeline jobs in GitHub Actions; ArgoCD looks pretty set-it-and-forget-it. We'll define a manifest file that references an ECR-hosted container image, update that reference programmatically (GHA), and ArgoCD's Git polling will catch the change and deploy the new image.

This looks like a really simple version of what we may end up doing:

https://github.com/department-of-veterans-affairs/vsp-infra-grafana https://github.com/department-of-veterans-affairs/vsp-infra-application-manifests/tree/main/apps/vsp-operations/grafana

The vsp-infra-grafana repo holds the Grafana configuration in a Dockerfile. When I make a configuration change to it, commit, then push, GitHub Actions:

  1. Builds and pushes the container to ECR.
  2. Checks out the vsp-infra-application-manifests repo.
  3. Edits the Grafana manifest file with the new Docker image tag.
  4. Commits and pushes the new manifest file.
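A rough sketch of what that kind of workflow could look like (not the actual VSP workflow; the AWS region, secrets, and manifest file name here are placeholders):

    name: build-and-bump
    on:
      push:
        branches: [main]

    jobs:
      build-and-bump:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2

          - uses: aws-actions/configure-aws-credentials@v1
            with:
              aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
              aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
              aws-region: us-gov-west-1              # placeholder region

          - name: Log in to ECR
            id: ecr
            uses: aws-actions/amazon-ecr-login@v1

          # 1. Build and push the container image to ECR
          - name: Build and push image
            run: |
              IMAGE="${{ steps.ecr.outputs.registry }}/grafana:${GITHUB_SHA}"
              docker build -t "$IMAGE" .
              docker push "$IMAGE"

          # 2. Check out the manifests repo
          - uses: actions/checkout@v2
            with:
              repository: department-of-veterans-affairs/vsp-infra-application-manifests
              token: ${{ secrets.MANIFEST_REPO_TOKEN }}   # assumption: a token with push access
              path: manifests

          # 3. Edit the Grafana manifest with the new image tag
          # 4. Commit and push it; ArgoCD's Git polling picks up the change and deploys
          - name: Bump image tag
            run: |
              cd manifests/apps/vsp-operations/grafana
              sed -i "s|\(grafana:\).*|\1${GITHUB_SHA}|" deployment.yaml   # manifest file name is a placeholder
              git config user.name "github-actions"
              git config user.email "github-actions@users.noreply.github.com"
              git commit -am "Bump Grafana image to ${GITHUB_SHA}"
              git push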
ndouglas commented 3 years ago

This is kind of a journal entry about what I've been up to.

Thursday

I went through Kubernetes the Hard Way on a couple of VMs, mostly just to reacquaint myself with some of the basic structure, then wiped 'em out.

My goal was to get a cluster working cleanly in LXC containers. LXC has a lot of appeal to me because of its bovine nature. I've created hundreds of VMs over the years for various reasons and they always seemed like pets. Even when I got into Vagrant. Part of that might be because of the stateful, sprawling nature of Drupal; the setup process was long enough that I never wanted to kill the VMs.

Launching an LXC container is more akin to launching a Docker container: volume sharing works similarly, you can script actions within the container, and so on.

Unfortunately, the last time I considered this, there was aggravation: a kernel issue prevented working pod networking, and /dev/kmsg, which is required by K8s, wasn't created inside the container. /dev/kmsg provides access to the kernel's printk buffer, which I think is the source of dmesg et al. I'm not particularly interested in maintaining/applying patches and manually compiling Linux -- I did that plenty when I ran Gentoo fifteen years ago and spoiler alert it did nothing for my social life. And my attempt at bind-mounting /dev/kmsg failed, and symlinking it worked occasionally -- until it didn't and the container would peg the CPU at launch until reboot.

Anyway, turns out that a patch for the former made it into Linux at some point, and I figured out the syntax for bind-mounting /dev/kmsg -- apparently I'd messed that up. There were a couple of other annoyances along the way, but ultimately I was able to use geerlingguy.kubernetes with only a few minor modifications to the LXC config:

        lxc.apparmor.profile: unconfined
        lxc.apparmor.raw: mount,
        lxc.cap.drop:
        lxc.cgroup.devices.allow: a
        lxc.mount.auto: proc:rw sys:rw
        lxc.mount.entry: /dev/kmsg dev/kmsg none defaults,bind,create=file

And I was up and running! 🎉 Creating a cluster takes only a couple of minutes, and the containers shut down and relaunch essentially immediately. I could tighten up the AppArmor profile, but I'm in rapid-prototyping mode right now.
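For anyone curious, applying the role is about this much playbook (a minimal sketch; the inventory group names are mine, and the accepted kubernetes_role values vary a bit by role version):

    - hosts: k8s_cluster
      become: true
      roles:
        - geerlingguy.kubernetes
      vars:
        # 'control_plane' in current releases of the role; older releases used 'master'
        kubernetes_role: "{{ 'control_plane' if inventory_hostname in groups['k8s_control_plane'] else 'node' }}"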

Friday

So I deployed Traefik via Helm and curled it and... connection refused. wat.jpg

I cycled through a few things. I hadn't used Helm before (it was very new when I last had an actual cluster in operation, and I wrote all of my YAML by hand in the snow 30 miles uphill both ways). For whatever reason, probably just sheer stupidity on my part, it always takes me a couple tries to get Traefik working like it's supposed to. In fact, I'm convinced Traefik hates me, and only grudgingly will settle down eventually and do as told. But I like Traefik and have never had an issue with it once I get it working, so I keep using it.

And the aggravating thing was that it worked intermittently, then I'd try it on another cluster (I have four) and it wouldn't work. Not gonna lie, I felt some despair. NodePort's a pretty simple concept but I just. couldn't. get. it. to. work.

Eventually I realized that the port was working on one of the nodes. Just not the master node, which I'd been testing most of the time. Of course, it was working on the node where the pod was deployed. Which meant something was broken with kube-proxy. I execed into a pod, tried to ping the clusterIP of another pod in the same namespace, and... nothing. I tracerouted and found that the route actually went through my home router. So there's some kind of CNI issue. I thought there might be an issue with iptables or the kernel networking (see above issue with the br_netfilter module) or just something dumb I was doing wrong (I hadn't bothered to mount /lib/modules into the container, so I thought there might be a basic networking incompatibility or something).

After troubleshooting for a while, I realized that there was some incompatibility between flannel (or how it was configured) and my infrastructure. I recreated the cluster with calico, but that ran into a different problem, and Geerling's role has a networking quirk that would have made troubleshooting the network a little more aggravating.

However, weave worked out of the box! So after quite some time, troubleshooting, and aggravation, I was able to get kube-proxy functioning as intended, and the cluster actually clusters and behaves as it should. I'm not sure where the issues with flannel and calico lie, and I'm not terribly inclined to troubleshoot further on the clock; ultimately this is work, we're going to be using EKS, and presumably the EKS VPC CNI will work well enough that these other CNIs will be largely irrelevant and I won't need to know their inner workings. That said, it was an educational troubleshooting experience.

So today (Monday) I'm going to continue on this path, trying to remember what I've forgotten and learn things I've never learned. My troubleshooting revealed some serious gaps in my ability to diagnose and debug K8s stuff, so I'd like to dive deeper into how pod networking in general works and how I can git gud at diagnosing issues there. TL;DR: I need to actually work with Kubernetes, now that I can reliably create a working cluster.

I'm a little behind schedule in terms of where I hoped to be at this point -- I didn't expect to blow a work day and a substantial portion of a weekend debugging CNI -- but it's all good in the chüd if I end up being a little more competent to deal with the inevitable issues when they arise.

ndouglas commented 3 years ago

Monday

I honestly don't remember setting up an ingress controller for my k8s cluster in 2016 or whenever that was. I know I was able to access the services, but I don't remember anything about the implementation, or even what flavor I used (probably Nginx or HAProxy -- I don't think I'd met Traefik yet, but I knew Nginx and used HAProxy on my firewall).

Ingress is where I find Kubernetes gets complicated. Again, this presumably isn't a problem on EKS, because as I understand it ingresses will just spin up an ALB and things Just Work™. (Disclaimer: I'm gonna do the EKS Workshop soon, I have a tab open for it, but I've never read any of the docs.) But on a bare metal installation, you have to bring your own ingress controller, LoadBalancer-type services do diddly-squat by themselves, and so there are more nightmares lurking around the corner.

I wanted my bare metal install to match production as much as possible, though, so rather than going straight to Traefik I thought it might be cool to deploy MetalLB, a bare metal load-balancer that, theoretically, provides the functionality that comes stock on EKS, GKE, and other cloud providers. I decided I'd allocate a couple hours to try to get it working.

MetalLB offers two modes of operation:

- Layer 2 mode, where one node answers ARP/NDP requests for the service IPs on the local network
- BGP mode, where the nodes peer with your router(s) and advertise routes to the service IPs

I attempted both, but no dice. The issue might've been a result of the structure of my network -- I have three Proxmox servers, each of which manages a distinct subnet, like 10.1.0.0/24, 10.2.0.0/24, etc. The k8s clusters span the subnets to mimic availability zones, so the cluster karhold is composed of nodes karstark, kellington, and kettleblack at 10.1.0.110, 10.2.0.110, and 10.3.0.110. (Don't worry, these are cattle, not pets -- the hostnames are assigned automatically and DHCP reserved based on generated MAC addresses.)

So handing Layer 2 mode a contiguous range of IP addresses was probably problematic from the start, because there was no guarantee that the node hosting the pod actually existed within the specified subnet, and then routing was bananas. EDIT: Confirmed.
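For context, in the ConfigMap-based configuration MetalLB used at the time, a Layer 2 pool is literally just a contiguous range handed to the controller, something like this (addresses are illustrative):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: metallb-system
      name: config
    data:
      config: |
        address-pools:
          - name: default
            protocol: layer2
            addresses:
              - 10.1.0.200-10.1.0.220

In Layer 2 mode a single node answers ARP for each allocated address, which is part of why spanning subnets gets weird.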

I'm not that deep into networking and BGP was fairly new to me, but hey, good opportunity to learn. So I set up BGP via FRR on my firewall/router, allocated AS IDs for each subnet, neighbored everyone to everyone else, recreated everything in the MetalLB namespace, and... still didn't work. EDIT: (And I think it wouldn't anyway).

Now, there are other complications here, given that the Proxmox nodes have firewalls, bridge interfaces, and so forth, and the node containers are running iptables, etc. But I feel like it still should've worked to some extent. I wasn't able to arping the MetalLB-allocated IP addresses, and arp showed the IP address but did not resolve it to a MAC address. IDK what that's about.

Anyway, at this point (an hour or two in), I reached my timebox for trying to get MetalLB working. I might return to it later, but if so, it'll be on my own time.

I soothed myself by watching a Geerling video about dealing with Drupal in Kubernetes, then set about relearning ingress.

That wasn't without its own hiccups, of course, but it's an area somewhat more within my experience -- beating my head against software configuration, rather than beating my head against routing tables. The end result was that I was able to get Traefik proxying nodes across several different namespaces pretty smoothly.

At that point, my brain was thoroughly fried, so I retired to the basement and watched Flesh+Blood (1985), which was... weird.

olivereri commented 3 years ago

Kubernetes Install Journey

Helpful Guides

These two links below offer a pretty quick and comprehensive guide to setting up Ubuntu in WSL and installing MicroK8s. They cover configuring WSL, configuring the Ubuntu VM for use with WSL 2, and the Kubernetes configuration itself. https://ubuntu.com/blog/kubernetes-on-windows-with-microk8s-and-wsl-2 https://wsl.dev/wsl2-microk8s/

This really helped in going from zero to K8s and ArgoCD installed in a short amount of time. The most beneficial thing is that it covers some of the setup and configuration pitfalls with K8s.

Things like:

  1. ClusterIP and External IP mapping
  2. Installing MetalLB to automagically expose services externally.

Quality of Life changes

Windows will annoyingly make a chime or bell sound on tab completions that have no match. I'm using Windows Terminal, and there is an easy profile change to stop this from happening, documented in this superuser thread: https://superuser.com/questions/1108120/how-to-disable-bash-on-windows-notification-sound-effect

Tips

ArgoCD admin password retrieval:

    microk8s.kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Kubernetes Dashboard token retrieval:

    microk8s.kubectl -n kube-system describe secret $token

Problems

Victories

Kubernetes installed and accessible via localhost in browser: (screenshot)

ArgoCD installed and accessible via localhost in browser: (screenshot)

ndouglas commented 3 years ago

Tuesday

I got sidetracked by #6251 and a couple minor things, so this will be comparatively brief.

I started playing with ArgoCD now that I got the ingress figured out. To log in, you gotta retrieve a secret from the cluster. That was when I realized that I hadn't really bothered to think much about secrets. I saw that @olivereri had opened some issues, and sure enough, a couple of them alluded to SSM's parameter store. I've used it before on other projects, so I'm familiar with its general characteristics, but not with Kubernetes. So it looks like I need to set up a CSI driver, which I haven't done before and didn't actually start. I did spend some amount of time thinking about setting up HashiCorp Vault, but ultimately came to the conclusion that if we're using SSM directly I should mirror that. EDIT: Doesn't look like this will actually work outside of EKS. I can still use the SSM param store, but not in the same way AFAICT.

This was preëmpted by my realization that I hadn't really thought about persistent storage either, other than to say "I'll worry about that later once I can actually access my cluster." I'd used GlusterFS before, but I think we're going with EBS and EFS (or maybe just EFS) in EKS, so the natural bare metal equivalent there is Local – which IDK if I'll bother with – and NFS. I wrote some Ansible to add NFS exports for each cluster, but weirdly enough this ended up taking quite a while and included a long overdue refactoring of my inventory into group_vars and host_vars. That was off-the-clock though.

I also poked around in Argo CD some, but not much -- mostly just enough to realize that I still have some infrastructural deficits to make up for before I can get full functionality from it. I dug some into the App of Apps pattern, which appeals to my love of solving problems by adding a layer of indirection.
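For reference, the pattern boils down to a parent Application pointed at a directory of child Application manifests, roughly like this (repo URL and path are placeholders):

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: apps
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/cluster-apps.git   # placeholder
        targetRevision: main
        path: apps                   # directory containing child Application manifests
      destination:
        server: https://kubernetes.default.svc
        namespace: argocd
      syncPolicy:
        automated:
          prune: true
          selfHeal: true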

So... hoping I can resolve all of this today 😃

olivereri commented 3 years ago

Found something interesting called Resource Hooks in ArgoCD https://argoproj.github.io/argo-cd/user-guide/resource_hooks/

Synchronization can be configured using resource hooks. Hooks are ways to run scripts before, during, and after a Sync operation. 
Hooks can also be run if a Sync operation fails at any point. Some use cases for hooks are:

- Using a PreSync hook to perform a database schema migration before deploying a new version of the app.
- Using a Sync hook to orchestrate a complex deployment requiring more sophistication than the Kubernetes rolling update strategy.
- Using a PostSync hook to run integration and health checks after a deployment.
- Using a SyncFail hook to run clean-up or finalizer logic if a Sync operation fails. SyncFail hooks are only available starting in v1.2

Hooks can be any type of Kubernetes resource kind, but tend to be Pod, Job or Argo Workflows. Multiple hooks can be specified as a comma separated list.
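In practice a hook is just a resource with an annotation; for example, a PreSync Job that runs database updates before the new version syncs might look roughly like this (image and command are illustrative, not our actual setup):

    apiVersion: batch/v1
    kind: Job
    metadata:
      generateName: drupal-db-updates-
      annotations:
        argocd.argoproj.io/hook: PreSync
        argocd.argoproj.io/hook-delete-policy: HookSucceeded   # clean the Job up after it succeeds
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: drush
              image: registry.example.com/cms:latest   # placeholder image
              command: ["drush", "updatedb", "-y"]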

ndouglas commented 3 years ago

Wednesday

So... hoping I can resolve all of this today 😃

(Requiem for a Dream meme)

Same energy.

I'd started two things on Tuesday that I wanted to resolve Wednesday: secrets with a lifetime longer than the cluster lifetime and persistent volume claims.

I used external-secrets to handle retrieving secret values from the AWS SSM parameter store. This Just Worked™. It's not a direct analogue for the way things work in EKS, but it's sufficiently close. It's read-only AFAICT, which makes the policy easy, only 5-6 discrete actions or so. Not expecting any surprises in the Ops onboarding there.
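For the record, assuming the external-secrets.io operator CRDs (the older kubernetes-external-secrets chart uses a different schema), an SSM-backed secret ends up looking roughly like this; store name, secret name, and parameter path are placeholders:

    apiVersion: external-secrets.io/v1alpha1
    kind: ExternalSecret
    metadata:
      name: cms-db
      namespace: cms
    spec:
      refreshInterval: 1h
      secretStoreRef:
        name: aws-parameter-store    # a ClusterSecretStore configured for AWS Parameter Store
        kind: ClusterSecretStore
      target:
        name: cms-db                 # the Kubernetes Secret that gets created
      data:
        - secretKey: password
          remoteRef:
            key: /cms/db/password    # placeholder parameter name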

I used the nfs-subdir-external-provisioner, which was known as the NFS client provisioner back in the day, to set up the NFS shares as functional equivalents of the EFS shares I think we're gonna be using. This also Just Worked™. I was a bit concerned that I'd need to do some dark magic or broaden the range of IP addresses in the /etc/exports for each share, but the provisioner's mount requests are seen as coming from the k8s node and not the pod, so this was clean and worked as expected. A test pod successfully made a claim against the NFS server, a volume was allocated, and a file was stored to it. All peachy.
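A claim against it is then just a normal PVC pointed at the provisioner's storage class (a minimal sketch; the class name depends on how the chart was installed):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-claim
    spec:
      storageClassName: nfs-client   # the chart's default class name
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 1Gi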

So, with secrets and persistent volumes in place, I decided to take a detour and set up Drupal, just to see what would work and what wouldn't.

if you are reading this, we need developers

So that was all fine, and things seemed to work as expected.

Well, this part isn't ideal:

(screenshot from 2021-09-23)

First, it's not TLS, second, the port isn't what I'd like it to be. So I set about trying to do a few things:

  1. Get Traefik to expose itself on a fixed port, ideally 80/443, on each node.
  2. Get Traefik to request certs from ACME/LetsEncrypt and use them to secure connections for bare domains at least.
  3. Get Traefik to use cert-manager to manage and persist certificates and share them between nodes and use those to secure connections.
  4. Get Traefik to proxy gRPC connections intelligently to ArgoCD so that ArgoCD's CLI will function as intended.

As you can see, all of these things have to do with Traefik, and they collectively deducted about 6-7 months from my expected lifespan over the past 24 hours.

I think the problem came down to the Helm chart. I like Helm, but I now think it should only be used after you've installed, configured, and tested the application without Helm. And Traefik has a verbose chart that passes through a ton of configuration, just as ArgoCD's does.

But $INCLUSIVE_PRONOUN, it's a bear to troubleshoot these things when there's a problem.

I haven't figured out the root cause of my issues, so we'll have to wait for a future comment for a post-mortem of my failure to debug this problem.

TL;DR:

So that's a drag.

Today I'm going to go for a walk, eat breakfast, and then get back into the 💩 and see what issues I can resolve. Traefik's kind of important, being the ingress controller and all. I also want to do the EKS workshop and actually build some Argo CD applications. The former I'd prefer to leave until I have a fully functional local cluster, and the latter I can't really do locally until I get the gRPC proxying working.

ndouglas commented 3 years ago

Bonus meme from yesterday in response to something Elijah said: (meme image)

olivereri commented 3 years ago

From the prophetic words of @indytechcook

Just Google it, I'm sure someone has tried to solve this

A quick Google search turned up a bounty of wonderful information, including this from our friend Jeff Geerling: https://www.jeffgeerling.com/blog/2019/running-drupal-kubernetes-docker-production

Also, a concrete example of creating K8s Deployments and Services to launch Drupal: https://github.com/IBM/drupal-on-kubernetes-sample

ndouglas commented 3 years ago

Thursday

I'd set up NFS PVs on Wednesday, but nobody wants to see Drupal with MariaDB running off NFS. So I set up local-static-provisioner to allow provisioning local volumes, and altered my Ansible to attach 32GB (thinly provisioned) of SSD to each LXC container at creation. dd'ing /dev/urandom to a dummy file gave 177MB/sec, which isn't bonkers, but between thin provisioning and presumably suboptimal options for dd and so forth, I'm pretty happy.
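For context, local volumes have no dynamic provisioner; the local-static-provisioner just discovers disks and publishes them as PVs under a storage class like this (the class name is the one from the project's examples):

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: fast-disks
    provisioner: kubernetes.io/no-provisioner    # local PVs are statically provisioned
    volumeBindingMode: WaitForFirstConsumer      # delay binding until the pod is scheduled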

Predictably, it wasn't quite so easy -- MariaDB went into CrashLoopBackOff. When I investigated the logs, I saw an error: path is not a shared or slave mount. This triggered investigation:

Red Hat:

A shared mount allows the creation of an exact replica of a given mount point. When a mount point is marked as a shared mount, any mount within the original mount point is reflected in it, and vice versa.

This is apparently enabled by default in most major Linuxes/Linuxen. With one exception: Ubuntu, which I use as the template for my LXC containers 😞

This is easy to fix: just run mount --make-rshared /. As soon as I did this, the PVCs were provisioned, MariaDB exited CrashLoopBackOff, and I could install Drupal again. After some back and forth over whether to do this in rc.local, systemd, or an LXC container hook, I ended up going with systemd, just because Ansible handles it very cleanly.
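A sketch of what that can look like in Ansible (the unit name and task wording are mine):

    - name: Install a oneshot unit that makes / an rshared mount
      ansible.builtin.copy:
        dest: /etc/systemd/system/make-rshared.service
        content: |
          [Unit]
          Description=Make / a shared mount for kubelet and local volumes
          Before=kubelet.service

          [Service]
          Type=oneshot
          ExecStart=/usr/bin/mount --make-rshared /
          RemainAfterExit=true

          [Install]
          WantedBy=multi-user.target

    - name: Enable and start the unit
      ansible.builtin.systemd:
        name: make-rshared.service
        enabled: true
        state: started
        daemon_reload: true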

cert-manager needed some tweaking -- after the thing with Docker, I was being careful not to blow through ACME's API limits so I used the staging server, but once that was working cleanly with Traefik, Traefik wouldn't stop handing me the staging certs. I'm not sure at what level the caching was happening, but that was an annoyance.
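For reference, the staging issuer is the same shape as the production one, just a different ACME server URL (email is a placeholder):

    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-staging
    spec:
      acme:
        server: https://acme-staging-v02.api.letsencrypt.org/directory   # swap for the production URL once it works
        email: someone@example.com
        privateKeySecretRef:
          name: letsencrypt-staging-account-key
        solvers:
          - http01:
              ingress:
                class: traefik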

Traefik, tho... oh, Traefik. Traefik, Traefik, Traefik. After some tweaking, I got Traefik to function at ports 80 and 443. But after a while, I realized that subsequent redeploys would break it -- I'm not sure if ports weren't being unbound correctly, or what. But then, I rebuilt all four clusters from scratch, and... three clusters bound to the ports correctly, but the fourth did not. After deleting the namespace and redeploying Traefik, only two of them worked.

So I went back to exposing Traefik via NodePort, which at least has worked consistently. The problem then was that Traefik wasn't acknowledging some ingresses -- it was rather vague as to why, but it seemed to be matching some ingresses on the non-TLS port and not on the TLS port, which seems contrary to everything I've read in the documentation. I've considered living in a cave and eating rodent droppings for nourishment as a viable alternative to further work with Traefik.

There's quite a bit left to do and learn, and I guess the pure exploration is going to continue off-the-clock for the foreseeable future while I focus on the day-to-day demands on the job. There's some stuff I'd like to finish up today, but given that it involves ingress I don't know how smoothly it'll go 😬

ndouglas commented 3 years ago

https://user-images.githubusercontent.com/1318579/134729305-257c40c5-f8dc-487c-bf11-d754076fac23.mov

@olivereri

ndouglas commented 3 years ago

I'm going to formally just accept that Traefik is going to handle HTTP and HTTPS on a single port. It's really thrown me for a loop, but okay.

I ditched the IngressRoute (Traefik CRD) approach for handling gRPC on the same port and just used an annotation to select the h2c protocol, and it worked on the first try 😕
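Concretely, the annotation route amounts to tagging the argocd-server Service so Traefik's Ingress provider speaks h2c to the backend; something like this (ports and selector mirror the stock install manifests, only the annotation is added):

    apiVersion: v1
    kind: Service
    metadata:
      name: argocd-server
      namespace: argocd
      annotations:
        # Traefik talks HTTP/2 cleartext (i.e. gRPC) to this backend
        traefik.ingress.kubernetes.io/service.serversscheme: h2c
    spec:
      selector:
        app.kubernetes.io/name: argocd-server
      ports:
        - name: https
          port: 443
          targetPort: 8080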

(screenshot from 2021-09-24)

I don't really drink, but if I did, this would be why.