department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Evaluate Lagoon/ArgoCD as a replacement for the custom EKS environment #6673

Closed: cweagans closed 2 years ago

cweagans commented 2 years ago

Background

Taking on the development and maintenance of a Kubernetes application delivery platform is not a small task. There are certainly a number of open source components out there that we could glue together to do a fair amount of work, but I submit that we should not do that. There are many open-source delivery platforms out there that do 90% of what we’d eventually want and adopting one of these platforms could save us a significant amount of time and money.

The Platform CMS team has spent some time demoing Lagoon, and we would like to investigate adopting it instead of continuing with the Argo-based EKS platform that is currently in progress. Michael Schmid, CTO of Amazee.io (the company behind Lagoon), has been extremely helpful in getting our team up to speed on Lagoon and its capabilities, and has offered to do a follow-up Q&A call with the wider team as well.

There are some foundational issues with the current EKS + Argo setup that have been large impediments to our team. I’m certain that the devops team can resolve these items given time to do so, but these are good examples of things that we’ll need to spend time resolving as our Kubernetes platform gets more complicated. That is, these issues will get larger and more frequent as more and more moving parts are added to the cluster. So far, these are the issues that we’ve run into:

  1. The Argo UI is completely unusable. This was discussed in a Slack thread - the core of the issue is that the UI is constantly refreshing due to some resources going out of sync. There are some other issues related to not using HTTPS for the UI as well.
  2. Our team is completely unable to directly access the Kubernetes API. This is because Pomerium caused some pretty significant issues (Github rate limiting + Eric mentioned some OOM-type issues) and is now disabled. This is important because without this access, we cannot debug deployment failures, application errors, etc. We have limited access via the Kubernetes dashboard, but that’s definitely a suboptimal solution. The dashboard also doesn’t currently work for the ops team, so we don’t have a lot of confidence that it will continue to work for our needs.
  3. Generally speaking, we don’t have a lot in the way of deployment patterns/standards/etc. The result of this is that nearly the entire Kubernetes ecosystem of tools is available, but we have to find the right ones, assemble them into something workable, and then use it. Re-inventing this particular wheel can be fun, but it’s also somewhat dangerous in that every team on this cluster will have their own way of doing things (including us). A good example of this is horizontal pod autoscaling: if we want to dynamically scale up the application containers, we have to manually define when and how that happens, determine where to get the metrics that drive that process, figure out how to get those metrics programmatically (which could be somewhat time consuming), and then wire them up to the horizontal pod autoscaler -- and that’s if that component is already installed and running on the cluster!
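To make the autoscaling example concrete, here is a minimal sketch of the replica calculation that the Kubernetes horizontal pod autoscaler documents: `desired = ceil(current * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance (0.1 by default). The function name, parameter names, and the min/max defaults are illustrative assumptions, not anything configured on our cluster; collecting the metric itself is the part we would still have to own.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of the documented HPA scaling rule (names are hypothetical).

    current_metric / target_metric could be e.g. average CPU utilization
    or requests per second per pod; where they come from is up to us.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods at 90% average utilization with a 60% target.
    print(desired_replicas(4, 90, 60))
```

Even with the formula settled, the thresholds, bounds, and metric source are per-application decisions, which is where a platform with an established pattern saves time.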

To be clear: I have full confidence that the devops team can resolve these issues and other similar issues that will come up down the road. It’s certainly not a question of ability. I’m questioning whether or not we should build a Kubernetes platform, rather than whether or not we’re able to do so. Having worked at a number of organizations that have gone through a Kubernetes migration (including one whose sole offering was a Kubernetes-based managed hosting platform for web applications), my standpoint is that we’re going to spend a ton of effort just getting to feature parity with Lagoon. If I had to make a wild guess, we will spend on the order of a year getting to the kind of tightly integrated experience that Lagoon offers (with a 2-3 person team working on it full time, which I don’t believe is representative of reality at the VA currently -- the devops team has other responsibilities). In the same way that we wouldn’t build our own CMS (and would instead use a more-or-less “canned” solution like Drupal), we shouldn’t spend the time building our own bespoke application delivery platform -- at least, not without evaluating the already-working solutions.

It’s worth mentioning that Lagoon is the hosting platform that serves the entire Australian government (https://www.govcms.gov.au/). The platform has a demonstrated track record of resiliency under very high load. You can read more about how Lagoon has helped GovCMS stay online during heavy COVID-related traffic here.

The Platform CMS team put together a quick wishlist of features that we’d like to see from an application delivery platform. This list is not comprehensive, but does represent a fair amount of the functionality that we would need in order to remain productive. Details of each item can be found in the collapsed section underneath the table, including any platform-specific notes.

Tool comparison

| Feature | Argo | Lagoon |
| --- | --- | --- |
| Code pipeline (Github → running environment) | | |
| Automatically deploy from source repository | ⚠️ | ✅ |
| Don’t require any particular tagging/release process (should be team-defined) | ❌ | ✅ |
| Require minimal Kubernetes knowledge to use regularly | ❌ | ✅ |
| Manual deployment triggering | ❌ | ✅ |
| Blue/green deployment | 🚧 | ⚠️ |
| Continuous deployment | 🚧 | 🚧 |
| Zero-downtime deployment | ❌ | 🚧 |
| Deployment status feedback in Github | 🚧 | ⚠️ |
| Workflow features (actions that can be taken on a running environment) | | |
| Pre/post deploy scripts | 🚧 | ✅ |
| Failed deployment handling | 🚧 | 🚧 |
| Sync database/files/etc between environments | ❌ | ✅ |
| Scheduled/manual task execution through a UI for a given environment (cron, cache clear, etc) | ❌ | ✅ |
| Scheduled/manual task execution through a CLI for a given environment (replaces any current need for SSH + Drush) | ❌ | ✅ |
| Application reliability and security | | |
| Deployment autoscaling | 🚧 | ✅ |
| Environment idling | ❌ | ✅ |
| Built-in log aggregation (including relaying to external destinations if necessary) | 🚧 | ✅ |
| Authenticate against Github (both CLI and UI) | ⚠️ | ✅ |
| Container security scanning | ❌ | ✅ |
| Developer quality-of-life features (features that don’t directly impact availability or deployments, but can help developers move faster) | | |
| Blackfire support | ❌ | 🚧 |
| Xdebug support | ❌ | ✅ |
| Lando integration (pull database/files from target environment; re-use production configuration to build local environment) | ❌ | ✅ |
| Fast syncing between environments | ❌ | 🚧 |
| Real-time log streaming during deployments | ❌ | 🚧 |

Feature details

- Automatically deploy from source repository:
  - Argo: in order to do this, we have to clone the devops repo, update the manifest, push a commit, and wait for Argo to complete. This effectively isolates the deployment process from the application and is a fairly clunky and error-prone experience.
- Blue/green deployment:
  - Argo: present, but assembly required
  - Lagoon: mostly exists, but the exact logic for determining when a deployment is ready must be defined by the application developer.
- Continuous deployment:
  - This requires work on the CMS side of things to be feasible.
  - This is complicated by Argo somewhat in that real-time feedback for deployments is really necessary to do continuous deployment safely.
- Zero-downtime deployment:
  - Argo does not appear to support this at all.
  - Lagoon does support this through the environment promotion functionality but we have to wire up some of the specifics for when/how traffic is switched without interruption.
- Deployment status in Github:
  - Argo: We may be able to wire this up somehow, but it will take a significant amount of investigation
  - Lagoon: Indirectly supported through pre/post deploy hooks. I have an existing script that can push deployments + PR checks via the github API that we can re-use.
- Pre/post-deploy hooks:
  - Argo: Not directly supported in a way that’s useful to the CMS team. If we want it to work, we’ll have to research it, wire it up, and build it.
- Failed deployment handling:
  - Argo: There are some metrics-driven things that Argo can do, but it’s somewhat limited.
  - Lagoon: We can abort a deployment through a pre-deploy hook if we detect something is wrong.
- Sync db/files/etc between environments:
  - Argo: This is not something that Argo handles.
  - Lagoon: the machinery is already there for Drupal sites. We just have to define how/when we want it to work.
- Scheduled/manual task exec through a UI:
  - Argo: Nope. It’s git-driven.
  - Lagoon: There’s a dedicated place in the UI to put these items. They’re defined in code in the same place that pre/post deploy hooks are defined.
- Scheduled/manual task exec through a CLI:
  - Argo: Nope. It’s git-driven.
  - Lagoon: There’s a CLI that re-uses the same config that exposes the UI tasks. They’re defined in code in the same place that pre/post deploy hooks are defined.
- Deployment autoscaling:
  - Argo: If we want it, we have to own it. It’s not built in and we have no point of reference for how to do it “correctly” in the context of the VA.
- Environment idling:
  - This is a cost saving measure, but it’s very handy to free up capacity during traffic surges.
  - Argo: If we want it, we have to own it. It’s not built in and we have no point of reference for how to do it “correctly” in the context of the VA.
- Built-in log aggregation/forwarding:
  - Argo: homegrown. Requires development work to get it to cooperate sometimes.
- Auth against Github (both UI and CLI):
  - Argo: mostly works for the UI, but CLI doesn’t work at all due to Pomerium being disabled.
  - Lagoon: works via the built-in Keycloak instance.
- Container security scanning:
  - Argo: not part of Argo at all.
- Blackfire support:
  - Argo: Argo doesn’t handle this at all (out of scope)
  - Lagoon: can be built in the same way that xdebug is supported. Very easy PR to add this.
- Xdebug support:
  - Argo: Argo doesn’t handle this at all (out of scope)
- Lando integration:
  - Argo: out of scope
  - Lagoon: Built in and shares configuration between local and production environments for seamless local to live deployments
- Fast syncing between environments:
  - Argo: environment sync isn’t handled
  - Lagoon: there are pre/post deploy hooks. To make this fast, we’d need to do a little work around e.g. cloning an RDS database using the AWS API, rather than doing a mysql dump and restore
- Real-time log streaming during deployments:
  - Argo: Not supported due to no Kube API access
  - Lagoon: logs are reported to the UI, but not in real time. Real-time logs are available through the Lagoon CLI + the Kube API.
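For the "Deployment status in Github" item, a rough sketch of what a post-deploy hook could send is below. This is not the existing script mentioned above; it only illustrates the GitHub commit status endpoint (`POST /repos/{owner}/{repo}/statuses/{sha}`). The function name, the `deploy/lagoon` context string, and the example repo, SHA, and token values are all hypothetical.

```python
import json
import urllib.request

# States accepted by the GitHub commit status API.
VALID_STATES = {"error", "failure", "pending", "success"}


def build_status_request(owner, repo, sha, state, token,
                         description="", context="deploy/lagoon",
                         target_url=None):
    """Build (but do not send) a commit status request.

    All argument values are supplied by the caller; the default
    context name is an illustrative assumption.
    """
    if state not in VALID_STATES:
        raise ValueError(f"unsupported state: {state!r}")

    payload = {"state": state, "description": description, "context": context}
    if target_url:
        # Link shown next to the status, e.g. the deployment's log page.
        payload["target_url"] = target_url

    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; a real hook would read these from its environment.
    req = build_status_request("example-org", "example-repo", "abc123",
                               "success", token="PLACEHOLDER",
                               description="Deployment finished")
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would actually send it.
```

A pre-deploy hook could send `pending` and the post-deploy hook `success` or `failure`, which would give PR authors feedback without leaving Github.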

Because there is such a wide gap between where we’re at with the Argo-based platform and what Lagoon offers, we’d like to propose doing a timeboxed POC of some very basic functionality of Lagoon. If this is successful, we can extend that POC to include wiring up some of the Lagoon workflow features in a way that the CMS team would use on a day-to-day basis. After this, we can do a shared demo/Q&A session with the devops team and the Amazee.io team to answer any lingering questions and if everyone is happy with the outcome, we can move toward production use.

tl;dr: I believe we can save 12-36 person-months of work by abandoning our current EKS/Argo environment, adopting Lagoon, and getting a support contract with Amazee.

User Story or Problem Statement

As an engineer, I would like to re-use open source application delivery platforms in lieu of rolling our own so that we can redirect that time and energy toward something else.

Acceptance Criteria

Possible tickets to create for this epic

CMS Team

Please leave only the team that will do this work selected. If you're not sure, it's fine to leave both selected.

ElijahLynn commented 2 years ago

Very nice writeup of this challenge @cweagans!

mchelen-gov commented 2 years ago

Risks to evaluate:

In addition to the POC, I recommend reaching out to Ops to check what tools or approaches exist (if any) in Platform for each of these features.