department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Create Lagoon Architecture Documentation draft #13170

Closed dawnpruitt closed 1 year ago

dawnpruitt commented 1 year ago

Description

Ticket #13111 is the architecture diagram. The architecture documentation will be informed by the architecture diagram and by previous conversations we've had with Amazee.io about how we currently architect our projects around the existing network architecture. Additionally, this will be informed by Lagoon's limitations and verbal agreements to extend Lagoon beyond those limitations.

# ACs
- [x] Lagoon Production Documentation draft describing architecture is created
- [x] Draft provided to PO for review

Team

Please check the team(s) that will do this work.

olivereri commented 1 year ago

Lagoon Documentation

Requirements

Architecture and Design

Lagoon is a container-based application management platform built around a microservices architecture. The microservices approach breaks the platform down into individual containers or groups of containers, each responsible for specific tasks or functions. Each service can therefore be tested, updated, scaled, and inspected independently, which improves scalability, flexibility, and maintainability.

The Lagoon platform is divided into two major components: Core and Remote.

Kubernetes is provided to the CMS team and administered by VFS-Platform through AWS Elastic Kubernetes Service (EKS). The AWS network that EKS uses is segmented into four parts: Utility, Dev, Staging, and Prod. Utility can communicate directly with the others, but the Dev, Staging, and Prod segments are isolated from one another for security reasons.
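The reachability rule for the four segments can be written down as a small predicate. This is only an illustrative sketch: the function name is hypothetical, and it assumes that traffic between Utility and another segment is allowed in both directions, which the description above does not state explicitly.

```python
# Hypothetical model of the segment reachability rule; not real network config.
SEGMENTS = {"utility", "dev", "staging", "prod"}


def can_communicate(source: str, destination: str) -> bool:
    """Return True if the two network segments may talk to each other."""
    if source not in SEGMENTS or destination not in SEGMENTS:
        raise ValueError(f"unknown segment: {source!r} or {destination!r}")
    if source == destination:
        return True
    # Assumption: Utility reaches every other segment in both directions.
    # Dev, Staging, and Prod are isolated from one another.
    return "utility" in (source, destination)
```

Under this model, `can_communicate("utility", "prod")` is allowed while `can_communicate("dev", "staging")` is not.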

Because of this security consideration, and following Lagoon's best-practice recommendation, our implementation will consist of:

Architecture Diagram

Each instance of the CMS hosted on a Remote will connect to AWS services outside the EKS cluster:

This leverages AWS-provided management and redundancy rather than attempting to replicate that same functionality within Lagoon.

Each EKS cluster is connected to the internet through VA's Trusted Internet Connection (TIC), a set of network security and boundary devices that protects the VA internal network. This is important to note because heavy restrictions are applied to the ports and protocols used for bidirectional communication with external dependencies. Additionally, network inspection across the Open Systems Interconnection (OSI) layers adds significant latency and reduces bandwidth.

The end result is that application builds can take an unacceptable amount of time, especially considering that most of the traffic transiting the TIC will not differ between individual builds. We expect to solve this problem using Nexus Repository Manager. Nexus caches packages and other software dependencies between builds, minimizing the performance impact of the TIC.
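To illustrate why caching helps here, the sketch below models a pull-through cache in a few lines. It is a toy stand-in, not the Nexus API: the class, the `slow_upstream` function, and the package names are all assumptions made for the example.

```python
from typing import Callable, Dict


class PullThroughCache:
    """Toy stand-in for a caching repository proxy (not the Nexus API)."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]) -> None:
        self._fetch_upstream = fetch_upstream
        self._store: Dict[str, bytes] = {}
        self.upstream_requests = 0  # how many fetches crossed the slow link

    def get(self, package: str) -> bytes:
        # Only go upstream the first time a package is requested.
        if package not in self._store:
            self.upstream_requests += 1
            self._store[package] = self._fetch_upstream(package)
        return self._store[package]


def slow_upstream(package: str) -> bytes:
    # Placeholder for a download that would transit the TIC.
    return f"contents of {package}".encode()


cache = PullThroughCache(slow_upstream)
for _build in range(3):
    # Illustrative package names; every build requests the same set.
    for package in ("drupal/core", "symfony/console"):
        cache.get(package)
```

In this model, three builds requesting the same two packages make only two upstream requests in total; every later request is served from the cache.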

Assumptions

Requires Amazee.io Support and/or Software Development

Requires DSVA Platform Operations Support:

Local Development

DevOps Engineer

Pygmy is a Docker-based tool that simplifies local development environments for web applications such as Drupal. Pygmy can be used in conjunction with Lagoon to provide a local development environment that closely matches the production environment in Lagoon.

Pygmy depends on the existence of the files below. Importantly, these files, together with .lagoon.yml, are exactly what a Lagoon Remote will deploy to run an application.

- docker-compose.yml
- .lagoon/cli.dockerfile
- .lagoon/nginx.dockerfile
- .lagoon/php.dockerfile
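A quick pre-flight check for these files could look like the sketch below. The function name and the idea of running such a check are assumptions for illustration; neither Pygmy nor Lagoon is claimed to ship this.

```python
from pathlib import Path
from typing import List

# Files named in this document; .lagoon.yml is the addition needed by a Lagoon Remote.
REQUIRED_FILES = [
    "docker-compose.yml",
    ".lagoon/cli.dockerfile",
    ".lagoon/nginx.dockerfile",
    ".lagoon/php.dockerfile",
    ".lagoon.yml",
]


def missing_files(project_root: str) -> List[str]:
    """Return the required files that are absent under project_root."""
    root = Path(project_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling `missing_files(".")` from the repository root would return an empty list when everything is in place, and the names of any absent files otherwise.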

Pygmy and Lagoon both use Docker Compose, a multi-container definition of an application, to deploy and run an application. This makes Pygmy a powerful tool for verifying an application will run as expected in production; local container images, package versions, services, and configuration exactly match those deployed to production. If it runs locally, it will run on production.

Drupal Engineer

Lagoon will have no impact on CMS Engineers using Windows, Linux, or Mac (whether Intel or ARM), unless we want it to. The usage of Lagoon does not obviate the use of existing local development tools, e.g. DDEV.

It is actually preferable for most engineers to continue using DDEV for local development rather than Pygmy; outside of Pygmy's effort to enforce parity between local development and the production environment, there is no meaningful benefit to using Pygmy, and it would require a substantial migration process and retraining the team.

Deployment Process

The deployment process will closely resemble the current Build, Release, and Deploy (BRD) pipeline. The major differences are that Jenkins will no longer execute tasks and those tasks will no longer be defined in Ansible.

Instead, Lagoon Core will receive webhook requests, and build and deployment tasks will be carried out by Lagoon Remotes.

The overall deployment process is broken into two parts:

  1. Commits (chiefly, merged pull requests) to the repository's main branch are deployed and tested on the Staging Lagoon Remote.
  2. Production Releases and Deployments are scheduled and triggered by GitHub Actions (GHA). GHA runs a workflow that discovers the last commit successfully deployed to Staging, then promotes that commit to Production by reusing the images already built for Staging and running post-rollout tasks.
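The "discover the last successfully deployed commit" step in part 2 reduces to a simple selection over deployment history. The sketch below shows only that selection logic. The `Deployment` record, its fields, and the sample history are hypothetical; how the real workflow would obtain this data from Lagoon or GitHub is not specified here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one Staging deployment."""
    commit: str
    finished_at: int  # Unix timestamp
    succeeded: bool


def last_successful_commit(deployments: Iterable[Deployment]) -> Optional[str]:
    """Return the commit of the most recent successful deployment, if any."""
    successful = [d for d in deployments if d.succeeded]
    if not successful:
        return None
    return max(successful, key=lambda d: d.finished_at).commit


# Made-up history: the newest deployment failed, so it must be skipped.
staging_history = [
    Deployment("a1b2c3d", 100, True),
    Deployment("d4e5f6a", 200, True),
    Deployment("0f9e8d7", 300, False),
]
promotion_candidate = last_successful_commit(staging_history)
```

Here the newest deployment failed, so the candidate for promotion is the commit from the deployment before it.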

PR Merge to Deployment

Deployment on Production

Disclaimer: There currently is no Lagoon functionality that mimics or replicates our current Production Deployment system.

Requires Amazee.io Support and/or Software Development:

Todo

Generally, all of the jobs or tasks handled by Jenkins will need to be replicated in Lagoon. Many of them may be covered by what Lagoon does by default, e.g. Build and Deploy. Of particular note are the runtime tasks listed in the issue linked below.

Each task category will need to be reviewed and a determination made on whether:

All Jenkins jobs that support Build, Deploy, and Runtime CMS tasks.

Access

This section will describe how we access various components of the new architecture.

Lagoon Dashboard

The Lagoon Dashboard is the web-based interface for managing projects, environments, and deployments on the Lagoon platform. It provides an overview of the current state of our applications, as well as access to logs, metrics, and other essential information. This will likely be used mostly by DevOps engineers, but should be accessible by every member of the team.

Lagoon CLI

The Lagoon CLI allows engineers to access Lagoon environments through a command-line or "shell" interface and run tasks, introspect the running environments, etc. We anticipate that this will use the SOCKS proxy that engineers are trained on and configure as part of their onboarding.

Drupal

SOCKS

As with the current architecture, a SOCKS proxy can be used as a secure connection to Drupal environments running within Lagoon. We do not anticipate any changes to configuration or documentation.

GFE

Accessing Drupal within Lagoon environments from within the VA network, e.g. GFE with a VPN, should similarly remain unchanged.

Service Containers

Other services, e.g. Keycloak and Harbor, are primarily of interest to DevOps engineers. To the extent that they require any sort of direct human interaction, this will likely be accomplished via Kubernetes administration tools like Lens, EKS management, or other official tools. We do not anticipate a need for training, documentation, or discovery on these aspects of access.

Requires Amazee.io Support and/or Software Development:

Requires DSVA Platform Operations Support:

Roll Out Plan

The following describes a rough roll-out plan for Lagoon. Ideally, this should start with migrating CMS-Test infrastructure from BRD to Lagoon. This will serve as an excellent testbed for the eventual migration of CMS Prod infrastructure to Lagoon, and the lessons learned during this exercise will inform that migration. Lastly, many of these tasks are captured in the overall Lagoon implementation GitHub Project and listed below.

Related Issues:

ndouglas commented 1 year ago

I had this for the deployment flow:

PR Merge to Deployment on Staging

Deployment on Production

Does that match reality/our understanding more-or-less? I didn't want to change what you had, because I'm not completely confident of my understanding.

ndouglas commented 1 year ago

I fluffed up some sections that seemed a little terse to me, fixed a couple typos, and fleshed out the sections on Access -- I think I know what you meant here, so I went with it because it seemed like a boring section to write 🙂 If I was completely off the mark then feel free to delete this.

olivereri commented 1 year ago

Deployment on Production

Disclaimer: There currently is no Lagoon functionality that mimics or replicates our current Production Deployment system.

Note: For this deployment pipeline design to work as intended, Lagoon must be extended. The issues listed below must be resolved before this is possible: