dawnpruitt commented 1 year ago

Description

Ticket #13111 is the architecture diagram. The architecture documentation will be informed by the architecture diagram and by previous conversations we've had with Amazee.io about how we currently architect our projects around the existing network architecture. Additionally, this will be informed by Lagoon's limitations and verbal agreements to extend Lagoon beyond those limitations.

# ACs
- [x] Lagoon Production Documentation draft describing architecture is created
- [x] Draft provided to PO for review

Team

Please check the team(s) that will do this work.

[x] CMS Team
[ ] Public Websites
[ ] Facilities
[ ] User support

olivereri commented 1 year ago

Lagoon Documentation

Requirements

Kubernetes v1.21 (Platform managed EKS clusters are v1.23)
Familiarity with Helm, Helm Charts, and kubectl
Ingress-nginx as recommended Ingress Controller.
Cert Manager
StorageClasses (ReadWriteOnce (RWO) as default, ReadWriteMany (RWX) for persistent types)

Architecture and Design

Lagoon is a container-based application management platform built around a microservices architecture. The microservices approach involves breaking the platform down into individual containers or groups of containers, each responsible for specific tasks or functions. This approach enables each service to be tested, updated, scaled, and inspected independently, which improves scalability, flexibility, and maintainability.

The Lagoon platform is divided into two major components: Core and Remote.

The Core component is responsible for managing critical services such as the API, authentication, and external communication. The Core component can serve multiple Remote instances (and will, in our case). Lagoon recommends installing Core on a separate Kubernetes cluster from the Remote instances. This separation ensures better isolation and stability of the core functionality.
The Remote component focuses on services related to provisioning, deployment, and hosting applications (e.g. Drupal). It manages the underlying infrastructure and resources required to run applications in various environments, such as development, staging, and production. By maintaining a clear distinction between Core and Remote components, Lagoon can provide a streamlined and efficient platform for managing containerized applications, ensuring that each component can evolve and scale independently.

Kubernetes is provided to the CMS team and administered by VFS-Platform through AWS Elastic Kubernetes Service (EKS). The AWS network that EKS uses is segmented into four parts: Utility, Dev, Staging, and Prod. Utility can communicate directly with the others, but the Dev, Staging, and Prod segments are isolated from one another for security reasons.

Because of the above security consideration and the best-practice recommendation from Lagoon, our implementation will consist of:

One Lagoon Core (including some additional third-party open source services, like Harbor, K8up, and Keycloak) installed on the Utility EKS Cluster.
Three Lagoon Remotes for application deployment and hosting will be installed on:
- Dev EKS Cluster
- Staging EKS Cluster
- Prod EKS Cluster
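
The segmentation rules and component placement above can be sketched as a small reachability check. This is only an illustrative model: the names `SEGMENTS`, `PLACEMENT`, `can_communicate`, and `core_reaches_all_remotes` are hypothetical, and the sketch assumes that Utility's connectivity to the other segments is bidirectional.

```python
# Illustrative model of the network segments described above.
# All names here are hypothetical; nothing in this block is a Lagoon API.

SEGMENTS = {"utility", "dev", "staging", "prod"}

# Where each Lagoon component is planned to be installed.
PLACEMENT = {
    "lagoon-core": "utility",
    "lagoon-remote-dev": "dev",
    "lagoon-remote-staging": "staging",
    "lagoon-remote-prod": "prod",
}


def can_communicate(a: str, b: str) -> bool:
    """Return True if segments a and b may talk directly to each other."""
    if a not in SEGMENTS or b not in SEGMENTS:
        raise ValueError(f"unknown segment: {a!r} or {b!r}")
    # Traffic within a segment is allowed, and Utility reaches every segment
    # (assumed bidirectional). Dev, Staging, and Prod are isolated from one another.
    return a == b or "utility" in (a, b)


def core_reaches_all_remotes() -> bool:
    """Check that Lagoon Core can reach every Remote from its segment."""
    core = PLACEMENT["lagoon-core"]
    return all(
        can_communicate(core, segment)
        for name, segment in PLACEMENT.items()
        if name != "lagoon-core"
    )
```

Under this model, placing Core anywhere other than Utility would leave it unable to reach at least two of the three Remotes, which is the reason for the layout above.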

Architecture Diagram

Each instance of the CMS hosted on a Remote will connect to AWS services outside the EKS cluster:

Relational Database Service (RDS), running MariaDB for general database needs
Elasticache, running Memcached for in-memory caching
Simple Storage Service (S3), for database and uploaded file backups

This leverages AWS-provided management and redundancy rather than attempting to replicate that same functionality within Lagoon.

Each EKS Cluster is connected to the internet through VA's Trusted Internet Connection (TIC), a set of network security and boundary devices that protects the VA internal network. This matters because the TIC heavily restricts the ports and protocols available for bidirectional communication with external dependencies. Additionally, network inspection across the Open Systems Interconnection (OSI) layers adds significant latency and reduces bandwidth.

The end result is that application builds can take an unacceptable amount of time, especially considering that most of the traffic transiting the TIC will not differ between individual builds. We expect to solve this problem using Nexus Repository Manager. Nexus caches packages and other software dependencies between builds, minimizing the performance impact of the TIC.
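
As a rough illustration of why caching helps here, the toy model below counts how many dependency fetches have to cross the TIC when builds go through a pull-through cache. It does not use any Nexus API; `PullThroughCache`, `run_build`, and the `fetch_upstream` callback are hypothetical names introduced only for this sketch.

```python
# Toy model of a pull-through cache such as the one Nexus is expected to provide.
# The class and its fields are hypothetical and exist only for illustration.


class PullThroughCache:
    def __init__(self, fetch_upstream):
        self._fetch_upstream = fetch_upstream  # slow path that crosses the TIC
        self._store = {}
        self.upstream_fetches = 0

    def get(self, package: str, version: str) -> bytes:
        """Return the artifact, fetching upstream only on a cache miss."""
        key = (package, version)
        if key not in self._store:
            # Cache miss: fetch once through the TIC and keep the result.
            self._store[key] = self._fetch_upstream(package, version)
            self.upstream_fetches += 1
        return self._store[key]


def run_build(cache: PullThroughCache, dependencies) -> None:
    """Resolve every (package, version) pair of a build through the cache."""
    for package, version in dependencies:
        cache.get(package, version)
```

In this model, a second build with an unchanged dependency list triggers no upstream fetches at all, which matches the observation that most traffic transiting the TIC does not differ between builds.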

Assumptions

Nexus will be available for use with Lagoon Builds.
Current Deployment method/model on BRD can be replicated on Lagoon.

Requires Amazee.io Support and/or Software Development

Requires DSVA Platform Operations Support:

Local Development

DevOps Engineer

Pygmy is a Docker-based development environment for Drupal that simplifies local development of web applications. Pygmy can be used in conjunction with Lagoon to provide a local development environment that closely matches the production environment in Lagoon.

Pygmy depends on the existence of the files below. Importantly, these files, with the addition of .lagoon.yml, are exactly what a Lagoon Remote will deploy to run an application.

- docker-compose.yml
- .lagoon/cli.dockerfile
- .lagoon/nginx.dockerfile
- .lagoon/php.dockerfile

Pygmy and Lagoon both use Docker Compose, a multi-container definition of an application, to deploy and run an application. This makes Pygmy a powerful tool for verifying an application will run as expected in production; local container images, package versions, services, and configuration exactly match those deployed to production. If it runs locally, it will run on production.
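
A quick way to confirm that a checkout contains the files listed above is a small presence check. This is a minimal sketch, not part of Pygmy or Lagoon; the function name `missing_files` and the constants are hypothetical.

```python
from pathlib import Path

# Files named in this document; .lagoon.yml is the extra file a Lagoon Remote needs.
PYGMY_FILES = [
    "docker-compose.yml",
    ".lagoon/cli.dockerfile",
    ".lagoon/nginx.dockerfile",
    ".lagoon/php.dockerfile",
]
LAGOON_FILES = PYGMY_FILES + [".lagoon.yml"]


def missing_files(repo_root: str, required=LAGOON_FILES) -> list[str]:
    """Return the required files that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_files(".")` from the repository root returns an empty list when everything a Lagoon Remote needs is present. Passing `PYGMY_FILES` as the second argument checks only the local-development subset.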

Drupal Engineer

Lagoon will have no impact for CMS Engineers on Windows, Linux, or Mac (whether Intel or ARM), unless we want it to. The usage of Lagoon does not obviate the use of existing local development tools, i.e. DDEV.

It is actually preferable for most engineers to continue using DDEV for local development rather than Pygmy; outside of Pygmy's effort to enforce parity between local development and the production environment, there is no meaningful benefit to using Pygmy, and it would require a substantial migration process and retraining the team.

Deployment Process

The deploy process will be very much like the current Build Release and Deploy (BRD) pipeline. The major differences are that Jenkins will no longer execute tasks and those tasks will no longer be defined in Ansible.

Instead, Lagoon Core will receive webhook requests, and build and deployment tasks will then be carried out by Lagoon Remotes.

The overall deployment process is broken into two parts:

Commits (chiefly, merged pull requests) to the repository's main branch are deployed and tested on the Staging Lagoon Remote.
Production Releases and Deployments are scheduled and triggered by GitHub Actions (GHA). GHA runs a workflow that discovers the last commit successfully deployed to Staging. It then promotes that commit to Production by reusing the images already built on Staging and running post-rollout tasks.

PR Merge to Deployment

An engineer merges an approved PR to main.
This event triggers GitHub to dispatch a webhook request to Lagoon Core, in the Utility cluster.
Lagoon Core receives the request and instructs the Lagoon Remote instance in the Staging cluster to build an image.
The image is conveyed
Pre- and post-rollout scripts configure Drupal and the database, and run tests.
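
To make the webhook step more concrete, the sketch below extracts the information a deploy request needs from a GitHub push event. Lagoon Core has its own webhook handling, so this is not how Lagoon implements it. It only illustrates the two payload fields involved (`ref` and `after`, both part of GitHub's push event). The function name `deploy_request_from_push` is hypothetical.

```python
import json

MAIN_REF = "refs/heads/main"


def deploy_request_from_push(body: str):
    """Return (branch, commit_sha) if the push targets main, else None."""
    event = json.loads(body)
    if event.get("ref") != MAIN_REF:
        # Pushes to other branches do not trigger a Staging deployment here.
        return None
    # "after" holds the SHA of the most recent commit on the ref after the push.
    return ("main", event["after"])
```

A merged PR produces a push event on `refs/heads/main`, so the returned commit SHA identifies exactly the code that the Staging Remote would be asked to build.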

Deployment on Production

A Lagoon scheduled Custom Task triggers a promotion from the Staging environment to Production via curl.
The promotion workflow determines the last successfully deployed commit from the Staging environment.
It checks out the Git branch main in order to load the .lagoon.yml and docker-compose.yml files (Lagoon still needs these in order to fully work).
It creates all Kubernetes/OpenShift objects for the services defined in docker-compose.yml, but with LAGOON_GIT_BRANCH=main as an environment variable.
It copies the last successfully deployed images from the Staging environment and uses them (instead of building images or tagging them from upstream).
Run all post-rollout tasks like a normal deployment.
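
The "last successfully deployed commit" selection in the steps above can be sketched as follows. The `Deployment` record and its fields are hypothetical; how deployment history would actually be obtained from Lagoon is not specified in this document and is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one Staging deployment."""

    commit: str
    finished_at: int  # Unix timestamp of when the deployment finished
    succeeded: bool


def last_successful_commit(deployments: list[Deployment]) -> Optional[str]:
    """Return the commit of the most recent successful Staging deployment."""
    successful = [d for d in deployments if d.succeeded]
    if not successful:
        # Nothing to promote; the caller should abort the production deploy.
        return None
    return max(successful, key=lambda d: d.finished_at).commit
```

Note that a failed deployment that is newer than the last success is skipped rather than blocking the promotion, which is the behavior the steps above describe.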

Disclaimer: There currently is no Lagoon functionality that mimics or replicates our current Production Deployment system.

Requires Amazee.io Support and/or Software Development:

Todo

Generally all of the Jobs or Tasks handled by Jenkins will need to be replicated in Lagoon. Many of the tasks may be covered by what Lagoon does by default e.g. Build and Deploy. Of particular note will be the runtime tasks listed in the issue linked below.

Each task category will need to be reviewed and a determination made on whether:

Lagoon already does it.
It's no longer needed.
Lagoon Workflows will need to be created to mimic the function.
Other supporting software (e.g. GitHub Actions) will need to be used to mimic the function.

All Jenkins jobs that support Build, Deploy, and Runtime CMS tasks.

Access

This section will describe how we access various components of the new architecture.

Lagoon Dashboard

The Lagoon Dashboard is the web-based interface for managing projects, environments, and deployments on the Lagoon platform. It provides an overview of the current state of our applications, as well as access to logs, metrics, and other essential information. This will likely be used mostly by DevOps engineers, but should be accessible by every member of the team.

Lagoon CLI

The Lagoon CLI allows engineers to access Lagoon environments through a command-line or "shell" interface and run tasks, introspect the running environments, etc. We anticipate that this will use the SOCKS proxy that engineers are trained on and configure as part of their onboarding.

Drupal

SOCKS

As with the current architecture, a SOCKS proxy can be used as a secure connection to Drupal environments running within Lagoon. We do not anticipate any changes to configuration or documentation.

GFE

Accessing Drupal within Lagoon environments from within the VA network, e.g. GFE with a VPN, should similarly remain unchanged.

Service Containers

Other services, e.g. Keycloak and Harbor, are primarily only of interest to DevOps engineers. To the extent that they require any sort of direct human interaction, this will likely be accomplished via Kubernetes administration tools like Lens, EKS management, or other official tools. We do not anticipate a need for training, documentation, or discovery on these aspects of access.

Requires Amazee.io Support and/or Software Development:

Requires DSVA Platform Operations Support:

https://github.com/department-of-veterans-affairs/va.gov-cms/issues/13268

Roll Out Plan

The following is a rough roll-out plan for Lagoon. Ideally this should start with migrating CMS-Test infrastructure from BRD to Lagoon. This will serve as an excellent testbed for the eventual migration of CMS Prod infrastructure to Lagoon, and the lessons learned during this exercise will inform that migration. Lastly, many of these tasks are captured in the overall Lagoon implementation GitHub Project and listed below.

Install Lagoon per Architecture Diagram.
- Start with CMS-Test infrastructure first to validate Database and Memcache connections.
Create Nexus repository for Drupal
Configure Webhook between Github and Lagoon for CMS-Test repo.
Configure Lagoon to Deploy CMS-Test
- Validate network connectivity and routes
- Update DNS to point lagoon.cms.va.gov, or something like it, to Lagoon hosted CMS-Test application
- Validate Database and Memcache connections
Migrate applicable Jenkins Jobs to CMS-Test Lagoon infrastructure
Create workflows for Prod release marking and promotion of Staging to Prod.
- Validate production deploy works as intended.
Document journey to transition CMS-Test to Lagoon
- Reuse documentation to create a cutover and rollback plan for CMS.
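
The "validate Database and Memcache connections" steps in the plan above could start with a plain TCP reachability check before any application-level testing. This is a minimal sketch using only the Python standard library. The function names are hypothetical, and the check confirms only that a port accepts connections, not that credentials or queries work.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_endpoints(endpoints: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each named endpoint to whether it accepted a TCP connection."""
    return {
        name: can_connect(host, port)
        for name, (host, port) in endpoints.items()
    }
```

For example, `check_endpoints({"mariadb": (db_host, 3306), "memcached": (cache_host, 11211)})` uses the default MariaDB and Memcached ports. `db_host` and `cache_host` are placeholders for the actual RDS and Elasticache endpoints, which are not given here.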

Related Issues:

ndouglas commented 1 year ago

I had this for the deployment flow:

PR Merge to Deployment on Staging

A developer merges an approved PR into main.
This triggers GitHub to dispatch a webhook request to Lagoon Core, in the Utility cluster. This event includes the identifier of the latest commit to the main branch.
Lagoon Core receives the request and instructs the Staging Lagoon Remote to build an image based on that commit.
The Staging Lagoon Remote instance:
- builds a new container image containing the version of the codebase corresponding to that commit, including packages and dependencies as cached by Nexus Repository Manager,
- passes that image to the Harbor instance running alongside the Lagoon Core deployment,
- deploys that image internally as a test environment (including pre/post-rollout tasks),
- runs the full battery of automated tests against the image, and
- informs GitHub of each test's pass/fail status.

Deployment on Production

A scheduled GitHub Actions workflow dispatches a request to Lagoon Core to perform the daily deploy.
Lagoon Core queries GitHub for the last "passed" commit on main; that is, the most recent commit that has been successfully built and has passed the full battery of automated tests.
Lagoon Core instructs the Production Lagoon Remote to build images corresponding to that commit and promote them into the actual production environment, i.e. the one serving all traffic to editors.
Production Lagoon Remote requests images corresponding to that commit, which it can find already built in Harbor, and creates a new deployment for those images.
When the deployment is completed, traffic is redirected to that new deployment and the previous version is removed from service.

Does that match reality/our understanding more-or-less? I didn't want to change what you had, because I'm not completely confident of my understanding.

ndouglas commented 1 year ago

I fluffed up some sections that seemed a little terse to me, fixed a couple typos, and fleshed out the sections on Access -- I think I know what you meant here, so I went with it because it seemed like a boring section to write 🙂 If I was completely off the mark then feel free to delete this.

olivereri commented 1 year ago

Deployment on Production

A Lagoon scheduled Custom Task triggers a promotion from the Staging environment to Production via curl.
The promotion workflow determines the last successfully deployed commit from the Staging environment.
It checks out the Git branch main in order to load the .lagoon.yml and docker-compose.yml files (Lagoon still needs these in order to fully work).
It creates all Kubernetes/OpenShift objects for the services defined in docker-compose.yml, but with LAGOON_GIT_BRANCH=main as an environment variable.
It copies the last successfully deployed images from the Staging environment and uses them (instead of building images or tagging them from upstream).
Run all post-rollout tasks like a normal deployment.

Disclaimer: There currently is no Lagoon functionality that mimics or replicates our current Production Deployment system.

Note: For this deployment pipeline design to work as intended Lagoon must be extended. The below listed issues must be resolved before this is possible:

department-of-veterans-affairs / va.gov-cms

Create Lagoon Architecture Documentation draft #13170
