[Application Hosting and Deployment] Provide developers with tools to troubleshoot in EKS

jhouse-solvd commented 2 years ago

Product Outline

Application Hosting and Deployment using Container Orchestration (EKS)

High-Level User Story/ies

As a platform crew developer, I need to be able to shell into an application hosted in EKS for the purposes of debugging and troubleshooting.
As a platform crew developer, I need permission to manage Kube/ArgoCD resources for an application.

Hypothesis or Bet

If we make this change, we expect that Platform Crew developers can successfully connect to their applications running in EKS.
If we make this change, we expect that Platform Crew developers can troubleshoot their applications running in EKS.

Definition of done

What must be true in order for you to consider this epic complete?

Platform Crew developers have the permissions needed to connect to their applications running in EKS
Platform Crew developers can connect to their applications in EKS
There is documentation for Platform Crew developers explaining how to connect to applications in EKS for the purposes of troubleshooting

olivereri commented 2 years ago

3 main difficulties:

How do I best deploy a solution to EKS?
What steps or process provide the quickest change-to-deploy/verify loop?
What steps or process provide the quickest way to verify changes beyond what Argo CD provides in its UI logs?

1. How do I best deploy a solution to EKS?

This one is probably out of scope for the epic.

Specifically for my use case with Sorry-Cypress, I learned to effectively nest the chart I wanted to deploy. The easiest way was to create a chart in the Application Manifest repository that declares the upstream, publicly available chart as a dependency. This pattern wasn't immediately clear to me; I initially tried to commit the entirety of the upstream's charts.
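As a rough illustration of the nesting pattern described above, the sketch below renders a minimal Helm v3 `Chart.yaml` for a wrapper chart that lists an upstream chart under `dependencies`. The chart name, version, and repository URL are hypothetical placeholders, not the values actually used for Sorry-Cypress.

```python
def wrapper_chart_yaml(name, upstream_name, upstream_version, upstream_repo):
    """Render a minimal Helm v3 Chart.yaml for a wrapper chart whose only
    job is to pull in an upstream chart as a dependency."""
    lines = [
        "apiVersion: v2",  # Helm 3 chart format; dependencies live in Chart.yaml
        f"name: {name}",
        "description: Wrapper chart that pulls in an upstream chart",
        "type: application",
        "version: 0.1.0",  # version of the wrapper chart itself (placeholder)
        "dependencies:",
        f"  - name: {upstream_name}",
        f'    version: "{upstream_version}"',
        f"    repository: {upstream_repo}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    print(wrapper_chart_yaml(
        name="sorry-cypress-wrapper",
        upstream_name="sorry-cypress",
        upstream_version="1.0.0",
        upstream_repo="https://example.com/charts",
    ))
```

With a layout like this, only the small wrapper chart is committed. Overrides for the upstream chart go in the wrapper's `values.yaml` under a top-level key matching the dependency name, and `helm dependency update` fetches the upstream chart when rendering locally.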

Not everything is going to have charts. Does it make sense for a custom application with no Helm chart to create one? For example, would it make sense for the CMS team to create Helm charts, or just create the k8s resources explicitly in YAML files?

What are the deployment patterns for new applications that have Helm charts, and for those that do not?

2. Quickest Change to Deploy Loop?

For a new application deployed to EKS, it can be a struggle to make the tweaks, large and small, that put the app in a working state. Being able to quickly make a change and verify it works as intended is tough with a PR approval/merge process. I ended up using a bot account to approve my own PRs so I could make changes and merge them quickly, including backing out bad changes.

I know now that one pattern is to turn off auto-sync in Argo CD to prevent it from enforcing GitOps, then edit templates within Argo CD or use kubectl directly on the console. That should provide quick feedback that a change will have the intended impact and is safe to put in a PR. After the PR is merged, the app in Argo CD can be manually synced.
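The loop described above could be sketched with the `argocd` and `kubectl` CLIs roughly as follows. This is an assumption about how one might script it, not a documented team procedure; the application name, namespace, and manifest path are hypothetical, and the helper only builds and prints the commands rather than running them.

```python
import shlex


def fast_feedback_commands(app, namespace, manifest):
    """Build the CLI steps for a quick change/verify loop on an Argo CD app.

    Returns a list of argv lists; nothing is executed here.
    """
    return [
        # 1. Stop Argo CD from reverting manual changes to match git.
        ["argocd", "app", "set", app, "--sync-policy", "none"],
        # 2. Try the change directly against the cluster.
        ["kubectl", "apply", "-n", namespace, "-f", manifest],
        # 3. After the PR with the verified change is merged, sync manually.
        ["argocd", "app", "sync", app],
        # 4. Optionally hand control back to automated sync.
        ["argocd", "app", "set", app, "--sync-policy", "automated"],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in fast_feedback_commands("my-app", "my-namespace", "deployment.yaml"):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; each line can be pasted into a terminal that is already logged in to Argo CD and the cluster.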

3. Quickest Way to Verify Changes Beyond what Argo CD Provides in UI logs.

There are some ways an application can fail or not work as intended where the cause isn't visible in logs, or where the logs only point toward the problem without showing the cause itself. Being able to shell into a pod/container provides instant feedback. I had to dig around https://vfs.atlassian.net/wiki/spaces to find what I needed to configure kubectl. I eventually found this, which, when paired with my AWS account credentials, got me into a pod.
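For readers unfamiliar with the mechanics, a typical way to get a shell in a pod on EKS looks roughly like the sketch below. It assumes the AWS CLI and kubectl are installed and that the AWS credentials in use are authorized for the cluster; the cluster name, region, namespace, and pod name are hypothetical, and the actual instructions referenced above may differ.

```python
import shlex


def pod_shell_commands(cluster, region, namespace, pod, shell="/bin/sh"):
    """Build the CLI steps to open an interactive shell in a pod on EKS.

    Returns a list of argv lists; nothing is executed here.
    """
    return [
        # Write or refresh the kubeconfig entry for the cluster.
        ["aws", "eks", "update-kubeconfig", "--name", cluster, "--region", region],
        # List pods to find the one to inspect.
        ["kubectl", "get", "pods", "-n", namespace],
        # Start an interactive shell inside the pod's default container.
        ["kubectl", "exec", "-it", pod, "-n", namespace, "--", shell],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in pod_shell_commands("my-cluster", "us-east-1", "my-namespace", "my-pod"):
        print(shlex.join(cmd))
```

Note that `kubectl exec` goes through the Kubernetes API rather than SSH, so the pod does not need an SSH server. The container image does need to include the chosen shell.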

I'm comfortable with just a terminal and shell to get what I need done. The Kubernetes dashboard is no longer being maintained, but it has a bit more information than Argo CD. For example, the values of environment variables aren't stuffed into JSON output and are viewable even if they come from a secret.

Other Reference

Not everything in this section is relevant to developer tools but it may contain our thoughts on tools, or their non-existence when weighing whether we should migrate to EKS.

The decision document to remain on BRD for a while:
https://vfs.atlassian.net/wiki/spaces/ECP/pages/2043576463/CMS+Maintain+current+BRD+on+Amazon+Linux+2
https://github.com/department-of-veterans-affairs/va.gov-cms/issues/7330

An exceptional write up on Lagoon vs Argo CD on EKS: https://github.com/department-of-veterans-affairs/va.gov-cms/issues/6673

Markdown Version of Platform-CMS Team's Hosting Decision Matrix (Googlesheet)

|  | BRD on Amazon Linux 2 | Waypoint | Lagoon | AWS Elastic Beanstalk | ArgoCD + GHA |
| --- | --- | --- | --- | --- | --- |
| Details | Current state, already works | POC not started, a lot of assumptions https://www.waypointproject.io/ | POC already started, good findings | Some team members have prior experience | Essentially a do-it-yourself platform |
| Supported Features |  |  |  |  |  |
| Pros | - Can support 1 container instance - 1-4 months work - Supports newer versions of PHP | - open source (https://github.com/hashicorp/waypoint) - Hashicorp - proven to ship good stuff - we can split the build/deploy steps. build can happen in GHA - Hashicorp support > https://www.hashicorp.com/customer-success#compare-plans | - Drupal aware - Open source - Amazee company has been around for 15+ years - Proven to work for Drupal application flow, Australian GovCMS (https://www.govcms.gov.au/) - 1-4 months of work - xdebug on PROD - PROD container runs on LOCAL | - Generally falls within existing skillset and core competencies - Changes a small number of technologies | - Supported by platform operations - Public + private runners |
| Cons | - Need to update AL1 (Amazon Machine Image) to AL2 - Need to maintain Jenkins (or similar), Jenkins will be sunset in the near future | - tool is pretty early in its development | - POC next steps can be time consuming - K8s based needs expertise/ learning curve - Too big just for CMS | - A lot of functionality gaps to fill | - Many gaps to fill - Estimate 6-12 months of work |
| Open Questions | - Ask Ops team if any existing blockers on this. | - Cam needs AWS, EKS access to proceed | - Blocked: TIC boundary presents issues, networking issues | - Can this be run locally? NO lol | - |
| Next Steps | Explore if any previous AML2 work has been done for reference (Eric) | Need to start POC in order to clarify assumptions - Sprint 46 (Cam) | Proceed with evaluating Lagoon, pending cleared blocker (Nate) | Nate to take the lead on POC, possibly Sprint 4647? | Meet with Patrick V. (Platform) to understand how his team is approaching BRD to Argo migration (Elijah) |
| POC Ticket # | BRD/AL2 POC | WayPoint POC | Set up Lagoon on EKS | We should discuss whether this is a good thing to move forward with or not, given that it has a fairly poor score - Nate |  |
| Implementation time (1 = high, 5 = minimal) | 5 | 3 | 3 | 2 | 1 |
| Overall lifecycle maintenance (1 = high, 5 = minimal) | 4 | 4 | 3 | 2 | 2 |
| User experience (1 = poor, 5 = great) | 4 | 4 | 5 | 3 | 2 |
| Developer experience (1 = poor, 5 = great) | 4 | 4 | 4 | 3 | 2 |
| Cost effectiveness (1 = low, 5 = high) | 5 | 3 | 4 | 2 | 2 |
| Open source (1 = no, 3 = source available, 5 = yes) | 3 | 5 | 5 | 1 | 3 |
| Risk (1 = high, 5 = low) | 5 | 4 | 3 | 3 | 1 |
| Overall Score (highest is best) | 56 | 49 | 48 | 30 | 23 |

jbritt1 commented 1 year ago

The work we are currently doing around GitHub team synchronization with ArgoCD AppProject permissions and policies should solve this for us.

jhouse-solvd commented 1 year ago

There are a couple of issues in the review column that need to be tidied up. Then, this epic can likely be closed! :)

jhouse-solvd commented 1 year ago

Discussed this w/ @ph-One today. The definition of done has been met. This epic can be closed! :)

department-of-veterans-affairs / va.gov-team