department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Automate EKS Cluster Deployment and Tear Down #50671

Open raywangoctova opened 1 year ago

raywangoctova commented 1 year ago

Status

Update each sprint until completed.

| Date | Status | Launch Date (see above) | Notes |
| --- | --- | --- | --- |
| 11/18/24 | In-Progress | On-Track | No new updates. |
| 10/28/24 | In-Progress | On-Track | Team is still reviewing EKS 1.31 upgrade, which introduces some changes that we need to account for. |
| 10/21/24 | In-Progress | On-Track | Argo authentication updates and reviewing changes needed for EKS 1.31 (latest EKS release). |
| 10/14/24 | In-Progress | On-Track | Breaking out authentication and improving scaling operations (lowering time it takes for applications and the cluster to scale). |
| 10/7/24 | In-Progress | On-Track | Continuing to move everything into ECR as well as constant small improvements. |
| 9/30/24 | In-Progress | On-Track | Moving the remaining couple of images into ECR. Also modifying warm pool and over provisioning for faster scaling operations. |
| 9/23/24 | In-Progress | On-Track | v.Next clusters are getting continuous upgrades. |
| 9/16/24 | In-Progress | On-Track | Enabled the ability to not expose Argo, per Security request. |
| 9/9/24 | In-Progress | On-Track | Team is testing a new Argo change to prevent the Argo UI from getting deployed for Production clusters to help with Security concerns regarding prod access via SOCKS. This is for new clusters, and new Argo, not existing. |
| 9/3/24 | In-Progress | On-Track | Moving to protect Argo in Production based on suggestions. Have deployed and torn down hundreds of clusters thus far. |
| 8/26/24 | In-Progress | On-Track | With the new IPs provided by VAEC last week, the team has added them to DEV. Currently using the new 100.x CIDR blocks for our new (Platform v.Next) clusters. |
| 8/19/24 | In-Progress | On-Track | Team had a meeting with VAEC, they are going to give us 3 100.x/24 CIDRs for our dev VPC. This should be enough to unblock our work, and we can adjust from there if more are needed. |
| 8/6/24 | In-Progress | On-Track | Reviewing the private IPs and working through what is needed for the VPCs and the existing PPCs with Reliability. |

Problem Statement

Currently, the EKS clusters that host vets-api and other applications are provisioned in manual, time-intensive ways. This creates overhead when Platform teams perform updates and maintenance, and in the overall management of these clusters. These tasks include:

In addition, developers do not have the ability to spin up and tear down clusters without risk of creating orphaned resources.

Our hypothesis is that we can solve these problems by implementing improved automation for the deployment, management, and decommissioning of these EKS clusters.

This would support the goal of zero-downtime deploys using blue/green deployment, and facilitate the ongoing maintenance and updates of the EKS control plane.

User Impact

Where was this problem reported?

What do we not know about the problem space?

What areas do we need more information about so that we can accurately solve the problem? What discovery is needed?

What (if any) research or discovery has been done?

This has been a topic since mid-to-late 2022, and has undergone extensive research into use cases that we cannot accommodate today with our existing infrastructure.

What are the acceptance criteria?

How should we measure success?

- Time to deploy a new EKS cluster
- Time to tear down a new EKS cluster
- Time to perform blue/green deploy of EKS cluster
- Number of orphaned resources remaining after EKS cluster teardown
- Number of manual actions required to update EKS cluster control plane version
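As a rough illustration of how the timing and orphaned-resource metrics above could be computed, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Resource` shape, the `cluster_tag` field, and the idea that leftover resources can be identified by a tag naming the cluster that created them. None of it comes from the team's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Resource:
    """A cloud resource as reported by some inventory source (hypothetical shape)."""
    resource_id: str
    cluster_tag: str  # assumed tag linking the resource to the cluster that created it


def elapsed(started: datetime, finished: datetime) -> timedelta:
    """Duration of a deploy, teardown, or blue/green cutover."""
    if finished < started:
        raise ValueError("finished must not precede started")
    return finished - started


def orphaned_resources(remaining: list[Resource], torn_down_cluster: str) -> list[Resource]:
    """Resources still present after teardown that carry the torn-down cluster's tag."""
    return [r for r in remaining if r.cluster_tag == torn_down_cluster]
```

Tracking these values per deploy and per teardown would give a trend over time. The count of manual actions would still need to be recorded by hand or derived from a runbook.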

mchelen-gov commented 1 year ago

Which of these other issues would be covered by this?

npeterson54 commented 1 year ago

Action items:

npeterson54 commented 1 year ago

Random thoughts; mostly just reiterating what was said yesterday. To avoid scope creep, I think we should break the last two AC out into their own project. They make an awesome project and do not fit into the problem statement of the issue.

EWashb commented 1 year ago

@npeterson54 I'm reading through all the documentation so that I can understand the landscape here. I'm curious if y'all have given more thought into the metrics of success that we can track. I see above you mentioned # of failures per month, but would love to jam about your ideas!

mchelen-gov commented 1 year ago

Quick process note: @EWashb should be the only one adding the "PO Endorsed" label to this project. Erika, as soon as you feel it is ready from a refinement perspective, please re-add. Thanks everyone for understanding!

little-oddball commented 1 year ago

If we are taking the PO Endorsed label off, that indicates that no work should be performed on the effort. This also means the effort should NOT be in the In Progress column. In a previous discussion w/ Bill and Mike C it was communicated that we SHOULD move forward. @EWashb, happy to chat about this and remove all the confusion, etc.

EWashb commented 1 year ago

This looks good to me. Thank you @npeterson54 for getting those metrics defined!

EWashb commented 1 year ago

@little-oddball thanks for your comments. Everything is fine from my perspective to start this work. Since I am now onboarded, hopefully that will not be a blocker or a source of confusion again.

annekerr49 commented 1 year ago

Erika, if this is your ticket, could you please add it to the DE Products - In Progress column? https://github.com/orgs/department-of-veterans-affairs/projects/940/views/2

EWashb commented 1 year ago

@annekerr49 it has now been added to the DE board.

EWashb commented 1 year ago

@npeterson54 as we are completing tasks within this epic, can you be sure to check off the ACs listed so that I can follow along with the progress? Thank you.

EWashb commented 1 year ago

@AparnaNittalaUSDS I'm going through our DE product board and added the start date to this, since I was still POing this team at that time. However, I don't know the target delivery date for this. Could you add it here?

annekerr49 commented 1 year ago

Aparna is checking with Nate Peterson as to status.

AparnaNittalaUSDS commented 1 year ago

@annekerr49 Jan 2024 is the latest date for the completion of this activity. Confirmed with Nate.

AparnaNittalaUSDS commented 7 months ago

Updating the target delivery date to February 28, 2025

annekerr49 commented 4 months ago

What is the reason this project is targeting Feb 28, 2025 as the end date?

ph-One commented 4 months ago

This work has been on hold for many months while all of Platform was being migrated to Amazon Linux 2 based images (Amazon Linux 1 is end of life). We are beginning to get back to this work.