department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Automate EKS Cluster Deployment and Tear Down #50671

Open raywangoctova opened 1 year ago

raywangoctova commented 1 year ago

Problem Statement

Currently, the EKS clusters that host vets-api and other applications are provisioned in manual, time-intensive ways. This creates overhead when Platform teams update, maintain, and otherwise manage these clusters. These tasks include:

In addition, developers do not have the ability to spin up and tear down clusters without risk of creating orphaned resources.

Our hypothesis is that we can solve these problems by implementing improved automation for the deployment, management, and decommissioning of these EKS clusters.

This would support the goal of zero-downtime deploys using blue/green deployment, and facilitate the ongoing maintenance and updates of the EKS control plane.

User Impact

Where was this problem reported?

What do we not know about the problem space?

What areas do we need more information about so that we can accurately solve the problem? What discovery is needed?

What (if any) research or discovery has been done?

This has been a topic of discussion since mid-to-late 2022, and extensive research has gone into use cases that our existing infrastructure cannot accommodate today.

What is the acceptance criteria?

How should we measure success?

Time to deploy a new EKS cluster
Time to tear down a new EKS cluster
Time to perform blue/green deploy of EKS cluster
# of orphaned resources remaining after EKS cluster teardown
# of manual actions required to update EKS cluster control plane version
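As a rough illustration of how the orphaned-resource metric might be counted, here is a minimal Python sketch. It assumes that resources created for a cluster carry the `kubernetes.io/cluster/<cluster-name>` tag with the value `owned`, and that an inventory of tagged resources has already been collected in the `ResourceARN` / `Tags` shape returned by the AWS Resource Groups Tagging API. The function name, cluster name, and sample ARNs are hypothetical and are not taken from this project.

```python
from typing import Dict, List


def find_orphans(resources: List[Dict], cluster_name: str) -> List[str]:
    """Return ARNs of resources still tagged as owned by a torn-down cluster.

    `resources` is assumed to be a list of dicts shaped like
    {"ResourceARN": str, "Tags": [{"Key": str, "Value": str}, ...]}.
    """
    owner_key = f"kubernetes.io/cluster/{cluster_name}"
    orphans = []
    for resource in resources:
        # Flatten the list of Key/Value pairs into a plain dict for lookup.
        tags = {tag["Key"]: tag["Value"] for tag in resource.get("Tags", [])}
        # "owned" marks resources tied to the cluster's lifecycle;
        # "shared" resources are expected to outlive the cluster.
        if tags.get(owner_key) == "owned":
            orphans.append(resource["ResourceARN"])
    return orphans


# Hypothetical inventory captured after tearing down "demo-cluster".
inventory = [
    {
        "ResourceARN": "arn:aws:ec2:us-east-1:123456789012:volume/vol-0aaa",
        "Tags": [{"Key": "kubernetes.io/cluster/demo-cluster", "Value": "owned"}],
    },
    {
        "ResourceARN": "arn:aws:ec2:us-east-1:123456789012:subnet/subnet-0bbb",
        "Tags": [{"Key": "kubernetes.io/cluster/demo-cluster", "Value": "shared"}],
    },
    {
        "ResourceARN": "arn:aws:ec2:us-east-1:123456789012:volume/vol-0ccc",
        "Tags": [{"Key": "kubernetes.io/cluster/other-cluster", "Value": "owned"}],
    },
]

orphans = find_orphans(inventory, "demo-cluster")
print(f"# of orphaned resources: {len(orphans)}")
```

Recording this count after every teardown would give a trend for the metric; a value of zero would indicate a clean decommission under the tagging assumption above.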

TODOs

mchelen-gov commented 1 year ago

Which of these other issues would be covered by this?

npeterson54 commented 1 year ago

Action items:

npeterson54 commented 1 year ago

Random thoughts; mostly just reiterating what was said yesterday. To avoid scope creep, I think we should break the last two AC out into their own project. They make an awesome project and do not fit into the problem statement of the issue.

EWashb commented 1 year ago

@npeterson54 I'm reading through all the documentation so that I can understand the landscape here. I'm curious if y'all have given more thought to the metrics of success that we can track. I see above you mentioned # of failures per month, but would love to jam about your ideas!

mchelen-gov commented 1 year ago

Quick process note: @EWashb should be the only one adding the "PO Endorsed" label to this project. Erika, as soon as you feel it is ready from a refinement perspective, please re-add. Thanks everyone for understanding!

little-oddball commented 1 year ago

If we are taking the PO Endorsed label off, that indicates that no work should be performed on the effort. This also means the effort should NOT be in the In Progress column. In a previous discussion w/ Bill and Mike C it was communicated that we SHOULD move forward. @EWashb, happy to chat about this and remove all the confusion, etc.

EWashb commented 1 year ago

This looks good to me. Thank you @npeterson54 for getting those metrics defined!

EWashb commented 1 year ago

@little-oddball thanks for your comments. Everything is fine from my perspective to start this work. Since I am now onboarded, hopefully that will not be a blocker or a source of confusion again.

annekerr49 commented 1 year ago

Erika, if this is your ticket, could you please add it to the DE Products - In Progress column? https://github.com/orgs/department-of-veterans-affairs/projects/940/views/2

EWashb commented 1 year ago

@annekerr49 it has now been added to the DE board.

EWashb commented 1 year ago

@npeterson54 as we are completing tasks within this epic, can you be sure to check off the ACs listed so that I can follow along with the progress? Thank you.

EWashb commented 10 months ago

@AparnaNittalaUSDS I'm going through our DE product board and added the start date to this, since I was still POing this team at that time. However, I don't know the target delivery date for this. Could you add it here?

annekerr49 commented 10 months ago

Aparna is checking with Nate Peterson as to status.

AparnaNittalaUSDS commented 10 months ago

@annekerr49 Jan 2024 is the latest date for the completion of this activity. Confirmed with Nate.

AparnaNittalaUSDS commented 4 months ago

Updating the target delivery date to February 28, 2025.

annekerr49 commented 1 month ago

What is the reason this project is targeting Feb 28, 2025 as the end date?

ph-One commented 1 month ago

This work has been on hold for many months while all of Platform was being migrated to Amazon Linux 2 based images (Amazon Linux 1 is end of life). We are beginning to get back to this work.