department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Infrastructure Protocols - Establish Game-day and Stress Test Protocols #3641

Open ricetj opened 4 years ago

ricetj commented 4 years ago

Description:

Establish game days to test an aspect of the platform, or an assertion about it. Each game day will produce documented learnings and, where needed, remediation items.
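As a rough illustration of the protocol above, a single game day can be framed as: confirm the assertion holds in steady state, inject a fault, re-check the assertion, always roll the fault back, and record learnings and remediation items. The sketch below is hypothetical. `run_game_day`, `GameDayResult`, and the simulated `service` dictionary are illustrative names, not existing VSP tooling, and a real exercise would inject faults into actual infrastructure rather than into an in-memory dict.

```python
# Hypothetical game-day harness sketch; not an existing VSP tool.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GameDayResult:
    """Outcome of one game-day exercise (illustrative record format)."""
    name: str
    steady_state_before: bool
    assertion_held: bool
    learnings: List[str] = field(default_factory=list)
    remediation_items: List[str] = field(default_factory=list)


def run_game_day(
    name: str,
    assertion: Callable[[], bool],
    inject_fault: Callable[[], None],
    roll_back: Callable[[], None],
) -> GameDayResult:
    """Confirm steady state, inject a fault, re-check the assertion,
    and always roll the fault back afterwards."""
    # Abort if the system is not healthy before the exercise starts.
    if not assertion():
        return GameDayResult(
            name,
            steady_state_before=False,
            assertion_held=False,
            learnings=["Steady state did not hold before fault injection; exercise aborted."],
        )
    try:
        inject_fault()
        held = assertion()
    finally:
        # Roll back even if fault injection or the check raises.
        roll_back()
    result = GameDayResult(name, steady_state_before=True, assertion_held=held)
    if not held:
        result.learnings.append(f"Assertion failed during '{name}'.")
        result.remediation_items.append(f"Investigate the behavior observed in '{name}'.")
    return result


# Usage with a simulated service (no real infrastructure is touched).
service = {"healthy_replicas": 2}


def at_least_one_replica() -> bool:
    return service["healthy_replicas"] >= 1


def kill_one_replica() -> None:
    service["healthy_replicas"] -= 1


def restore_replicas() -> None:
    service["healthy_replicas"] = 2


result = run_game_day("lose one replica", at_least_one_replica, kill_one_replica, restore_replicas)
print(result.assertion_held)  # True: one replica remains during the fault
```

The `try`/`finally` is the main design choice: whatever happens during the exercise, the injected fault is undone, so a game day never leaves the environment degraded.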

Pain Points being addressed:

User Story

As a VSP Ops team member, I want to be prepared for incidents that could occur, so that developers on VA.gov have fewer interruptions while helping veterans.

Goals

Test parts of our system to learn how it reacts during an incident.

VSP OKRs

O3: Stability and resiliency of the Platform's systems and teams continue to improve.

KPIs

pnwstevan commented 4 years ago

AWS has recently open-sourced some internal chaos engineering tools that may be relevant to this effort:

https://github.com/amzn/awsssmchaosrunner

https://aws.amazon.com/blogs/opensource/building-resilient-services-at-prime-video-with-chaos-engineering/

CC: @dginther @ricetj