Closed BillChapmanUSDS closed 1 year ago
@BillChapmanUSDS I believe this is in progress, but I did not want to move it without chatting with you. Can you please move it unless you disagree? Thanks.
I believe this ticket might also need some tweaking, based on the conversations we have had about the approach, so that it is better aligned.
Current Status: SRE Advisory Group: Kyle, Lyndsay, Eiric.
Eric will be serving as SRE CoP lead.
Kickoff meeting to happen the week of 4/16.
and Clint...
The SRE kickoff meeting happens tomorrow.
The kickoff meeting has been completed. Currently working on the cadence of the next few touchpoints.
The next meeting will be the week of June 8th. A project board has also been created: https://github.com/orgs/department-of-veterans-affairs/projects/927/views/6
Problem Statement
Site reliability is an ongoing concern across VA.gov properties. To bring the platform in line with the wider SRE initiatives happening at VA, we need to build a Site Reliability team that serves to maintain best practices for SRE across Vets-API projects.
This work begins with a 6 to 8 month discovery sprint, which will be the foundational period for the SRE Community of Practice (CoP). We need two SREs who focus on SRE topics to help us build out Vets API specific SRE initiatives and culture. Each of these two SREs will also be able to step into the SRE CoP lead role, which is being filled by OCTO, if the lead is absent or unavailable.
Each of the four primary project teams will also have a member who wears an SRE lead hat. This person would attend the weekly SRE standups, bring the issues their team is facing to the attention of the SRE group, and help build out SRE-related projects on the team when prioritized.
A Note on SRE: A function created at Google, SRE is about driving shifts in how teams operate across a company. SRE teams are responsible for building automated solutions for operational aspects, such as incident response, on-call duties and performance monitoring. This does not mean that SRE teams are the front line response. SRE exists so that front line incident response is less common.
User Impact
Reliability issues affect every veteran using VA.gov services.
What do we not know about the problem space?
What (if any) research or discovery has been done?
Building SRE Teams; Adopting SLOs; SRE Priorities and Initiatives
What are the acceptance criteria?
How should we measure success?
TODOs