Open andreahewitt-odd opened 2 years ago
We should make a distinction between stability & performance across all API endpoints (including VFS) and stability & performance of Platform owned API endpoints.
For example:
api.va.gov/some/vfs/endpoint
- Target is <= 15 sec
api.va.gov
- Target is <= 2s
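A minimal sketch of how the two targets above could be tracked separately, assuming each request already carries an ownership label. The `Request` type, the `"vfs"`/`"platform"` labels, and the sample latencies are hypothetical and only illustrate the split; they are not taken from vets-api.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Latency targets quoted in the example above. The keys are hypothetical labels.
LATENCY_TARGET_SECONDS = {
    "vfs": 15.0,      # VFS owned endpoints
    "platform": 2.0,  # Platform owned endpoints
}


@dataclass(frozen=True)
class Request:
    owner: str              # "vfs" or "platform" (assumed to come from ownership metadata)
    latency_seconds: float  # observed response time


def fraction_within_target(requests: Iterable[Request], owner: str) -> Optional[float]:
    """Share of `owner` requests at or under that owner's target, or None if there were none."""
    target = LATENCY_TARGET_SECONDS[owner]
    latencies = [r.latency_seconds for r in requests if r.owner == owner]
    if not latencies:
        return None
    return sum(1 for s in latencies if s <= target) / len(latencies)


# Made-up sample values, for illustration only.
sample = [
    Request("vfs", 4.0),
    Request("vfs", 16.5),
    Request("platform", 0.4),
    Request("platform", 1.9),
    Request("platform", 2.0),
    Request("platform", 3.1),
]
vfs_ratio = fraction_within_target(sample, "vfs")
platform_ratio = fraction_within_target(sample, "platform")
```

Reporting the two ratios separately keeps a slow VFS endpoint from masking, or being blamed on, the Platform owned endpoints.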
Problem statement should be around stability and performance without the context of what's causing it.
Break up impact into VFS, Platform, and VA.gov users.
Problem statement: Overall stability of vets-api; external services are the likely cause
User Impact: Reflect significance of each group
Where was this reported? Support channel, incidents, postmortems
How well do we understand the problem?
AC
Success criteria
TODO:
Ok what I'm seeing here is some SLIs that are definitely applicable:
There are also metrics which are meaningful for debugging but not really representative of user impact, such as:
And some metrics which are relevant but not within Platform scope:
The acceptance criteria reflect the goal of having mutually understood and defined Platform SLIs and SLOs for vets-api, which are documented and visible in a single dashboard.
References for SLI and SLO definition in increasing levels of detail:
Some discovery needed about broader stability issues
Can anyone clarify what this means?
@little-oddball It looks like you accidentally overwrote the text of the issue in your last edit, so I have restored previous version.
Note from Platform Leadership FY23 planning onsite: vets-API latency: decomposition on endpoints that will benefit from isolation and scalability.
@ericboehs Eric do you have any ideas about what milestone we should attach to this product?
Removing first AC into its own criteria, story created here
Given the revised scope, here are the remaining tasks needed for completion.
Hey @mchelen-gov, With the recent pivot for this team and some of the team members in general, can you please clarify if you have a thought on when the remaining work would happen and who would be doing it?
This project's current scope should be wrapped up before any team pivots. Let's discuss more on Monday.
@mchelen-gov I sync'd with Nate today on the plan y'all discussed yesterday. Sounds good.
Request to post SLO doc to Platform website filed here: https://github.com/department-of-veterans-affairs/va.gov-team/issues/56821
Comment and a question:
Problem Statement
Stability and performance of vets-api are severely impacted when external services (such as BGS or EVSS) do not behave as expected. This can occur with any of the external services but tends to be more severe with services that we have no governance over, that are unowned, and/or that lack institutional knowledge and expertise behind them. We have done some sleuthing around this, and will continue, with indicators looking at improvements at both an internal and external component level.
User Impact
Platform teams:
VFS teams:
- vets-api affects stability of VFS owned vets-api endpoints
- vets-api status affects VFS teams' ability to troubleshoot their vets-api endpoints
Veterans/VA.gov users:
Where was this problem reported?
How well do we understand the problem?
- vets-api endpoints
What is the acceptance criteria?
- vets-api endpoints
- vets-api SLOs Draft Here
How should we measure success?
- vets-api SLO is not met
- vets-api status through dashboards
TODOs