department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Establish SLOs for Stability and Performance of Vets API #46778

Open andreahewitt-odd opened 2 years ago

andreahewitt-odd commented 2 years ago

Problem Statement

Stability and performance of vets-api are severely impacted when external services (such as BGS or EVSS) do not behave as expected. This can occur with any external service, but it tends to be more severe with services that we have no governance over, that are unowned, and/or that have a lack of institutional knowledge and expertise behind them. We have done some sleuthing around this and will continue, using indicators that track improvements at both the internal and external component level.

User Impact

Platform teams:

VFS teams:

Veterans/VA.gov users:

Where was this problem reported?

How well do we understand the problem?

What are the acceptance criteria?

How should we measure success?

TODOs

mchelen-gov commented 2 years ago

We should make a distinction between stability & performance across all API endpoints (including VFS) and stability & performance of Platform owned API endpoints.

For example:

api.va.gov/some/vfs/endpoint - Target is <= 15 sec
api.va.gov - Target is <= 2 sec
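A minimal sketch of how the two example targets above could be checked against measured response times. The 95th percentile, the `TARGETS` keys, and the function names are assumptions for illustration; the comment only gives the thresholds (15 sec and 2 sec), not which statistic they apply to.

```python
import math

# Latency targets in seconds, from the example in this comment.
# The "vfs" / "platform" keys are hypothetical labels.
TARGETS = {
    "vfs": 15.0,      # api.va.gov/some/vfs/endpoint
    "platform": 2.0,  # api.va.gov
}

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def meets_target(samples, owner, pct=95):
    """True if the pct-th percentile latency is within the owner's target.

    Using p95 is an assumption; the thread does not name a percentile.
    """
    return percentile(samples, pct) <= TARGETS[owner]
```

For instance, `meets_target([0.5, 1.0, 1.5, 3.0], "platform")` evaluates the slowest of the four samples (3.0 sec) against the 2 sec target, while the same samples would pass the 15 sec VFS target.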

andreahewitt-odd commented 1 year ago

Problem statement should be around stability and performance without the context of what's causing it.

andreahewitt-odd commented 1 year ago

break up impact into VFS, Platform and VA.gov users

raywangoctova commented 1 year ago

image (2).png

image (3).png

https://vagov.ddog-gov.com/dashboard/uhg-yyw-yyv/erics-dashboard-mon-aug-22-31427-pm?from_ts=1663691796956&to_ts=1663695396956&live=true

mchelen-gov commented 1 year ago

Problem statement: Overall stability of vets-api; external services are the likely cause

User Impact: Reflect significance of each group

Where was this reported? Support channel, incidents, postmortems

How well do we understand the problem?

AC

Success criteria

TODO:

mchelen-gov commented 1 year ago

https://vagov.ddog-gov.com/dashboard/uhg-yyw-yyv/erics-dashboard-mon-aug-22-31427-pm?from_ts=1663691796956&to_ts=1663695396956&live=true

Ok what I'm seeing here is some SLIs that are definitely applicable:

There are also metrics which are meaningful for debugging but not really representative of user impact, such as:

And some metrics which are relevant but not within Platform scope:

The acceptance criteria reflect the goal of having mutually understood and defined Platform SLIs and SLOs for vets-api which are documented and visible in a single dashboard.
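To make the SLI/SLO distinction concrete, here is a rough sketch of an availability SLI (the fraction of successful requests) and the error budget remaining under an SLO. The request counts and the 99% objective are hypothetical; this issue does not specify any SLO values for vets-api.

```python
def availability_sli(good, total):
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    return good / total if total else 1.0

def error_budget_remaining(good, total, slo):
    """Fraction of the error budget still unspent for the window.

    slo is the objective as a fraction (e.g. 0.99, a hypothetical value).
    Returns 1.0 when nothing is spent, 0.0 when the budget is exactly
    used up, and a negative number when the SLO has been violated.
    """
    allowed_failures = (1 - slo) * total
    if allowed_failures == 0:
        # No failures allowed (or no traffic): full budget only if none failed.
        return 1.0 if good == total else float("-inf")
    return 1 - (total - good) / allowed_failures
```

With 995 good requests out of 1000 and a 99% objective, the SLI is 0.995 and about half of the error budget is still available.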

mchelen-gov commented 1 year ago

References for SLI and SLO definition in increasing levels of detail:

annekerr49 commented 1 year ago

https://app.zenhub.com/files/133843125/91f4ee9a-797c-4dff-ad90-524cde55ff4f/download @andreahewitt-odd

mchelen-gov commented 1 year ago

Some discovery needed about broader stability issues

Can anyone clarify what this means?

mchelen-gov commented 1 year ago

@little-oddball It looks like you accidentally overwrote the text of the issue in your last edit, so I have restored previous version.

raywangoctova commented 1 year ago

Note from Platform Leadership FY23 planning onsite: vets-API latency: decomposition on endpoints that will benefit from isolation and scalability.

annekerr49 commented 1 year ago

@ericboehs Eric do you have any ideas about what milestone we should attach to this product?

npeterson54 commented 1 year ago

Removing first AC into its own criteria, story created here

mchelen-gov commented 1 year ago

Given the revised scope, here are the remaining tasks needed for completion.

jwoodman5 commented 1 year ago

Hey @mchelen-gov, With the recent pivot for this team and some of the team members in general, can you please clarify if you have a thought on when the remaining work would happen and who would be doing it?

mchelen-gov commented 1 year ago

> Hey @mchelen-gov, With the recent pivot for this team and some of the team members in general, can you please clarify if you have a thought on when the remaining work would happen and who would be doing it?

This project's current scope should be wrapped up before any team pivots. Let's discuss more on Monday.

jwoodman5 commented 1 year ago

@mchelen-gov I sync'd with Nate today on the plan y'all discussed yesterday. Sounds good.

BillChapmanUSDS commented 1 year ago

Request to post SLO doc to platform website filed here: https://github.com/department-of-veterans-affairs/va.gov-team/issues/56821

acrollet commented 1 year ago

Comment and a question:

annekerr49 commented 1 year ago

No matches for roadmap-DMC

annekerr49 commented 1 year ago

close