department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Agile COE Metrics (July) #11770

Closed. MickinSahni closed this issue 4 years ago.

MickinSahni commented 4 years ago

The VA Agile CoE looks for particular delivery metrics:

Lead Time (LT)

The CoE is requesting these data points for 1 month: July 1 - 31, 2020

Results

Current

Tickets closed in 2020-07-01 to 2020-07-31:

[Screenshots of the closed-ticket report]

Previous

Tickets closed in 2020-05-23 to 2020-06-19:

https://docs.google.com/spreadsheets/d/1ShapXU22mt0eID42x2187Xf58EON6SSzuhVK8VICRUo/edit#gid=736450740

short000 commented 4 years ago

From: Paul Short <pshort@governmentcio.com>
Date: Thu, Aug 6, 2020 at 5:09 PM
Subject: Re: FW: PSC Metrics VA.GOV
To: Murrill, Cameren P. (Favor Tech) <Cameren.Murrill@va.gov>
Cc: Salvato, Stephanie (Favor TechConsulting) <Stephanie.Salvato@va.gov>

After checking back with ops, here are my responses:

 [2] I've answered with Mean-Time-to-Repair (not Mean Time to Restore [a Service]). This is time to repair an emergency defect. 

I'm not sure what is meant by repairing an emergency defect. Is a repair considered a "pull-out" (or backed out)? Has your application ever gone down in production? If so, do you have metrics for how long it took to restore the application/services?

Response:

a. Mean Time to Repair (MTTR): the closest metric we track is percent of incidents (problems requiring a post-mortem) remediated: 100% remediated within seven days, with an average of 3 days.

b. Has the application ever gone down in production? Uptime is currently 99.9687%, not 100%, so the answer is technically/probably yes. I checked with ops on why: it was due to external factors, such as the VA TIC being down, which are outside of our control.
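For context on what that uptime figure implies, here is a quick back-of-the-envelope conversion from uptime percentage to downtime. The 31-day window is an assumption (the thread doesn't say over what period the 99.9687% was measured):

```python
# Back-of-the-envelope check, assuming 99.9687% uptime over a 31-day month.
SECONDS_PER_DAY = 24 * 60 * 60

def downtime_seconds(uptime_pct: float, days: int) -> float:
    """Total seconds of downtime implied by an uptime percentage."""
    return (1 - uptime_pct / 100) * days * SECONDS_PER_DAY

# 99.9687% uptime over 31 days implies roughly 14 minutes of downtime.
july_downtime = downtime_seconds(99.9687, 31)
print(f"{july_downtime / 60:.1f} minutes")  # ~14.0 minutes
```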

[3] Due to frequency of releases (daily), feature toggles, and lack of a waterfall release definition.

How many releases have you had in the past 6 months, and how many of those releases were backed out?

Response: Short (literal) answer: 732 and 0. An average of 4 deployments to prod a day, no rollbacks.

Detailed (more accurate) answer: I can only give you an answer in terms of deployments to production, NOT formal releases, so it's not going to be an apples-to-apples comparison. According to ops, we average 4 deployments to production per day if you count off-schedule deployments (we have a scheduled daily deployment at a minimum). Those are not equivalent to formal releases. Because of feature flags and small, rapid deployments, errors found in production are fixed forward, not backed out. As I mentioned before, we do not have the concept of application-wide formal releases as defined by VA PMAS, VIP, or even SAFe. We release fixes and changes daily, and features incrementally in two-week sprint cycles. VSA has 9 development teams that constantly release features. We are the public front end into the VA that also submits requests to VA enterprise systems, not a backend system of record.
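The "732 and 0" figure is consistent with the stated cadence. A minimal arithmetic check, assuming a six-month window of roughly 183 days (the exact window isn't stated):

```python
# Sanity check on the "732 and 0" answer, assuming ~183 days in 6 months
# and the stated average of 4 production deployments per day.
days = 183                 # ~6 months (assumption)
deploys_per_day = 4        # stated average
deployments = days * deploys_per_day
rollbacks = 0              # stated: errors are fixed forward, not backed out

change_fail_pct = rollbacks / deployments * 100
print(deployments, f"{change_fail_pct:.0f}%")  # 732 deployments, 0% change fail
```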

[4] This number isn't very useful because we start with automated unit and e2e tests and then add manual as needed, not the other way around. Automation by itself is not a measure of quality. 

Automated testing increases the depth and scope of testing, which directly helps improve software quality. OIT leadership is looking for the % of automation to understand where we (OIT) are with DevSecOps.

Response: Not disagreeing with that statement (testing pyramids and all that), but again, by itself and without context, % automation is not a measure of quality. As you mentioned, it might help with "% of automation to understand where we (OIT) are with DevSecOps". We started with automation first, then added manual tests, so you'll need that context. As manual tests are added, the graph may move in the opposite direction compared to systems that started with manual testing.
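A hypothetical illustration of that last point: when a suite starts automation-first and manual tests are added afterward, the "% automated" metric falls even as total coverage grows. The test counts below are made up:

```python
# Illustration: % automated drops as manual tests are added to an
# automation-first suite, even though total coverage is increasing.
def pct_automated(automated: int, manual: int) -> float:
    return automated / (automated + manual) * 100

print(f"{pct_automated(870, 130):.0f}%")  # 87% -- automation-first baseline
print(f"{pct_automated(870, 300):.0f}%")  # 74% -- after adding manual tests,
                                          # the metric falls despite more coverage
```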

  [5] We don't have a formal waterfall/VIP-style release process AFAIK. We release fixes and minor changes daily when needed, and release moderate changes in two-week sprint cycles for each of our teams. Large features and integration efforts may span sprints, but we use feature flags and toggles to deliver the prerequisites earlier. We push changes out as soon as they are ready and tested.

When you say you release fixes and minor changes daily, is this released to end-users or is it just deployed to production servers?

Response: Both (as reflected in #3).
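A minimal sketch of the deploy-vs-release distinction behind that answer: code can be deployed to production daily while a feature flag keeps it hidden from end users. The flag name and rendering functions below are hypothetical, not VA.gov's actual implementation:

```python
# Sketch: deployed code sits behind a flag until it is "released" to users.
def render_new_ui(user_id: str) -> str:
    return f"new claim-status UI for {user_id}"

def render_legacy_ui(user_id: str) -> str:
    return f"legacy claim-status UI for {user_id}"

FLAGS = {"new_claim_status_ui": False}  # deployed daily, released when flipped

def claim_status_page(user_id: str) -> str:
    # Flipping the flag "releases" the feature without a new deployment.
    if FLAGS["new_claim_status_ui"]:
        return render_new_ui(user_id)
    return render_legacy_ui(user_id)

print(claim_status_page("user-123"))  # legacy UI until the flag is flipped
```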

| n | Metric | VA Definition | VA.GOV Metric |
|---|--------|---------------|---------------|
| 1 | Average Lead Time | From the time an item (story) is created on a product team backlog to the time it is resolved (1 month of data: July 1, 2020 – July 31, 2020). Hint: if you want, send a report from your ALM tool that includes all user stories and defects "resolved" after May 23, 2020 (irrespective of when they were created or deployed to production), and we can calculate the average lead time. The report/spreadsheet fields should include Id, Work Item Type, Summary, Status, Creation date, and Resolution date. | 20d |
| 2 | Defect Lead Time | From the time a defect is created on a product team backlog to the time it is resolved (1 month of data: July 1, 2020 – July 31, 2020). Hint: if you send this data along with #1 as a report, we don't need anything else. Otherwise, we need all defects resolved after May 23, 2020 (irrespective of when they were created or deployed to production). | 19.5d |
| 3 | Deployment Frequency/Release Cadence to Production | Average amount of time (days) between releases to production (measured from January 31, 2019 to July 31, 2020). Hint: include patches; #3 and #5 go together. | 1 day [1] |
| 4 | Mean Time to Restore | How long it takes to restore service for the primary application or service when a service incident (e.g., unplanned outage, service impairment) occurs (current MTTR). Hint: hypothesize: if the system under your control went down today, how long would it take to come back up, on average? | 3 days [2] |
| 5 | Change Fail Percentage | Number of production releases that had to be backed out and/or rolled back to a previous build/version, as a share of total production releases by month (from January 31, 2019 to July 31, 2020). Hint: release data back 6 months; #3 and #5 go together. | 0% [3] |
| 6 | % Test Cases Automated | Number of automated test cases divided by total number of test cases. | 87% [4] |
| 7 | Last Release Date | Last release date into IOC prod or production. | N/A [5] |

[1] We automatically release daily, but can also do on-demand releases more frequently when needed.

[2] I've answered with Mean-Time-to-Repair (not Mean Time to Restore [a Service]). This is time to repair an emergency defect.

[3] Due to frequency of releases (daily), feature toggles, and lack of a waterfall release definition.

[4] This number isn't very useful because we start with automated unit and e2e tests and then add manual tests as needed, not the other way around. Automation by itself is not a measure of quality.

[5] We don't have a formal waterfall/VIP-style release process AFAIK. We release fixes and minor changes daily when needed, and release moderate changes in two-week sprint cycles for each of our teams. Large features and integration efforts may span sprints, but we use feature flags and toggles to deliver the prerequisites earlier. We push changes out as soon as they are ready and tested.
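For reference, the lead-time calculation described in rows 1 and 2 can be sketched directly from such a report: average days from creation to resolution over all items resolved in the window. This assumes ISO-formatted dates and the column names from the hint; the file name is hypothetical:

```python
# Sketch of the CoE's lead-time metric: average days from Creation date to
# Resolution date over all items resolved within a given window.
import csv
from datetime import date

def avg_lead_time_days(path: str, start: date, end: date) -> float:
    durations = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            resolved = date.fromisoformat(row["Resolution date"])
            if start <= resolved <= end:  # resolved in the window,
                # irrespective of when the item was created
                created = date.fromisoformat(row["Creation date"])
                durations.append((resolved - created).days)
    return sum(durations) / len(durations)

# Items resolved July 1-31, 2020; filter on Work Item Type for row 2.
print(avg_lead_time_days("resolved_items.csv", date(2020, 7, 1), date(2020, 7, 31)))
```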