department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Capture events/metrics about general CMS product delivery performance. #13948

Closed ndouglas closed 1 year ago

ndouglas commented 1 year ago

Description

There are some metrics for CMS product delivery performance that we can and should be capturing, but that I don't believe we are capturing at present.

We should ensure that these metrics are being recorded and reported upstream to Datadog.

These involve changes to the BRD CD pipeline, so each one might be nontrivial and may need to be split off into its own issue.
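For illustration only, a pipeline step could report such an event upstream to Datadog along the lines of the sketch below; the library choice (datadogpy), event title, and tags are assumptions for the sketch, not anything our pipeline actually does.

```python
# Illustrative sketch: a BRD CD pipeline step reporting a deployment event to
# Datadog. The event title and tags are hypothetical.
import os
from datadog import initialize, api

initialize(
    api_key=os.environ["DATADOG_API_KEY"],
    app_key=os.environ["DATADOG_APP_KEY"],
)

api.Event.create(
    title="CMS production deploy finished",
    text="BRD CD pipeline finished deploying the CMS to production.",
    tags=["app:cms", "env:prod"],
)
```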

Events

Metrics

Acceptance Criteria

olivereri commented 1 year ago

I like it; I can't think of much else to contribute here. I think this is what Mike Chelen was essentially asking for when we went through the Staging Deploy and Test epic to reduce the overall time it takes. We just leaned on Jenkins job metrics to tell whether we were succeeding. The pitfall there is that we really can't go back and point to the data. Whereas if we implement this, it's a lot clearer and will provide historical context.

Deployment Frequency is a bit boring, but if we tracked it for Staging versus Production it might help us uncover issues with webhooks firing. If PR merges to main outnumber Staging deploys, that would indicate a problem, since the ratio should be 1 to 1.
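As a rough sketch of that 1:1 check, something like the following could flag merges outpacing Staging deploys; the metric names here are made up for the example, and whatever we actually emit would be substituted in.

```python
# Hypothetical check: compare merges to main against Staging deploys over the
# last day. Metric names are invented for the example.
import time
from datadog import initialize, api

initialize()  # assumes DATADOG_API_KEY / DATADOG_APP_KEY are set in the environment

now = int(time.time())
one_day_ago = now - 24 * 60 * 60

def total(query: str) -> float:
    """Sum every point returned for a Datadog metric query over the window."""
    resp = api.Metric.query(start=one_day_ago, end=now, query=query)
    return sum(
        point[1] or 0
        for series in resp.get("series", [])
        for point in series.get("pointlist", [])
    )

merges = total("sum:cms.main.merges{*}.as_count()")
staging_deploys = total("sum:cms.staging.deploys{*}.as_count()")

# A 1:1 ratio is expected; more merges than deploys suggests a webhook didn't fire.
if merges > staging_deploys:
    print(f"Possible missed webhook: {merges} merges vs {staging_deploys} staging deploys")
```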

productmike commented 1 year ago

@ndouglas @olivereri I'm good to move forward with this as a hypothesis (e.g., we think these metrics will best represent a good first slice of measurements based on what we know now). Who all has access to Datadog, and how can we best socialize these measurements in an ongoing manner (once we feel comfortable they are accurate and not especially negative)? Though we didn't have a chance to refine together (pretty much only story pointing left), I'm moving this into STRETCH for Nate to tear into when he's back next week.

Assuming 8 story points for the purposes of planning

ndouglas commented 1 year ago

I was intending to use events for A, B, C, D, and E, then use those events to calculate metrics. I don't think that's going to work well; it would seem to require setting a tag value on the event to the commit SHA, which would cause the custom metrics billing to scale with the number of commits we make! That would be very wasteful financially.
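To make the concern concrete, here is a sketch of the pattern being avoided (hypothetical metric name, using datadogpy's DogStatsD client): every distinct tag value becomes its own billable custom metric series, so tagging by SHA grows with commit count.

```python
# The pattern to avoid (hypothetical metric name): each unique commit_sha tag
# value creates a new custom metric series, so billing scales with commit count.
from datadog import statsd

def record_deploy_lead_time(commit_sha: str, seconds: float) -> None:
    statsd.gauge("cms.deploy.lead_time", seconds, tags=[f"commit_sha:{commit_sha}"])
```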

After some thinking, I decided to just use the commit timestamp stored in the Git history as the start point and compute the timing of each subsequent step relative to it.
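A minimal sketch of that timestamp-relative approach, assuming the pipeline step has the repository checked out; the metric name and step tag are illustrative rather than what actually ships:

```python
# Sketch: measure elapsed time from the Git commit timestamp to "now" for a
# pipeline step, reported tagged only by step name so cardinality stays small
# and fixed. Metric and tag names are illustrative.
import subprocess
import time

from datadog import statsd

def report_elapsed_since_commit(step: str, ref: str = "HEAD") -> None:
    commit_ts = int(
        subprocess.check_output(["git", "show", "-s", "--format=%ct", ref]).strip()
    )
    statsd.gauge(
        "cms.delivery.elapsed_seconds",
        time.time() - commit_ts,
        tags=[f"step:{step}"],
    )

report_elapsed_since_commit("prod_deploy")
```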

I don't think this actually changes anything, but wanted to note this for future reference. If we create similar issues in the future, we should beware of this cost scaling complication.

productmike commented 1 year ago

Thanks @ndouglas! To be sure I'm tracking: what is the custom metrics billing again (who owns it, how it's used, etc.)?

ndouglas commented 1 year ago

Custom metrics within Datadog. I believe the greater DSVA team owns Datadog, but I don't know how the billing, etc., works, to be honest. That hasn't complicated our team's life in the past, and I don't expect it will in the future, but I could be wrong.

I might be missing what you're asking, though 🙂

ndouglas commented 1 year ago

My PRs above should accomplish the latter four metrics, but not deployment lead time. I'll probably need to open a follow-up ticket for that.

ndouglas commented 1 year ago

This is all running and working, just need more data for the dashboard to look interesting and be useful.

productmike commented 1 year ago

Cool @ndouglas -- is this viewable in Datadog?

ndouglas commented 1 year ago

@productmike yeppers, check this out. Hopefully you can see that.

If not, it's a dashboard in Datadog called "[CMS] Product Delivery Metrics".