department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Capture events/metrics about general CMS product delivery performance. #13948

Closed ndouglas closed 1 year ago

ndouglas commented 1 year ago

Description

There are some metrics for CMS product delivery performance that we can and should be capturing, but that I don't believe we are capturing at present.

We should ensure that these metrics are being recorded and reported upstream to Datadog.

These involve changes to the BRD CD pipeline, so each one might be nontrivial and may need to be split off into its own issue.
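For illustration only, a pipeline step could report such an event upstream to Datadog along the lines of the sketch below; the library choice (datadogpy), event title, and tags are assumptions for the sketch, not anything our pipeline actually does.

```python
# Illustrative sketch: a BRD CD pipeline step reporting a deployment event to
# Datadog. The event title and tags are hypothetical.
import os
from datadog import initialize, api

initialize(
    api_key=os.environ["DATADOG_API_KEY"],
    app_key=os.environ["DATADOG_APP_KEY"],
)

api.Event.create(
    title="CMS production deploy finished",
    text="BRD CD pipeline finished deploying the CMS to production.",
    tags=["app:cms", "env:prod"],
)
```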

Events

Metrics

Acceptance Criteria

olivereri commented 1 year ago

I like it; I can't think of much else to contribute here. I think this is what Mike Chelen was essentially asking for when we went through the Staging Deploy and Test epic to reduce the overall time it takes. We just leaned on Jenkins job metrics to tell whether we were succeeding. The pitfall there is that we really can't go back and point to the data. Whereas if we implement this, it's a lot clearer and will provide historical context.

Deployment Frequency is a bit boring, but if we tracked it for Staging versus Production it might help us uncover issues with webhooks firing. If PR merges to main outnumber Staging deploys, that would indicate a problem, since the ratio should be 1 to 1.
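As a rough sketch of that 1:1 check, something like the following could flag merges outpacing Staging deploys; the metric names here are made up for the example, and whatever we actually emit would be substituted in.

```python
# Hypothetical check: compare merges to main against Staging deploys over the
# last day. Metric names are invented for the example.
import time
from datadog import initialize, api

initialize()  # assumes DATADOG_API_KEY / DATADOG_APP_KEY are set in the environment

now = int(time.time())
one_day_ago = now - 24 * 60 * 60

def total(query: str) -> float:
    """Sum every point returned for a Datadog metric query over the window."""
    resp = api.Metric.query(start=one_day_ago, end=now, query=query)
    return sum(
        point[1] or 0
        for series in resp.get("series", [])
        for point in series.get("pointlist", [])
    )

merges = total("sum:cms.main.merges{*}.as_count()")
staging_deploys = total("sum:cms.staging.deploys{*}.as_count()")

# A 1:1 ratio is expected; more merges than deploys suggests a webhook didn't fire.
if merges > staging_deploys:
    print(f"Possible missed webhook: {merges} merges vs {staging_deploys} staging deploys")
```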

productmike commented 1 year ago

@ndouglas @olivereri I'm good to move forward with this as a hypothesis (e.g., we think these metrics will best represent a good first slice of measurements based on what we know now). Who all has access to Datadog, and how can we best socialize these measurements in an ongoing manner (once we feel comfortable they are accurate and not especially negative)? Though we didn't have a chance to refine together (pretty much only story pointing left), I'm moving this into STRETCH for Nate to tear into when he's back next week.

Assuming 8 story points for the purposes of planning

ndouglas commented 1 year ago

I was intending to use events for A, B, C, D, and E, then use those events to calculate metrics. I don't think that's going to work well; it would seem to require setting a tag value on the event to the commit SHA, which would cause the custom metrics billing to scale with the number of commits we make! That would be very wasteful financially.
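To make the concern concrete, here is a sketch of the pattern being avoided (hypothetical metric name, using datadogpy's DogStatsD client): every distinct tag value becomes its own billable custom metric series, so tagging by SHA grows with commit count.

```python
# The pattern to avoid (hypothetical metric name): each unique commit_sha tag
# value creates a new custom metric series, so billing scales with commit count.
from datadog import statsd

def record_deploy_lead_time(commit_sha: str, seconds: float) -> None:
    statsd.gauge("cms.deploy.lead_time", seconds, tags=[f"commit_sha:{commit_sha}"])
```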

After some thinking, I decided to just use the commit timestamp stored in the Git history as the start point and compute the timing of each subsequent step relative to it.
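A minimal sketch of that timestamp-relative approach, assuming the pipeline step has the repository checked out; the metric name and step tag are illustrative rather than what actually ships:

```python
# Sketch: measure elapsed time from the Git commit timestamp to "now" for a
# pipeline step, reported tagged only by step name so cardinality stays small
# and fixed. Metric and tag names are illustrative.
import subprocess
import time

from datadog import statsd

def report_elapsed_since_commit(step: str, ref: str = "HEAD") -> None:
    commit_ts = int(
        subprocess.check_output(["git", "show", "-s", "--format=%ct", ref]).strip()
    )
    statsd.gauge(
        "cms.delivery.elapsed_seconds",
        time.time() - commit_ts,
        tags=[f"step:{step}"],
    )

report_elapsed_since_commit("prod_deploy")
```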

I don't think this actually changes anything, but wanted to note this for future reference. If we create similar issues in the future, we should beware of this cost scaling complication.

productmike commented 1 year ago

Thanks @ndouglas! To be sure I'm tracking: what is the custom metrics billing again (who owns it, how it's used, etc.)?

ndouglas commented 1 year ago

Custom metrics within Datadog. I believe the greater DSVA team owns Datadog, but I don't know how the billing, etc., works, to be honest. That hasn't complicated our team's life in the past, and I don't expect it will in the future, but I could be wrong.

I might be missing what you're asking, though 🙂

ndouglas commented 1 year ago

My PRs above should accomplish the latter four metrics, but not deployment lead time. I'll probably need to open a follow-up ticket for that.

ndouglas commented 1 year ago

This is all running and working, just need more data for the dashboard to look interesting and be useful.

productmike commented 1 year ago

Cool @ndouglas -- is this viewable in Datadog?

ndouglas commented 1 year ago

@productmike yeppers, check this out. Hopefully you can see that.

If not, it's a dashboard in Datadog called "[CMS] Product Delivery Metrics".