department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Spike: Content Build Dashboard #19469

Closed: gracekretschmer-metrostar closed 1 week ago

gracekretschmer-metrostar commented 2 weeks ago

User Story or Problem Statement

Before moving forward with creating a content build dashboard, the CMS team needs to understand what is technically feasible with content build data.

Description or Additional Context

Currently, content build (both the release of content and the editor experience) has a reputation for often being down or incredibly slow. We don’t have a way to quickly show stakeholders statistics on uptime, deployment times, and outages, so the reputation of our products is suffering. We do not want to focus on making content release faster in the wake of rolling out Next Build.

Currently, this data lives in Datadog and Github. We want to evaluate those sources.

Determine if the following questions can be answered with the data:

Acceptance Criteria

timcosgrove commented 1 week ago

It's worth noting that the list of failures shown in the Github Actions UI can be misleading. https://github.com/department-of-veterans-affairs/content-build/actions/workflows/content-release.yml?query=is%3Afailure

Failure, but quick recovery

Sometimes a failure is resolved very quickly. These two runs are failures because the CMS failed to respond when its status was checked:

However, in context, these did not take very much time:

So, even though this displayed as two successive failures, content release was in a failure state for only 7 minutes, a little more than 10% of the duration of a single successful content release.

Marked as failure in Github even though successful deploy

If any single process in the Github Actions workflow fails, the entire workflow is marked as a failure. This includes processes that take place after deployment is finished.

This workflow run deployed successfully. However, it failed during the 'notify success' step, which lets the CMS know that the workflow is finished: https://github.com/department-of-veterans-affairs/content-build/actions/runs/11297207768

This was otherwise a successful deploy and was marked as such in Datadog, but if you don't look more carefully, it would appear that the entire content release failed.

This workflow was marked as a failure in the deploy state, but it actually deployed successfully: https://github.com/department-of-veterans-affairs/content-build/actions/runs/11294862100

This one instead failed in the step where the CMS is notified. However, again, this was a successful content deploy.
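To make the distinction above concrete, here is a rough sketch of how a run could be classified by looking at its individual steps rather than the overall workflow conclusion. It assumes job data shaped like the Github REST API's jobs listing (each job has a `steps` list with `name` and `conclusion`). The step name `Deploy` is a placeholder I made up; the real step names in content-release.yml may differ.

```python
from typing import Mapping, Sequence

# Hypothetical step name, for illustration only. The actual name of the
# deploy step in content-release.yml may be different.
DEPLOY_STEP_NAME = "Deploy"


def deploy_succeeded(jobs: Sequence[Mapping]) -> bool:
    """Return True if any job has a deploy step that concluded 'success'."""
    for job in jobs:
        for step in job.get("steps", []):
            if (
                step.get("name") == DEPLOY_STEP_NAME
                and step.get("conclusion") == "success"
            ):
                return True
    return False


def classify_run(run_conclusion: str, jobs: Sequence[Mapping]) -> str:
    """Separate 'content did not deploy' from 'deployed, but a later step failed'."""
    if run_conclusion == "success":
        return "deployed"
    if deploy_succeeded(jobs):
        return "deployed_with_post_deploy_failure"
    return "not_deployed"


# Made-up example: the deploy step passed, the notification step failed.
example_jobs = [
    {
        "name": "content-release",
        "steps": [
            {"name": "Deploy", "conclusion": "success"},
            {"name": "Notify success", "conclusion": "failure"},
        ],
    }
]
print(classify_run("failure", example_jobs))
```

This would not catch a run where the deploy step itself is marked as failed even though content actually deployed. For that case, the success event sent to Datadog is probably the more reliable signal.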

Failures are often fast and repeat quickly

Sometimes there are failure states that do in fact require attention, but the way they manifest in Github Actions looks more alarming than the underlying problem. On September 16, 2024, there was a failure caused by the configuration of the Github Actions runner and how it was attempting to install Node.js packages. There were 87 marked failures over a 2 hour period. Again, this was not good and needed attention. However, if the releases had been successful, there would have been only 2 workflow runs during that same period.

The visibility of failed workflows when they are failing quickly and repeatedly can create a perception of alarm when the severity may not be that high. We can talk about the same incident in three ways:

All of these statements are true. I'm not saying this wasn't an issue that needed resolving, but I think the latter two statements represent the problem more accurately, whereas Github Actions puts the emphasis on the first.
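One way a dashboard could report this kind of situation is to collapse consecutive failed runs into a single incident and report both the count and the time spent in a failure state. The following is only a sketch under my own assumptions: the `Run` type and its fields are hypothetical, and the timestamps in the example are made up rather than taken from the September incident.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Tuple


class Run(NamedTuple):
    """Hypothetical minimal record of one workflow run."""
    started: datetime
    finished: datetime
    succeeded: bool


def failure_incidents(runs: List[Run]) -> List[Tuple[datetime, datetime, int]]:
    """Collapse consecutive failed runs into (start, end, failed_run_count).

    An incident starts when the first failed run starts and ends when the
    last failed run before the next success (or the end of data) finishes.
    """
    incidents = []
    current_start = None
    last_finish = None
    count = 0
    for run in sorted(runs, key=lambda r: r.started):
        if not run.succeeded:
            if current_start is None:
                current_start = run.started
            last_finish = run.finished
            count += 1
        elif current_start is not None:
            # A success closes the open incident.
            incidents.append((current_start, last_finish, count))
            current_start, count = None, 0
    if current_start is not None:
        incidents.append((current_start, last_finish, count))
    return incidents


# Made-up data: five one-minute failures, two minutes apart, then a success.
t0 = datetime(2024, 1, 1, 12, 0)
example_runs = [
    Run(t0 + timedelta(minutes=2 * i), t0 + timedelta(minutes=2 * i + 1), False)
    for i in range(5)
]
example_runs.append(Run(t0 + timedelta(minutes=10), t0 + timedelta(minutes=70), True))

for start, end, count in failure_incidents(example_runs):
    print(f"{count} failed runs, {end - start} in a failure state")
```

With the made-up data, the five red entries collapse into one incident lasting 9 minutes, which is closer to how the problem would be described in conversation.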

timcosgrove commented 1 week ago

CMS/Github Actions communication as a potential problem

The Content Release workflow has several touchpoints where the CMS and Github Actions communicate with each other. While these are necessary, they can also be a source of problems.

timcosgrove commented 1 week ago

Sources of information about Content Release

Github Actions

The most direct source of information about the Content Release process is the Github Actions workflow runs: https://github.com/department-of-veterans-affairs/content-build/actions/workflows/content-release.yml
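The same run data is also available outside the UI through the Github REST API's "list workflow runs for a workflow" endpoint. As a rough sketch, the helpers below build that URL and tally run conclusions from a response body. The helper names are my own, nothing here makes a network call, and a real request would also need authentication and pagination handling.

```python
from collections import Counter
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def workflow_runs_url(owner: str, repo: str, workflow_file: str, **params) -> str:
    """Build the URL for listing runs of one workflow, with optional filters."""
    base = f"{API_ROOT}/repos/{owner}/{repo}/actions/workflows/{workflow_file}/runs"
    return f"{base}?{urlencode(params)}" if params else base


def count_conclusions(payload: dict) -> Counter:
    """Tally the 'conclusion' field across the runs in a response body."""
    return Counter(run.get("conclusion") for run in payload.get("workflow_runs", []))


# Failed Content Release runs, 100 per page.
example_url = workflow_runs_url(
    "department-of-veterans-affairs",
    "content-build",
    "content-release.yml",
    status="failure",
    per_page=100,
)
print(example_url)
```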

Each individual run is explorable and contains a fair amount of information, including:

However, a few things about the interface make this a source of information that needs some caution:

Datadog

We send a fair amount of information into Datadog. We do this in two ways:

  1. Via custom metrics and events. The Content Release workflow calculates its own timing metrics for the various Content Release steps and also emits events (primarily a success or failure event per workflow run).
  2. Via the Datadog Actions Metrics Action, which sends information about all workflow runs in a particular repo into Datadog: https://github.com/int128/datadog-actions-metrics. This provides further timing information that can be taken advantage of.
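For the first approach, the timing calculation could look something like the sketch below. This is not the workflow's actual implementation: the metric prefix, step names, and timestamps are all made up for illustration, and sending the values to Datadog is left out.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Hypothetical metric prefix; the real custom metric names may differ.
METRIC_PREFIX = "content_release.step"


def step_durations(steps: List[Tuple[str, datetime, datetime]]) -> Dict[str, float]:
    """Map each (name, start, end) step to a metric name and duration in seconds."""
    return {
        f"{METRIC_PREFIX}.{name}.duration_seconds": (end - start).total_seconds()
        for name, start, end in steps
    }


# Made-up step names and timestamps.
example_metrics = step_durations(
    [
        ("build", datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 40)),
        ("deploy", datetime(2024, 1, 1, 12, 40), datetime(2024, 1, 1, 12, 50)),
    ]
)
for name, seconds in example_metrics.items():
    print(name, seconds)
```

Each value would then be submitted as a gauge, one point per workflow run, so the dashboard can chart step timings over time.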

Once the information is in Datadog, it can be pulled into visualizations.

Issues with visualization in Datadog:

gracekretschmer-metrostar commented 1 week ago

Thanks, Tim! I am going to add next steps for this work to Tuesday's PM <> PO sync, and I think your work here has been sufficient. I'm going to close the ticket.

timcosgrove commented 2 days ago

A POC dashboard is here: https://vagov.ddog-gov.com/dashboard/ihu-nuy-8dj/content-release-overview?fromUser=false&refresh_mode=sliding&from_ts=1729007437441&to_ts=1729612237441&live=true

This just tries to pull together the points asked for in the main part of the ticket in a clearer way.