Spike: Content Build Dashboard

gracekretschmer-metrostar commented 2 weeks ago

User Story or Problem Statement

Before moving forward with creating a content build dashboard, the CMS team needs to understand what is technically feasible with content build data.

Description or Additional Context

Currently, content build (both the release of content and the editor experience) has a reputation for often being down or incredibly slow. We don't have a way to quickly show stakeholders statistics on uptime, deployment times, and outages, so the reputation of our products is suffering. We do not want to focus on making content release faster in the wake of rolling out Next Build.

Currently, this data lives in Datadog and Github. We want to evaluate those sources.

Determine if the following questions can be answered with the data:

The number of times 2 or more sequential content release failures occurred (adjustable by time period)
The number of successful content releases (adjustable by time period)
The average duration of a content release (adjustable by time period)
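As a rough illustration of how the three questions above could be computed once run data has been exported from either source, here is a minimal Python sketch. The `Run` record and the `summarize` function are hypothetical names, not part of any existing tooling, and the sample runs are made up. The sketch also assumes that the average should be taken over successful releases only, which is a choice the ticket does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby


@dataclass
class Run:
    """Hypothetical record for one content release workflow run."""
    started: datetime
    finished: datetime
    succeeded: bool


def summarize(runs, window_start, window_end, min_streak=2):
    """Answer the three dashboard questions for runs started inside a window."""
    in_window = sorted(
        (r for r in runs if window_start <= r.started < window_end),
        key=lambda r: r.started,
    )

    # Count groups of consecutive failures that are at least min_streak long.
    failure_streaks = 0
    for succeeded, group in groupby(in_window, key=lambda r: r.succeeded):
        if not succeeded and len(list(group)) >= min_streak:
            failure_streaks += 1

    successes = [r for r in in_window if r.succeeded]

    # Assumption: average only over successful releases.
    if successes:
        total = sum((r.finished - r.started for r in successes), timedelta())
        average = total / len(successes)
    else:
        average = None

    return {
        "failure_streaks": failure_streaks,
        "successes": len(successes),
        "average_success_duration": average,
    }


# Made-up sample data: success, two quick failures, success, one failure, success.
t0 = datetime(2024, 10, 11, 18, 0)
m = timedelta(minutes=1)
sample_runs = [
    Run(t0, t0 + 60 * m, True),
    Run(t0 + 61 * m, t0 + 62 * m, False),
    Run(t0 + 62 * m, t0 + 63 * m, False),
    Run(t0 + 70 * m, t0 + 120 * m, True),
    Run(t0 + 121 * m, t0 + 122 * m, False),
    Run(t0 + 130 * m, t0 + 200 * m, True),
]
result = summarize(sample_runs, t0, t0 + timedelta(days=1))
print(result)
```

Changing `window_start` and `window_end` is what makes each figure adjustable by time period.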

Acceptance Criteria

[x] A technical recommendation on which data questions to answer in a content build dashboard.

timcosgrove commented 1 week ago

It's worth noting that the list of failures shown in the Github Actions UI can be misleading. https://github.com/department-of-veterans-affairs/content-build/actions/workflows/content-release.yml?query=is%3Afailure

Failure, but quick recovery

Sometimes a failure is very quick. These two are failures, because the CMS failed to respond when its status was checked:

However, in context, these did not take very much time:

End time of previous successful release: Fri, 11 Oct 2024 19:08:15 GMT
Start time for first failure: Fri, 11 Oct 2024 19:08:43 GMT
End time for second failure: Fri, 11 Oct 2024 19:10:52 GMT
Begin time of next successful release: Fri, 11 Oct 2024 19:15:34 GMT

So, even though this displayed as two successive failures, the release process was in a failure state for only about 7 minutes, a little more than 10% of the duration of a single successful content release.
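The arithmetic behind that figure can be checked directly from the four timestamps listed above. This is only a small sketch; the variable names are mine.

```python
from email.utils import parsedate_to_datetime

# Timestamps copied from the list above (RFC 1123 style, GMT).
previous_success_end = parsedate_to_datetime("Fri, 11 Oct 2024 19:08:15 GMT")
first_failure_start = parsedate_to_datetime("Fri, 11 Oct 2024 19:08:43 GMT")
second_failure_end = parsedate_to_datetime("Fri, 11 Oct 2024 19:10:52 GMT")
next_success_start = parsedate_to_datetime("Fri, 11 Oct 2024 19:15:34 GMT")

# Gap between the end of one good release and the start of the next.
gap_between_good_releases = next_success_start - previous_success_end

# Time during which a failing run was actually executing.
time_spent_failing = second_failure_end - first_failure_start

print(gap_between_good_releases)  # 0:07:19
print(time_spent_failing)         # 0:02:09
```

The roughly 7 minute figure is the full gap between good releases; the two failed runs themselves occupied only a little over 2 minutes of it.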

Marked as failure in Github even though successful deploy

If any single process in the Github Actions workflow fails, the entire workflow is marked as a failure. This includes processes that take place after deployment is finished.

This workflow run deployed successfully. However, it failed during the 'notify success' step, which lets the CMS know that the workflow is finished: https://github.com/department-of-veterans-affairs/content-build/actions/runs/11297207768

This was otherwise a successful deploy, and was marked as such in Datadog; however, it would appear that the entire content release failed unless you look more carefully.

This workflow was marked as a failure in the deploy state, but it actually deployed successfully: https://github.com/department-of-veterans-affairs/content-build/actions/runs/11294862100

This one instead failed in the step where the CMS is notified. However, again, this was a successful content deploy.

Failures are often fast and repeat quickly

Sometimes there are failure states that do in fact require attention, but the way they manifest in Github Actions is alarming. On September 16, 2024, there was a failure due to configuration on the Github Actions runner and how it was attempting to install Node.js packages. There were 87 marked failures over a 2 hour period. Again, this was not good and needed attention. If the releases had been successful, there would have been only 2 workflow runs during that same period.

The visibility of failed workflows when they are failing quickly and repeatedly can create a perception of alarm when the severity may not be that high. We can talk about the same incident in three ways:

There were 87 failures in a 2 hour period.
Content did not go out for 2 hours.
Two scheduled content releases did not go out.

All these statements are true. While I'm not intending to say that this wasn't an issue that needed resolving, I am of the opinion that the latter two statements more accurately represent the problem; but, Github Actions puts emphasis on the first statement.
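To make the three framings concrete, here is a minimal Python sketch that derives all of them from the same burst of failures. The `describe_incident` function is hypothetical, the timestamps are synthetic (only the count of 87 and the 2 hour span come from the incident described above), and the one hour typical release duration is an assumption inferred from "only 2 workflow runs during that same period".

```python
from datetime import datetime, timedelta


def describe_incident(failure_times, typical_release):
    """Summarize one burst of consecutive failed runs in three ways.

    failure_times: start times of the failed runs in the burst.
    typical_release: assumed duration of one successful release.
    """
    first, last = min(failure_times), max(failure_times)
    outage = last - first
    return {
        "failed_runs": len(failure_times),
        "outage": outage,
        # Rough estimate: how many full releases would have fit in the outage.
        "missed_releases": outage // typical_release,
    }


# Synthetic data: 87 failures spread evenly over exactly 2 hours.
# The time of day is made up.
start = datetime(2024, 9, 16, 12, 0)
failures = [start + (timedelta(hours=2) * i) / 86 for i in range(87)]

incident = describe_incident(failures, typical_release=timedelta(hours=1))
print(incident)
```

A dashboard built on the last two values rather than the raw failure count would put the emphasis on impact rather than on the number of red runs.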

timcosgrove commented 1 week ago

CMS/Github Actions communication as a potential problem

The Content Release workflow has several touchpoints where the CMS and Github Actions communicate with each other. While these are necessary, to some degree these can be a source of problems.

CMS is responsible for sending an API call to Github Actions to initiate the Content Release workflow.
GHA communicates back to the CMS several times during the Content Release process, apart from the main GraphQL query. Each of these communications is preceded by a 'Wait for CMS' workflow step that makes sure the CMS is available. These steps are a frequent source of slowdown and job failure. The CMS may be unreachable for any number of reasons; most frequently these failures happen during the daily CMS deploy.
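To show why a wait step of this kind can contribute both slowdown and failure, here is a minimal, hypothetical Python sketch of a retry loop. It is not the actual 'Wait for CMS' step; the function name, attempt count, and delay are all assumptions, and the availability check is passed in as a callable so the sketch does not depend on any real CMS endpoint.

```python
import time


def wait_for_cms(is_available, attempts=5, delay_seconds=10.0, sleep=time.sleep):
    """Poll until is_available() returns True; return the attempt number.

    Raises TimeoutError when every attempt fails. In a workflow, that
    exception would fail the step and therefore the whole run.
    """
    for attempt in range(1, attempts + 1):
        if is_available():
            return attempt
        if attempt < attempts:
            # Each unsuccessful check adds delay_seconds to the run time.
            sleep(delay_seconds)
    raise TimeoutError(f"CMS not available after {attempts} attempts")


# Usage with a fake check that succeeds on the third try.
responses = iter([False, False, True])
waits = []
attempts_needed = wait_for_cms(lambda: next(responses), sleep=waits.append)
print(attempts_needed, waits)  # 3 [10.0, 10.0]
```

If the CMS is unreachable for longer than the retry budget, for example during a deploy, every run that reaches the step during that window fails in the same way.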

timcosgrove commented 1 week ago

Sources of information about Content Release

Github Actions

The most direct source of information about the Content Release process is the Github Actions workflow runs: https://github.com/department-of-veterans-affairs/content-build/actions/workflows/content-release.yml

Each individual run is explorable and contains a fair amount of information, including:

Start time of the workflow
Time to complete
Timing information about individual steps (everything is timestamped if you dig into it)
Logs for all steps

However, a few things about the interface make this a source of information that needs some caution:

All times listed in the main overview are relative, i.e. "20 hours ago", "yesterday", "two weeks ago". This makes searching for detailed information about issues with content release a bit more difficult. The actual time can be found if you hover over the relative time, so the information is there; it is just hidden.
The listed duration is misleading because it includes the entire time a job has been waiting to run before it starts. Since we don't run multiple jobs at once, a job that gets triggered can sit there waiting for some time while a previous job finishes up. This waiting time gets added to the listed total, which can lead to misperceptions.
Getting more nuanced information requires some digging into individual workflow steps, turning on timestamps, etc.
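The queue-time problem described above can be separated out if a job's timestamps are available. The following is a minimal Python sketch; the field names `created_at`, `started_at`, and `completed_at` are assumed to mirror what the Github REST API returns for workflow jobs, and the sample values are made up.

```python
from datetime import datetime


def parse(timestamp):
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def split_duration(job):
    """Split a job's listed total into time spent queued and time spent running."""
    created = parse(job["created_at"])
    started = parse(job["started_at"])
    completed = parse(job["completed_at"])
    return {
        "queued": started - created,
        "running": completed - started,
        "listed_total": completed - created,
    }


# Made-up example: the job waited 40 minutes behind a previous release.
sample_job = {
    "created_at": "2024-10-11T18:00:00Z",
    "started_at": "2024-10-11T18:40:00Z",
    "completed_at": "2024-10-11T19:45:00Z",
}
durations = split_duration(sample_job)
print(durations["queued"], durations["running"], durations["listed_total"])
```

Reporting `running` rather than `listed_total` avoids attributing the previous release's duration to the next one.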

Datadog

We send a fair amount of information into Datadog. We do this in two ways:

Via custom metrics and events. The Content Release workflow calculates its own metrics around the timing of various Content Release steps and also sends events (primarily a success or failure event per workflow run).
Via the Datadog Actions Metrics Action, which sends information about all workflow runs in a particular repo into Datadog: https://github.com/int128/datadog-actions-metrics. This provides further timing information that can be taken advantage of.

Once the information is in Datadog, it can be pulled into visualizations.

Issues with visualization in Datadog:

Data has to get into Datadog in the first place. Occasionally the workflow runs fail before sending data into Datadog. This is unusual, but it does happen.
It can be hard to correlate all the data in Datadog into shapes that tell a story. Datadog is mostly intended for certain kinds of data in large volumes, and it is very opinionated about which ways are useful to display data. If what you are trying to examine does not fit the paradigms that Datadog has come up with for its visualizations, there is not really any way to harness the existing data and build something more custom with it.
Datadog access at VA is limited. It can be hard to get access to Datadog at all, and Datadog data is not visible to users who do not have Datadog access. This means that external users of the CMS who do not have Datadog access cannot see this data.

gracekretschmer-metrostar commented 1 week ago

Thanks, Tim! I am going to put next steps for this work on the agenda for Tuesday's PM <> PO sync, and I think your work here has been sufficient. I'm going to close the ticket.

timcosgrove commented 2 days ago

A POC dashboard is here: https://vagov.ddog-gov.com/dashboard/ihu-nuy-8dj/content-release-overview?fromUser=false&refresh_mode=sliding&from_ts=1729007437441&to_ts=1729612237441&live=true

This just tries to pull together the points asked for in the main part of the ticket in a clearer way.

department-of-veterans-affairs / va.gov-cms