Closed gracekretschmer-metrostar closed 1 week ago
It's worth noting that the list of failures shown in the GitHub Actions UI can be misleading. https://github.com/department-of-veterans-affairs/content-build/actions/workflows/content-release.yml?query=is%3Afailure
Failure, but quick recovery
Sometimes a failure is very quick. These two are failures, because the CMS failed to respond when its status was checked:
However, in context, these did not take very much time:
So, even though this displayed as two successive failures, it was in a failure state for 7 minutes, a little more than 10% of a single successful content release.
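The arithmetic behind that figure can be sketched as follows. Only the 7-minute failure window comes from the description above; the timestamps and the 60-minute release duration are illustrative assumptions, not values read from the actual runs.

```python
from datetime import datetime, timedelta


def failure_window(failed_runs):
    """Return the span from the first failed run's start to the last one's end."""
    start = min(r["started_at"] for r in failed_runs)
    end = max(r["completed_at"] for r in failed_runs)
    return end - start


# Hypothetical timestamps for two back-to-back failed runs (7 minutes in total).
failed_runs = [
    {"started_at": datetime(2024, 10, 10, 14, 0), "completed_at": datetime(2024, 10, 10, 14, 3)},
    {"started_at": datetime(2024, 10, 10, 14, 3), "completed_at": datetime(2024, 10, 10, 14, 7)},
]

# Assumed duration of one successful content release; replace with the observed value.
TYPICAL_RELEASE = timedelta(minutes=60)

window = failure_window(failed_runs)
share = window / TYPICAL_RELEASE  # dividing two timedeltas yields a float
print(f"Failure state lasted {window}, or {share:.0%} of one release")
```

Counting the window rather than the number of red runs is what turns "two successive failures" into "7 minutes in a failure state".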
Marked as failure in GitHub even though successful deploy
If any single process in the GitHub Actions workflow fails, the entire workflow is marked as a failure. This includes processes that take place after deployment is finished.
This workflow run deployed successfully. However, it failed during the 'notify success' step, which lets the CMS know that the workflow is finished: https://github.com/department-of-veterans-affairs/content-build/actions/runs/11297207768
This was otherwise a successful deploy, and Datadog recorded it as one; but unless you look carefully, it appears that the entire content release failed.
This workflow was marked as a failure in the deploy state, but it actually deployed successfully: https://github.com/department-of-veterans-affairs/content-build/actions/runs/11294862100
This one instead failed in the step where the CMS is notified. However, again, this was a successful content deploy.
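One way to separate "the workflow failed" from "the deploy failed" is to look at step-level conclusions instead of the run-level status. The sketch below works on a payload shaped like the response of GitHub's "list jobs for a workflow run" REST endpoint, where each job carries a `steps` list with `name` and `conclusion`. The step name `Deploy` and the sample payload are hypothetical; the real workflow's step names would need to be substituted.

```python
def classify_run(jobs_payload, deploy_step="Deploy"):
    """Classify a workflow run by what happened to its deploy step.

    Returns "deployed" if the deploy step succeeded and no step failed,
    "deployed_with_other_failure" if the deploy step succeeded but some
    other step failed, and "not_deployed" otherwise.
    """
    steps = [
        step
        for job in jobs_payload.get("jobs", [])
        for step in job.get("steps", [])
    ]
    deployed = any(
        s.get("name") == deploy_step and s.get("conclusion") == "success"
        for s in steps
    )
    any_failure = any(s.get("conclusion") == "failure" for s in steps)
    if not deployed:
        return "not_deployed"
    return "deployed_with_other_failure" if any_failure else "deployed"


# Hypothetical payload: the deploy succeeds, then the CMS notification fails.
sample = {
    "jobs": [
        {
            "name": "content-release",
            "steps": [
                {"name": "Build", "conclusion": "success"},
                {"name": "Deploy", "conclusion": "success"},
                {"name": "Notify success", "conclusion": "failure"},
            ],
        }
    ]
}

print(classify_run(sample))
```

A run classified as `deployed_with_other_failure` still shows as failed in the workflow list, which is the mismatch described above.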
Failures are often fast and repeat quickly
Sometimes there are failure states that do in fact require attention, but the way they manifest in GitHub Actions is alarming. On September 16, 2024, there was a failure caused by the configuration of the GitHub Actions runner and how it was attempting to install Node.js packages. There were 87 marked failures over a 2-hour period. Again, this was not good and needed attention. If the releases had been successful, there would have been only 2 workflow runs during that same period.
The visibility of failed workflows when they are failing quickly and repeatedly can create a perception of alarm when the severity may not be that high. We can talk about the same incident in three ways:
All these statements are true. I'm not saying this issue didn't need resolving, but I think the latter two statements represent the problem more accurately, while GitHub Actions emphasizes the first.
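One way to report an incident like this by duration instead of by the number of failed runs is to collapse each unbroken streak of failures into a single incident. The sketch below assumes run records simplified to a start time, an end time, and a conclusion; the sample data is hypothetical and is not the actual September 16 data.

```python
from datetime import datetime, timedelta
from itertools import groupby


def failure_incidents(runs):
    """Collapse consecutive failed runs into incidents.

    `runs` is a list of (started_at, completed_at, conclusion) tuples.
    Returns one dict per unbroken streak of failures.
    """
    ordered = sorted(runs, key=lambda r: r[0])
    incidents = []
    for conclusion, streak in groupby(ordered, key=lambda r: r[2]):
        if conclusion != "failure":
            continue
        streak = list(streak)
        incidents.append({
            "failed_runs": len(streak),
            "started_at": streak[0][0],
            # From the first failure's start to the last failure's end.
            "duration": streak[-1][1] - streak[0][0],
        })
    return incidents


# Hypothetical data: one success, five quick failures, then another success.
t0 = datetime(2024, 9, 16, 12, 0)
runs = [(t0, t0 + timedelta(minutes=60), "success")]
for i in range(5):
    start = t0 + timedelta(minutes=60 + 2 * i)
    runs.append((start, start + timedelta(minutes=2), "failure"))
runs.append((t0 + timedelta(minutes=70), t0 + timedelta(minutes=130), "success"))

incidents = failure_incidents(runs)
for incident in incidents:
    print(incident["failed_runs"], "failed runs in one incident lasting", incident["duration"])
```

Reported this way, the same streak reads as one incident with a duration, not as a long list of separate failures.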
CMS/GitHub Actions communication as a potential problem
The Content Release workflow has several touchpoints where the CMS and GitHub Actions communicate with each other. While these are necessary, they can also be a source of problems.
The most direct source of information about the Content Release process is the GitHub Actions workflow runs: https://github.com/department-of-veterans-affairs/content-build/actions/workflows/content-release.yml
Each individual run is explorable and contains a fair amount of information, including:
However, a few things about the interface mean this source of information should be read with some caution:
We send a fair amount of information into Datadog. We do this in two ways:
Once the information is in Datadog, it can be pulled into visualizations.
Issues with visualization in Datadog:
Thanks, Tim! I am going to bring next steps for this work to Tuesday's PM <> PO sync, and I think your work here has been sufficient. I'm going to close the ticket.
A POC dashboard is here: https://vagov.ddog-gov.com/dashboard/ihu-nuy-8dj/content-release-overview?fromUser=false&refresh_mode=sliding&from_ts=1729007437441&to_ts=1729612237441&live=true
This just tries to pull together the points asked for in the main part of the ticket in a clearer way.
User Story or Problem Statement
Before moving forward with creating a content build dashboard, the CMS team needs to understand what is technically feasible with content build data.
Description or Additional Context
Currently, there is a reputation that content build (both the release of content and the editor experience) is often down or incredibly slow. We don’t have a way to quickly show stakeholders the statistics around uptime, deployment times, and outages, so the reputation of our products is suffering. We do not want to focus on making content release faster in the wake of rolling out Next Build.
Currently, this data lives in Datadog and GitHub. We want to evaluate those sources.
Determine if the following questions can be answered with the data:
Acceptance Criteria