department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

2021-12-27 Content release issues #7410

Closed by timcosgrove 2 years ago

timcosgrove commented 2 years ago

Summary

Two separate issues interfered with content release on 2021-12-27. Both were eventually resolved and content release was restored.

GitHub Actions failure

GitHub Actions experienced degraded performance on 2021-12-27. Initially, this caused content releases to fail without being reported as failed. Later, a content release became 'stuck' in a running state. Because a running content release prevents subsequent releases from starting until it finishes, this effectively blocked content release.

Neil Hastings reached out to FE Tools oncall for help (https://dsva.slack.com/archives/C0MQ281DJ/p1640615811165900). FE Tools in turn reached out to GitHub for support (https://dsva.slack.com/archives/CU1E4CX9U/p1640623379271800). After some time, GitHub was able to identify the issue and resolve it. Further, the GitHub Actions team is going to add a tool that allows VA OIT Ops to cancel actions if this kind of thing happens again.

Platform CMS followup

Nothing should be necessary for the time being. The issue did not arise due to shortcomings on our end; our team identified and reported the issue through expected channels; and the issue was resolved by another team.

GI Bill Comparison Tool URL change

As part of normal work, the AFS Education team submitted a URL change for the GI Bill Comparison Tool. The work accounted for the URL change within the content-build and vets-website repos, but could not account for CMS content that links to the old URL. This produced a significant number of broken links, in excess of 10, which broke content release.

Platform CMS worked with the engineers who made the change to help restore content release. We reverted the changes, which took a significant amount of legwork due to the holidays and code freeze. We also worked out a plan with those engineers to make this transition safely, which will need to include the CMS team coordinating an update of CMS content that links to the old URL.

Potential followup

  1. There is no established process for non-CMS teams to alert CMS that URL changes are coming. It may be that CMS is not considered at all when planning a URL change like this. We may want to integrate awareness of and coordination with CMS into processes like redirect requests (see https://github.com/department-of-veterans-affairs/va.gov-team/issues/34480 for an example of a redirect request issue, built from a template; this may be a good place to add discussion with the CMS team).

  2. There is no easy way to identify all instances of a given URL within CMS content until that URL changes. Even if the process above were in place, there is currently no good way to identify all the CMS content that would need to be updated. We need this capability.

  3. Code freeze and code ownership rules created significant friction in getting updates approved. Both of the revert PRs that were created to roll back the URL change required approval from small teams of code owners. Because it is currently the holiday period, almost all of these team members were unavailable. The PRs apparently could not be merged without approval from the code owners, even by admins of the repos. It would be in our collective interest to create processes for dealing with this sort of situation. This could be:

    • expanding the number of members with code ownership
    • preventing merge to repos while code freeze is active
    • creating a process by which code ownership can be overridden

  4. The broken link checker does not have access to redirects. Our broken link checking mechanism checks whether a file exists that corresponds to a link, rather than whether that link resolves. If content moves from URL A to URL B, then even when a redirect is correctly created to forward a user from URL A to URL B, any content within content-build that still links to URL A is reported as a broken link. This is both unexpected and unproductive, and we should fix it.

  5. It may be unproductive that broken links break builds. Our support team is extremely responsive to broken link issues. It would probably be more productive to escalate significant numbers of broken links if they persist, rather than prevent build & deployment entirely.

  6. Non-production changes to content-build should not break production processes. As we are in code freeze, there was no intention of this change launching to production until the new year. The engineers working on this change reasonably did not expect it to cause an incident with production deploy processes. We should look at how the content-build process is used and see if we can remove coupling like that which was observed today.
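
To make items 2 and 4 more concrete, here is a minimal sketch in Python of what a URL lookup across CMS content and a redirect-aware broken link check might look like. It is not how content-build or the CMS currently works. The function names, the in-memory `content` mapping of content IDs to HTML, the `existing_paths` set, and the `redirects` mapping are all hypothetical stand-ins for whatever the real data sources would be.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_paths(html):
    """Return the normalized path of each site-relative link in the fragment."""
    parser = LinkExtractor()
    parser.feed(html)
    paths = []
    for href in parser.hrefs:
        parts = urlsplit(href)
        # Only consider internal links such as "/some/page"; skip external URLs.
        if not parts.netloc and parts.path.startswith("/"):
            paths.append(parts.path.rstrip("/") or "/")
    return paths


def find_content_linking_to(content, target_path):
    """Item 2 (hypothetical helper): IDs of content items that link to target_path."""
    target = target_path.rstrip("/") or "/"
    return sorted(cid for cid, html in content.items() if target in extract_paths(html))


def resolve(path, redirects, max_hops=10):
    """Follow the redirect map from path, stopping on a loop or after max_hops."""
    seen = set()
    while path in redirects and path not in seen and len(seen) < max_hops:
        seen.add(path)
        path = redirects[path]
    return path


def find_broken_links(content, existing_paths, redirects):
    """Item 4 (hypothetical helper): links whose final target, after redirects, is missing."""
    broken = []
    for cid, html in sorted(content.items()):
        for path in extract_paths(html):
            if resolve(path, redirects) not in existing_paths:
                broken.append((cid, path))
    return broken
```

With a check along these lines, a link to URL A would pass as long as the redirect map sends A to a URL B that exists, and `find_content_linking_to` could be run ahead of a planned URL change to list the content that needs updating. The real redirect data and CMS content would have to come from their actual sources, which this sketch does not model.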

timcosgrove commented 2 years ago

Time spent on resolving this was about 5-6 hours.