department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0
98 stars 69 forks source link

Discovery: How did vets-website end-to-end tests pass but React widgets site-wide were affected by footer outage #13730

Closed jilladams closed 1 year ago

jilladams commented 1 year ago

Description

User story

AS A FE engineer I WANT confidence that my changes are tested enough to confirm they will not break prod SO THAT I don't break prod.

Engineering notes / background

Post-mortem action item from: https://github.com/department-of-veterans-affairs/va.gov-team-sensitive/pull/975

In short: content-build changes failed to deploy before vets-website changes. The vets-website changes that deployed were isolated to the footer but somehow broke many, if not all, React widgets. CI tests passed for the push to prod. Why?

This will likely require coordination with the Platform team.

Analytics considerations

Quality / testing notes

Acceptance criteria

Timebox: up to 1 day

randimays commented 1 year ago

Discovery document and Slack thread where I posted it.

Questions and recommendations in the document. I suggest a conversation with the team to determine next steps / ticketing.

jilladams commented 1 year ago

Good work, Randi. Glad you got closure on this, honestly.

We can verify both of those with Wes in scrum tomorrow, I've added to the board.

Also: Reminder to move this doc to PW team folder until we get it into Github somewhere.

randimays commented 1 year ago

@jilladams Thank you. Gaurav asked me to send it over to him once it was documented in the team repo, so I thought I'd check what the final destination of this would be. Makes sense to just keep it here.

randimays commented 1 year ago

Why were React widgets and global header affected by the Drupalized footer deploy failure?

VW = vets-website CB = content-build

How does the footer data work?

First, createHeaderFooterData (in CB) pulls Drupal footer data with GraphQL. It runs this data through formatters for VW consumption.

Next, headerFooterData is added to local storage (window) under VetsGov.headerFooter; this only works for va.gov.

On va.gov, open the Dev console and enter window; the header & footer data exist in VetsGov.

On va.gov/health (TeamSites, i.e. injected header & footer), open the Dev console and enter window; the header & footer data don’t exist under VetsGov.

When CB is built, it creates this file in VW: build/localhost/generated/headerFooter.json. VW uses the contents to populate the injected header/footer.

A file with hard-coded header & footer data in VW (header-footer-data.json) is responsible for populating these sections when a developer is running only VW locally (see dev-template.ejs). In that situation, CB is not providing any Drupal assets.

How is the footer tested?

The formatter functions for createHeaderFooterData (in CB) are unit-tested.

In VW, unit tests cover the footer source code using header-footer-data.json as a data source. The prod footer does not get data from header-footer-data.json but instead from CB / Drupal. For the purposes of a unit test, the footer receives a data object from its parent and doesn’t know where it’s from.

The injected header & footer have a Cypress suite in VW in proxy-rewrite (responsible for header & footer injection) that tests the presence of a <header> or <footer> HTML element, but not the contents.

Question: Can the Cypress suite for the proxy-rewrite header & footer also test the contents?

How is the header tested?

Note: the header is also known as “megaMenu.”

There is a megaMenu Cypress suite with some skipped tests that is TestRail-integrated. It is not running in the Cypress CI. I asked Tze-chiu Lei about this. He did the TestRail integration for this file, but doesn’t know why the suite is not running in CI.

Question: Why isn’t megaMenu.cypress.spec.js running in the CI?

The authenticated megaMenu has unit tests.

There are also tests for components of the header, such as Veteran Crisis Line and Search. They each have all the data they need to render in their source code or from the parent component in VW. These tests won’t fail when Drupal data is missing – this would have to be caught in an end-to-end test where the JavaScript is halted in execution from an error down the stack.

How is the home page tested?

There is a Cypress suite that covers the new-home-page* functionality, but it is only run in local dev. va.gov/new-home-page and va.gov were not identical at the point the tests were written. If this suite had run in CI, it wouldn’t have caught our problem as it only tests /new-home-page.

Question: Can this suite be duplicated, modified to cover va.gov and enabled in CI?

A basic homepage smoke test exists. This is run in the CI.

* /new-home-page is used for staging a new homepage while leaving the current one intact. We launched the latest design in June 2023; as of July 28, 2023, va.gov/new-home-page and va.gov match.

Why did React widgets and the header fail when the footer data did not exist?

The footer’s data structure was changed from an array to an object when we added Drupal data, and VW was changed to consume the data from CB in this structure.

When the CB data did not deploy to prod, VW received the old array format and this broke the new VW footer.

Why did the content-build deploy fail?

Here’s the Slack thread where the incident discussion took place.

A CI issue related to AWS credentials stopped the dev, staging and prod deploys. The Notify CMS part of the pipeline (to alert DataDog) also failed.

We failed to look at the Front End Support Dashboard for the deploy status. It’s not possible to know at this point how the changes were successfully verified in staging.

Theory: without the FE Support Dashboard to notice CB deploy issues, we did not wait long enough for VW changes to hit staging. In that case, we would have verified the pre-existing functionality in staging. We were about 1.5 hours from prod deploy when staging was validated and many others may have been on lunch. We did not see any communications about staging being broken.

What happened in prod when the vets-website changes went in without content-build?

The homepage rendered a blank header and no footer at all, and React widgets did not load (they were stuck on a spinner – more on this further in this doc).

Screenshot 2023-07-28 at 12 42 37 PM Screenshot 2023-07-28 at 12 42 42 PM

Why did the header render empty, but the footer didn’t show up at all?

CB creates the <header> element and adds <div>s for header elements such as banners and Veterans Crisis Line. It creates a div for the header to be populated by VW (via ID targeting); even if there are problems on the VW side, the <header> still exists partially.

The <footer> is also created in CB, but its sole contents are the empty <div> that VW should populate. If there’s nothing for VW to add (because the JS broke), it renders an empty element.

Screenshot 2023-07-28 at 1 04 05 PM

Why did the footer issues break the header and React widgets?

The footer setup code is run before the header setup code, and before React apps are initialized. JavaScript errors in the footer setup will stop execution of anything after it.

All React apps in VW use a startApp platform utility. See facility-locator usage here. Within the platform utility, the first step before starting an app is to set up common functionality across all React apps (like populating the global header & footer). If any of that setup errors out, the apps aren’t started.

What does it look like on the front end when the React app isn’t started?

This is the screen for facility locator. The loading indicator never resolves. At the time of the prod outage, this was happening for every React application in VW, in addition to the header & footer not rendering correctly.

Note: the footer renders a blue background here, but it’s not immediately apparent why. The <section> and <footer> elements have a 0px height (and there are still no footer contents), but when these elements are removed from the DOM, the blue background color goes away.

Screenshot 2023-07-28 at 11 09 10 AM

What about va.gov pages with mostly Drupal content?

Any pages that exist in CB alone were unaffected.

Pages with hybrid Drupal and React content had experiences similar to the below screenshot. Static (Drupal) content renders from CB. But VW (React) content injected into the page (such as a login widget) showed a spinner similar to the facility locator page in the above screenshot.

Screenshot 2023-07-31 at 8 36 53 AM

What happened with the injected header & footer?

Similar to va.gov, va.gov/health had a blank header and no footer. Public Websites does not own any code in TeamSites, so it is difficult to know if any functionality within the site was not working. TeamSites have many pointers to va.gov in an effort to deliver consistent information to veterans and slowly migrate to our systems. Some of their content is CMS-driven and would have been unaffected by the React widget problems on the va.gov side.

Screenshot 2023-07-31 at 8 32 29 AM Screenshot 2023-07-31 at 8 32 48 AM

How did the prod incident get resolved?

We reverted all of the related VW changes to use the pre-existing footer logic. Ryan Koch suggested we do all data transformation in CB and pass it to VW in the format it needed prior to any of this work.

This worked well and we achieved the desired outcome.

CB pull request

Next steps

There are a few Questions in the findings above that could drive discovery and/or implementation tickets.

It seems the most important question to answer is:

Do we want errors in the header or footer to halt execution of va.gov pages?

Ideally, our unit tests and Cypress tests should stop such code from ever reaching main in VW. But if any JavaScript fails, it should not hold up the execution of JS after it unless that code is dependent on the data or related in another way to the failure.

Either way, we need more robust unit and Cypress tests to cover the header, footer and home page functionality and to further consider how our setup in general can be optimized.

Team feedback

This Slack thread captures the discussion and feedback from the wider team regarding next steps.

jilladams commented 1 year ago

Thanks Randi! In today's product sync, DaveC confirmed he'd like to be present to listen in on the discussion, so next step is @wesrowe will help schedule that chat.

wesrowe commented 1 year ago

@jilladams, can we close this ticket prior to the meeting, or is it carrying over? I can go either way.

jilladams commented 1 year ago

Next steps are recommended, I feel ok closing this. I own scheduling the review, which I'll work on by tomorrow EOD.