department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Forms: 102850c download failures investigation #10096

Closed jilladams closed 2 years ago

jilladams commented 2 years ago

Describe the defect

https://dsva.slack.com/archives/C52CL1PKQ/p1659976483935999

Form 102850c was almost the sole source of download failures (as seen in Google Analytics) over the weekend: 48 on Sat and 58 yesterday. On Fri it started to take the lead, but only by a nose, and it wasn't a leader on Thurs.

image (8)

dsasser commented 2 years ago

Result

Issue caused by vets-api (Lighthouse) latency and outage incident. This coincides with reported outages within AWS around this time as well (correlation, not necessarily causation).

Troubleshooting

I visited the offending page on va.gov and clicked the PDF download link. After more than 10 seconds, I received the following modal popup:

Screen Shot 2022-08-08 at 9 37 19 AM

Clicking the 'Download VA Form 10-2850C' link caused the following alert to appear:

Screen Shot 2022-08-08 at 9 37 30 AM

The link to 'email the form managers' was to:

mailto:VaFormsManagers@va.gov?subject=Bad%20PDF%20link&body=I%20tried%20to%20download%20this%20form%20but%20the%20link%20doesn't%20work%3A%20https%3A%2F%2Fwww.va.gov%2Fvaforms%2Fmedical%2Fpdf%2Fvha-10-2850c-fill%2520(1).pdf
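The PDF URL inside that mailto link is percent-encoded twice (once as a URL, once more as part of the mail body), which makes the filename hard to read. A minimal sketch of decoding it with Python's standard library follows; the helper name `pdf_url_from_mailto` is made up for illustration and is not part of any VA code.

```python
from urllib.parse import parse_qs, unquote

# The mailto link exactly as quoted above, split only for line length.
MAILTO = (
    "mailto:VaFormsManagers@va.gov?subject=Bad%20PDF%20link&body=I%20tried%20to"
    "%20download%20this%20form%20but%20the%20link%20doesn't%20work%3A%20"
    "https%3A%2F%2Fwww.va.gov%2Fvaforms%2Fmedical%2Fpdf%2F"
    "vha-10-2850c-fill%2520(1).pdf"
)


def pdf_url_from_mailto(mailto: str) -> str:
    """Return the PDF URL embedded in the mailto body (hypothetical helper)."""
    query = mailto.split("?", 1)[1]      # everything after the address
    body = parse_qs(query)["body"][0]    # first round of percent-decoding
    return body[body.index("https://"):]  # the URL is the tail of the sentence


pdf_url = pdf_url_from_mailto(MAILTO)
# A second round of decoding recovers the filename as stored on the server.
filename = unquote(pdf_url.rsplit("/", 1)[1])

print(pdf_url)
print(filename)
```

The decoded filename ends in ` (1).pdf`, a detail that is easy to miss in the encoded link.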

The console was logging two errors:

CORS Error

Access to fetch at 'https://api.va.gov/v0/forms?query=10-2850C' from origin 'https://www.va.gov' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

I suspect the CORS error was really a symptom of the larger API/network problems. A response without the expected CORS headers will cause this behavior, and some errors coming back from the API may not contain CORS headers.

Network Error

The network error was not captured in the browser at that time, but a subsequent curl command run from the terminal against the same endpoint returned a 502 Bad Gateway:

▶ curl -I https://api.va.gov/v0/forms\?query\=10-2850C
HTTP/1.1 502 Bad Gateway
Date: Mon, 08 Aug 2022 16:40:23 GMT
Content-Type: application/json; charset=utf-8
Connection: keep-alive
Referrer-Policy: strict-origin-when-cross-origin
Vary: Origin
X-Content-Type-Options: nosniff
X-Download-Options: noopen
X-Frame-Options: SAMEORIGIN
X-Git-SHA: 27c09b82bf6aecb12924bb041b16e41e7323adfa
X-GitHub-Repository: https://github.com/department-of-veterans-affairs/vets-api
X-Permitted-Cross-Domain-Policies: none
X-Request-Id: fe4ae8f1-8131-4f5c-a55a-3306ba6b4528
X-Runtime: 0.132098
X-XSS-Protection: 1; mode=block
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Set-Cookie: TS01f27c67=0119a2687fc58ea7e6fa2c72712d44a5437365e15fe5d1b5255af75fd30161342c27f0057c250e448b1429e47c30f14e5605096bac; Max-Age=900; Path=/
Transfer-Encoding: chunked
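To connect the two console errors, here is a small sketch that parses a `curl -I` style header dump and checks for the `Access-Control-Allow-Origin` header the browser was complaining about. The sample is a shortened copy of the response above, and the function names are illustrative assumptions. One caveat: the curl request did not send an `Origin` header, and some servers only emit CORS headers when one is present, so the absence here is suggestive rather than conclusive.

```python
# Shortened copy of the header dump above (first six lines only).
RAW = """\
HTTP/1.1 502 Bad Gateway
Date: Mon, 08 Aug 2022 16:40:23 GMT
Content-Type: application/json; charset=utf-8
Connection: keep-alive
Referrer-Policy: strict-origin-when-cross-origin
Vary: Origin
"""


def parse_head_response(raw: str) -> tuple[int, dict[str, str]]:
    """Split a raw HEAD response into (status code, lower-cased headers)."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    status = int(lines[0].split()[1])  # "HTTP/1.1 502 Bad Gateway" -> 502
    headers = {}
    for ln in lines[1:]:
        # Partition on the first colon only, so values like dates stay intact.
        name, _, value = ln.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def missing_cors_header(headers: dict[str, str]) -> bool:
    """True when the response carries no Access-Control-Allow-Origin header."""
    return "access-control-allow-origin" not in headers


status, headers = parse_head_response(RAW)
print(status, missing_cors_header(headers))
```

A cross-origin `fetch` that receives a response without that header is reported by the browser as a CORS failure regardless of the status code, which is consistent with the 502 being the underlying problem.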

Further probing of this and other Forms endpoints, including the Forms Search page, revealed additional API symptoms:

Screen Shot 2022-08-08 at 9 50 21 AM

At this point it was clear we were dealing with a larger Lighthouse outage. Steve Wirt created a Platform ticket to track the issue.

jilladams commented 2 years ago

@wesrowe if you wanna close this after you pass on relevant details to forms managers, feel free.

wesrowe commented 2 years ago

@jilladams, I think it depends on scope. Today's issue seems to have begun late last week. If we're looking at this issue as just about today, we can close it. If we're concerned about the weekend that led up to today... I'm hoping for Lighthouse logs that would show the Forms API issue began several days ago.

jilladams commented 2 years ago

Daniel / Wes to pair on looking at the flagged content info and info on this form. Goal: see what sort of logs / data we need, in order to understand where failures are happening, and go ask the teams who own those systems. (Some of this may fall in a new ticket.)

jilladams commented 2 years ago

We identified that this form was broken by a changed filename: "(1)" was added to it on Fri 8/5. This broke existing links in the front end during the window when the filename had changed but content-build had not yet been released. After the filename change, content-build did not run over the weekend, and it failed several times on Monday before succeeding. The successful content-build resolved the issue.
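The failure mode described here (links baked into a static build going stale when a file is renamed at the source) can be illustrated with a small comparison between the last build's links and the current file URLs. This is only a sketch: the function name, the form ids, and the `example.test` URLs are all hypothetical and do not reflect how content-build or the CMS actually store this data.

```python
def stale_links(
    built_links: dict[str, str], current_files: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Map form id -> (URL in last build, current URL) where the two differ."""
    return {
        form: (built_links[form], current)
        for form, current in current_files.items()
        if form in built_links and built_links[form] != current
    }


# Hypothetical snapshot of links baked into the last successful build.
built = {
    "form-a": "https://example.test/pdf/form-a.pdf",
    "form-b": "https://example.test/pdf/form-b.pdf",
}
# Hypothetical current state: form-a's file was re-uploaded with " (1)" appended.
current = {
    "form-a": "https://example.test/pdf/form-a%20(1).pdf",
    "form-b": "https://example.test/pdf/form-b.pdf",
}

drift = stale_links(built, current)
print(drift)
```

Any entry in the result stays broken for visitors until the next successful build picks up the new URL.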

Content-build failure notification is an open ticket, https://github.com/department-of-veterans-affairs/va.gov-cms/issues/9963

The file name issue will get handled with Janel Keyes via email; @wesrowe to chase that. Closing.