Closed jhouse-solvd closed 3 years ago
Graphic that shows what the site looked like from the user perspective:
fwiw google search console also supports broken link (asset) checking https://support.google.com/webmasters/answer/9128668?hl=en
AWS CloudWatch Synthetics uses puppeteer under the hood. The built-in link checker is probably not what we'd like to use since it uses document.getElementsByTagName('a') and in this case we're more interested in 'script' and 'link' tags. Using puppeteer, however, we should check that these have been loaded as well as check for console errors. In the browser we can easily do the following:
Also, the solution I suggest above is more generic than just looking for broken links. @jhouse-solvd possibly rename this ticket to something like "detect basic errors in va.gov" or something similar.
Looks like datadog also provides similar functionality: https://docs.datadoghq.com/getting_started/synthetics/browser_test/
So it looks like when the site was broken, as shown in this first image, the console does indeed show errors loading the files. This seems to indicate that checking for errors in the console can catch this type of error.
The term "broken links and assets" sounds pretty good. To clarify this definition, it would be a 200 success response code for the va.gov homepage:
Revising https://github.com/department-of-veterans-affairs/va.gov-team/issues/19843#issuecomment-780167509, seems to me that if we just check in the headless browser for errors it'll catch missing assets as well as all javascript errors. For html links issues, we have the build check these site wide.
@drorva checking for all browser console errors is probably a valid approach, just to clarify those may not be javascript errors if JS assets are never loaded
also fysa there is a CSP "report only" message in console which should not be considered an error
[Report Only] Refused to connect to 'https://stats.g.doubleclick.net/j/collect?t=dc&aip=1&_r=3&v=1&_v=j89&tid=UA-50123418-16&cid=876194336.1589298734&jid=23633209&gjid=455303181&_gid=1551668648.1617380296&_u=SACAAUABAAAAAC~&z=722159970' because it violates the following Content Security Policy directive: "connect-src 'self' http://localhost:4000 https://*.va.gov https://api.mapbox.com https://www.google-analytics.com http://*.vetsgov-internal https://prod-va-gov-assets.s3-us-gov-west-1.amazonaws.com https://prod-va-gov-maintenance-windows.s3-us-gov-west-1.amazonaws.com https://analytics.foresee.com https://brain.foresee.com https://survey.foreseeresults.com https://device.4seeresults.com https://health.foresee.com https://gateway.foresee.com https://feedback.digital-cloud-gov.voice.medallia.com https://raw.githubusercontent.com wss://northamerica.directline.botframework.com https://northamerica.directline.botframework.com https://search.usa.gov ".
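A minimal sketch of how notices like the one above could be excluded if the canary ends up alerting on every console error. The helper names and the prefix match are assumptions for illustration, not part of the actual canary script, and the sample messages are shortened examples.

```javascript
// Hypothetical helpers: the names and the prefix check are assumptions.
const REPORT_ONLY_PREFIX = "[Report Only]";

// True for CSP report-only notices, which the browser logs but does not enforce.
function isCspReportOnly(text) {
  return text.trimStart().startsWith(REPORT_ONLY_PREFIX);
}

// Keeps only the console messages that should count as real errors.
function actionableErrors(consoleMessages) {
  return consoleMessages.filter((text) => !isCspReportOnly(text));
}

// Shortened example messages, not real log output.
const sampleMessages = [
  "[Report Only] Refused to connect to 'https://stats.g.doubleclick.net/j/collect' because it violates the following Content Security Policy directive.",
  "Failed to load resource: the server responded with a status of 404 (Not Found)",
];

console.log(actionableErrors(sampleMessages));
```

Matching on the prefix keeps the rule independent of which directive or URL the notice mentions.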
Update: I have created a Puppeteer script that hits the VA staging homepage and logs all console errors. Once the page has finished loading, a filter checks for build errors - the type seen in the screenshot above, i.e. 'Failed to load resource' - and if this error is present, returns true to notify the user of a critical error on the site. Here is a link to the gist: https://gist.github.com/rjohnson2011/483ce2dc7081a3814c0679457e8f68d3
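For readers who don't want to open the gist, here is a rough sketch of the approach described above. It is not the gist itself: the function names, the lazy `require`, and the `networkidle0` wait condition are assumptions, and it needs the `puppeteer` package installed to actually load a page.

```javascript
// Text the canary alerts on, per the description above.
const CRITICAL_TEXT = "Failed to load resource";

// Pure check: which collected console errors indicate a missing asset?
function findCriticalErrors(consoleErrors) {
  return consoleErrors.filter((text) => text.includes(CRITICAL_TEXT));
}

// Loads the page, collects console errors, and resolves to true when a
// critical error was logged. The URL is supplied by the caller.
async function pageHasCriticalError(url) {
  // Loaded lazily so the pure helper above has no dependencies.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const consoleErrors = [];
    page.on("console", (msg) => {
      if (msg.type() === "error") consoleErrors.push(msg.text());
    });
    // Assumption: wait for the network to go idle before inspecting errors.
    await page.goto(url, { waitUntil: "networkidle0" });
    return findCriticalErrors(consoleErrors).length > 0;
  } finally {
    await browser.close();
  }
}
```

Usage would look something like `pageHasCriticalError("https://staging.va.gov/").then(console.log)`.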
This was passed on to @omgitsbillryan on 4/22 and he is actively working on implementing this script to AWS Canary. https://github.com/department-of-veterans-affairs/devops/pull/9050
5/3 Update: Synced with @omgitsbillryan on 4/30 to go over Puppeteer build script deployment to AWS. The script is running successfully on AWS and triggering alerts if a build error is caught in the console.
PR: https://github.com/department-of-veterans-affairs/devops/pull/9050
@mchelen - Would be great to get your input on this. Please see the recent notes above. Do you have access to view the console and/or relevant alerts?
@rjohnson2011 @jhouse-solvd I have a couple of questions. Are these synthetic events running against the live site VA.gov? There are some errors logged in Sentry (5k now) http://sentry.vfs.va.gov/organizations/vsp/issues/29941/ and looking at the logs looks like some type of automation (using HtmlUnit library). Just wanted to verify this isn't us.
Looping in @omgitsbillryan to the convo. When Bill and I last spoke we discussed running the script once every 15 minutes. If the errors are coming in at that frequency it's probably us, otherwise not.
It is being run against staging.va.gov. It is now using the AWS Synthetics library to trigger the automated testing tool. HtmlUnit looks to be something different (a Java-based headless tool), though I'm not 100% sure what AWS Synthetics uses under the hood.
Thanks Ryan! The events we were seeing in Sentry are mostly hitting the live site, not staging so much. I will confirm the frequency of events in the logs. I did double check the AWS Synthetics library and it looks like it's using Selenium under the hood.
We're currently seeing a lot of instability & unreliability -
Most of the errors (~90% I'd guess) come in the final step of logging in:
I chatted with @empireofryan / @rjohnson2011 and it doesn't seem like we need to be logging in through ID.me in order to check for resources not loading on the page. I'm going to rip out all of that logic and focus only on assets loading.
Side note - I think a canary that runs through login could be useful.
@omgitsbillryan - can you drop your notes with updates and recent findings?
@omgitsbillryan yeah i think the scope of this ticket is to ensure www.va.gov homepage is loading successfully without error, and monitoring of login function might be good to spin off as a followup
@jhouse-solvd anything I can do to help wrap this up?
@omgitsbillryan - before we close this, can we post a screenshot of the alert functioning as expected, and:
to discuss: what would it take to simulate the outage that originally caused this in DEV?
cc: @drorva
An actual alert in slack from PagerDuty - link.
The alert could be simulated by modifying the JavaScript in the canary. Something like:
// Collect only the console errors that indicate a missing asset.
consoleAlerts.forEach((alert) => {
  if (alert.includes("Failed to load resource")) {
    criticalAlerts.push(alert);
  }
});
/* Add this line */
criticalAlerts.push("Fake javascript error for testing");
// Any critical alert fails the canary run, which triggers the alarm.
if (criticalAlerts.length > 0) {
  throw new Error("Critical page errors discovered!\n- " + criticalAlerts.join("\n- "));
}
Currently the AWS alarm was created manually by me. In the future, we'd like for this alarm to be created & managed in terraform - presumably by the canary module we created.
The canary is currently watching the staging environment, which worked out well since it caught the issue before it made its way into prod - #vfs-platform-support slack link.
@omgitsbillryan this is so great to see! such a perfect example of monitoring catching critical issues.
@jhouse-solvd is there a followup ticket to get these 2 initial monitors (this and https://github.com/department-of-veterans-affairs/va.gov-team/issues/17631) into terraform?
I've been using https://github.com/department-of-veterans-affairs/va.gov-team/issues/17631 as a dumping ground for all work I've done in this space - probably not ideal 😬 .
All 3 canaries we currently have are managed (mostly) by terraform. The only part that isn't is the AWS Alarm, which is still created manually. I wrote up a ticket to have alarm creation managed by terraform as well - link.
@omgitsbillryan ah ok got it, thanks for explaining!
@omgitsbillryan @jhouse-solvd not seeing any progress on this for the last week either in here or the linked tickets. Can you please update this ticket as you make progress?
I think we can call this complete. I read through and checked off all the boxes in the OP. There's still more work to be done in this space that we could/should do, such as:
prod (right now it only looks at staging)
@omgitsbillryan we probably need this monitor on prod as well. Chances are that any problem would occur on staging first, but better safe than sorry.
We for sure need to have this monitoring prod before we call it done. The cloudwatch alarm into terraform can wait till we create additional canaries.
@drorva - we've added an additional task and clarified the definition of done to make sure this is working for prod.
@omgitsbillryan - please see updated ticket and comment here when this is working. thank you!
@omgitsbillryan @jhouse-solvd probably this needs to not alert on [Report Only] console messages like:
[Report Only] Refused to load the image 'https://s3-us-gov-west-1.amazonaws.com/content.www.va.gov/img/design/icons/apple-touch-icon.png' because it violates the following Content Security Policy directive: "img-src 'self' data: blob: https://*.gstatic.com https://api.mapbox.com https://www.google-analytics.com https://www.googletagmanager.com https://stats.g.doubleclick.net https://*.va.gov https://optimize.google.com https://gateway.foresee.com https://static.foresee.com https://cdn-prod.kampyle.com https://prod-va-gov-assets.s3-us-gov-west-1.amazonaws.com https://ok6static.oktacdn.com https://dvp-oauth-application-directory-logos.s3-us-gov-west-1.amazonaws.com".
@mchelen the script already does not alert on that error message. We only alert on errors containing the text "Failed to load resource".
We've sporadically been seeing a lot of these errors that trigger a false positive :
CONSOLE ERROR Failed to load resource: net::ERR_CONNECTION_RESET
I've added a PR to exclude them. With fewer false positives, it will allow us to be more aggressive with alerting thresholds.
As a side note, we may want to investigate the root of these net::ERR_CONNECTION_RESET errors - I think it may be some slight misconfiguration of our nginx revproxy, but AFAICT it's benign from the perspective of users of va.gov.
@omgitsbillryan well, that's not reassuring. @meganhkelley can you please collaborate with @omgitsbillryan and create a ticket to investigate how often and why we're getting these errors. I'm not sure which of the two teams should own this, so maybe you and @jhouse-solvd can collaborate on which team should own this.
Hey @omgitsbillryan @jhouse-solvd ! Could y'all jot down some steps to reproduce the error in question, and FE can take a look?
CONSOLE ERROR Failed to load resource: net::ERR_CONNECTION_RESET
@omgitsbillryan - that error could be seen in the developer console sporadically, correct?
That's correct that it's sporadic. I did some extra logging and found that this error always presents itself in conjunction w/
REQUEST FAILED net::ERR_CONNECTION_RESET https://resource.digital.voice.va.gov/wdcvoice/5/onsite/embed.js
Even when this error occurs, I can confirm by looking at canary run screenshots, that the page is still visibly loading properly.
@omgitsbillryan that's helpful. Looks like this is Medallia analytics. So let's change the ignore to specifically ignore this URL, rather than ERR_CONNECTION_RESET.
@drorva - would it be okay for us to create that as a separate ticket and plan for an upcoming sprint?
ie. "Ignore Medallia analytics errors in Synthetic Monitoring script" or something to that effect?
I'm sensitive to scope creep and want to make sure that we prioritize accordingly.
I don't think so. @omgitsbillryan is already ignoring ERR_CONNECTION_RESET, so it's just a question of making it a bit more specific. If it's more complicated than that, then yes, we can create a different ticket.
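A minimal sketch of what "more specific" could look like: ignore failures for the one known URL instead of every ERR_CONNECTION_RESET. The list name, the helper names, and the `{ url, errorText }` record shape are assumptions; in a Puppeteer script such records could be built in a "requestfailed" handler from `request.url()` and `request.failure().errorText`. The second sample URL is made up for illustration.

```javascript
// URL taken from the log line quoted earlier in this thread.
const IGNORED_REQUEST_URLS = [
  "https://resource.digital.voice.va.gov/wdcvoice/5/onsite/embed.js",
];

// True when the failed request is one we have decided is benign.
function isIgnoredFailure(failure) {
  return IGNORED_REQUEST_URLS.includes(failure.url);
}

// Keeps the failures that should still trigger an alert.
function criticalFailures(failures) {
  return failures.filter((failure) => !isIgnoredFailure(failure));
}

const sampleFailures = [
  {
    url: "https://resource.digital.voice.va.gov/wdcvoice/5/onsite/embed.js",
    errorText: "net::ERR_CONNECTION_RESET",
  },
  {
    // Hypothetical asset URL, used only to show a failure that is kept.
    url: "https://staging.va.gov/generated/example.js",
    errorText: "net::ERR_CONNECTION_RESET",
  },
];

console.log(criticalFailures(sampleFailures));
```

With this shape, a connection reset on any other asset still counts as critical, which is the point of narrowing the ignore rule.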
@omgitsbillryan - as time allows, I've updated the AC based on Dror's comment above. Let us know if there are any questions or concerns.
@mchelen - do you mind mentioning/linking any PR that you might be working on? @omgitsbillryan mentioned that you and @empireofryan might be collaborating on a pr and just want to make sure we have it all tied together through this ticket.
This task is done. If there is additional work to be done for the monitors that have been created, we will define those in new tasks.
Closing.
Description
In light of a recent outage that resulted in the VA.Gov website rendering incorrectly and displaying a confusing message for users ('your browser is out of date'; see attachments), we need to explore and implement advanced monitors and corresponding alerts that can detect broken links and assets and notify on-call personnel accordingly.
Background/context/resources
#19824 VA.gov 2/11/21 Site Outage - Post Mortem
Thread started by @drorva in DSVA Slack workspace "#platform-team" channel
Technical notes
We may be able to accomplish this using synthetic monitoring in CloudWatch. Some interesting capabilities here: https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-cloudwatch-synthetics-supports-enhanced-monitoring-broken-link-gui-workflow-blueprints/
Tasks
REQUEST FAILED net::ERR_CONNECTION_RESET https://resource.digital.voice.va.gov/wdcvoice/5/onsite/embed.js
Definition of Done
[x] Monitors and alerts are in place for PROD that catch broken links and assets and notify on-call personnel
Reminders