Monitoring / Alerting - Synthetic monitoring for VA.gov homepage

jhouse-solvd commented 3 years ago

Description

In light of a recent outage that resulted in the VA.gov website rendering incorrectly and displaying a confusing message to users ('your browser is out of date'; see attachments), we need to explore and implement advanced monitors and corresponding alerts that can detect broken links and assets and notify on-call personnel accordingly.

Background/context/resources

#19824 VA.gov 2/11/21 Site Outage - Post Mortem
Thread started by @drorva in DSVA Slack workspace "#platform-team" channel

Technical notes

We may be able to accomplish this using synthetic monitoring in CloudWatch. Some interesting capabilities here: https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-cloudwatch-synthetics-supports-enhanced-monitoring-broken-link-gui-workflow-blueprints/

Tasks

[x] Explore monitoring options for detection of broken links and assets
[x] Implement monitors for broken links and assets
[x] Implement corresponding alerts / notifications for these monitors
[x] Implement the canary to monitor and alert on prod va.gov
[x] Specifically, ignore the following: REQUEST FAILED net::ERR_CONNECTION_RESET https://resource.digital.voice.va.gov/wdcvoice/5/onsite/embed.js

Definition of Done

[x] Monitors and alerts are in place for PROD that catch broken links and assets and notify on-call personnel

Reminders
[X] Please attach your team label and any other appropriate label(s)
[X] Please attach the needs grooming tag if needed
[X] Please connect to an epic

jhouse-solvd commented 3 years ago

Graphic that shows what the site looked like from the user perspective:

mchelen-gov commented 3 years ago

fwiw google search console also supports broken link (asset) checking https://support.google.com/webmasters/answer/9128668?hl=en

drorva commented 3 years ago

AWS CloudWatch Synthetics uses Puppeteer under the hood. The built-in link checker is probably not what we'd like to use since it uses

document.getElementsByTagName('a')

and in this case we're more interested in 'script' and 'link' tags. Using Puppeteer, however, we should check that these have been loaded, as well as check for console errors. In the browser we can easily do the following:

open the developer tools
click on the network tab
check that all the requests return 2xx or 3xx (and not 4xx or 5xx, which are errors)

Need to analyze how to replicate the action in Puppeteer.
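
A minimal sketch of how the Network-tab check described above might be replicated in Puppeteer. It is not the script that was eventually deployed; the names `collectAssetFailures` and `WATCHED_TYPES` are hypothetical, and the choice of which resource types to watch is an assumption. It relies only on the Puppeteer page events `response`, `requestfailed`, and `console`.

```javascript
// Sketch only: collectAssetFailures and WATCHED_TYPES are hypothetical names,
// not part of the canary discussed in this thread.
const WATCHED_TYPES = new Set(["script", "stylesheet", "image"]);

// Attach listeners to a Puppeteer Page (or any object exposing the same
// on(event, handler) shape) and return the array that failures are pushed into.
function collectAssetFailures(page) {
  const failures = [];

  // Fires for every response; 4xx and 5xx statuses are treated as errors.
  page.on("response", (response) => {
    const type = response.request().resourceType();
    if (WATCHED_TYPES.has(type) && response.status() >= 400) {
      failures.push({ url: response.url(), type, reason: "HTTP " + response.status() });
    }
  });

  // Fires when a request never receives a response (connection reset, DNS failure, ...).
  page.on("requestfailed", (request) => {
    const type = request.resourceType();
    if (WATCHED_TYPES.has(type)) {
      const failure = request.failure();
      failures.push({ url: request.url(), type, reason: failure ? failure.errorText : "unknown" });
    }
  });

  // Fires for console output; keep only messages logged at error level.
  page.on("console", (msg) => {
    if (msg.type() === "error") {
      failures.push({ url: null, type: "console", reason: msg.text() });
    }
  });

  return failures;
}
```

With a real Puppeteer page, the listeners would be attached before navigation (for example before `page.goto(url, { waitUntil: "networkidle0" })`), and the returned array inspected once the page has finished loading.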

drorva commented 3 years ago

Also, the solution I suggest above is more generic than just looking for broken links. @jhouse-solvd possibly rename this ticket to something like "detect basic errors in va.gov" or something similar.

drorva commented 3 years ago

Looks like datadog also provides similar functionality: https://docs.datadoghq.com/getting_started/synthetics/browser_test/

drorva commented 3 years ago

So it looks like when the site was broken as is shown in this first image, the console does indeed show errors loading the files. This seems to indicate that checking for errors in the console can catch this type of error.

Screenshot from 2021-02-25 13-41-39

Screenshot from 2021-02-25 13-39-06

mchelen-gov commented 3 years ago

The term "broken links and assets" sounds pretty good. To clarify this definition, it would mean a 200 success response code for each of the following on the va.gov homepage:

css assets
js assets
image assets
html links

drorva commented 3 years ago

Revising https://github.com/department-of-veterans-affairs/va.gov-team/issues/19843#issuecomment-780167509, seems to me that if we just check in the headless browser for errors it'll catch missing assets as well as all javascript errors. For html links issues, we have the build check these site wide.

mchelen-gov commented 3 years ago

@drorva checking for all browser console errors is probably a valid approach, just to clarify those may not be javascript errors if JS assets are never loaded

also fysa there is a CSP "report only" message in console which should not be considered an error

[Report Only] Refused to connect to 'https://stats.g.doubleclick.net/j/collect?t=dc&aip=1&_r=3&v=1&_v=j89&tid=UA-50123418-16&cid=876194336.1589298734&jid=23633209&gjid=455303181&_gid=1551668648.1617380296&_u=SACAAUABAAAAAC~&z=722159970' because it violates the following Content Security Policy directive: "connect-src 'self' http://localhost:4000 https://*.va.gov https://api.mapbox.com https://www.google-analytics.com http://*.vetsgov-internal https://prod-va-gov-assets.s3-us-gov-west-1.amazonaws.com https://prod-va-gov-maintenance-windows.s3-us-gov-west-1.amazonaws.com https://analytics.foresee.com https://brain.foresee.com https://survey.foreseeresults.com https://device.4seeresults.com https://health.foresee.com https://gateway.foresee.com https://feedback.digital-cloud-gov.voice.medallia.com https://raw.githubusercontent.com wss://northamerica.directline.botframework.com https://northamerica.directline.botframework.com https://search.usa.gov ".
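
A small sketch of how report-only CSP messages like the one above could be excluded before alerting. This is illustrative only: `REPORT_ONLY_PREFIX` and `isActionableConsoleError` are hypothetical names, the first sample string is an abbreviated copy of the message above, the second is a made-up example, and matching on the text prefix is an assumption about how the messages arrive.

```javascript
// Sketch only: REPORT_ONLY_PREFIX and isActionableConsoleError are
// hypothetical names chosen for illustration.
const REPORT_ONLY_PREFIX = "[Report Only]";

// Returns true when a console error text should count toward an alert.
// Report-only CSP violations are logged but not enforced, so they are skipped.
function isActionableConsoleError(text) {
  return !text.trimStart().startsWith(REPORT_ONLY_PREFIX);
}

// Abbreviated and illustrative sample messages (assumptions, not real logs).
const sample = [
  "[Report Only] Refused to connect to 'https://stats.g.doubleclick.net/j/collect' because it violates the following Content Security Policy directive: \"connect-src 'self'\".",
  "Failed to load resource: the server responded with a status of 404",
];

// Keep only the messages that should be considered for alerting.
const actionable = sample.filter(isActionableConsoleError);
console.log(actionable);
```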

rjohnson2011 commented 3 years ago

Update: I have created a Puppeteer script that hits the VA staging homepage and logs all console errors. Once the page has finished loading, a filter checks for build errors - the type seen in the screenshot above, i.e. 'Failed to load resource' - and if this error is present, returns true to notify the user of a critical error on the site. Here is a link to the gist: https://gist.github.com/rjohnson2011/483ce2dc7081a3814c0679457e8f68d3

This was passed on to @omgitsbillryan on 4/22 and he is actively working on implementing this script to AWS Canary. https://github.com/department-of-veterans-affairs/devops/pull/9050

rjohnson2011 commented 3 years ago

5/3 Update: Synced with @omgitsbillryan on 4/30 to go over Puppeteer build script deployment to AWS. The script is running successfully on AWS and triggering alerts if a build error is caught in the console.

PR: https://github.com/department-of-veterans-affairs/devops/pull/9050

jhouse-solvd commented 3 years ago

@mchelen - Would be great to get your input on this. Please see the recent notes above. Do you have access to view the console and/or relevant alerts?

asg5704 commented 3 years ago

@rjohnson2011 @jhouse-solvd I have a couple of questions. Are these synthetic events running against the live site VA.gov? There are some errors logged in Sentry (5k now) http://sentry.vfs.va.gov/organizations/vsp/issues/29941/ and looking at the logs looks like some type of automation (using HtmlUnit library). Just wanted to verify this isn't us.

empireofryan commented 3 years ago

Looping in @omgitsbillryan to the convo. When Bill and I last spoke we discussed running the script once every 15 minutes. If the errors are coming in at that frequency it's probably us, otherwise not.

It is being run against staging.va.gov. It is now using the AWS Synthetics library to trigger the automated testing tool. HtmlUnit looks to be something different (a Java-based headless tool), though I'm not 100% sure what AWS Synthetics uses under the hood.

asg5704 commented 3 years ago

Thanks Ryan! The events we were seeing in Sentry are mostly hitting the live site (staging not as much). I will confirm the frequency of events in the logs. I did double check the AWS Synthetics library and it looks like it's using Selenium under the hood.

omgitsbillryan commented 3 years ago

We're currently seeing a lot of instability & unreliability -

Most of the errors (~90% I'd guess) come in the final step of logging in:

I chatted with @empireofryan / @rjohnson2011 and it doesn't seem like we need to be logging in through ID.me in order to check for resources not loading on the page. I'm going to rip out all of that logic and focus only on assets loading.

Side note - I think a canary that runs through login could be useful.

jhouse-solvd commented 3 years ago

@omgitsbillryan - can you drop your notes with updates and recent findings?

mchelen-gov commented 3 years ago

@omgitsbillryan yeah i think the scope of this ticket is to ensure www.va.gov homepage is loading successfully without error, and monitoring of login function might be good to spin off as a followup

drorva commented 3 years ago

@jhouse-solvd anything I can do to help wrap this up?

jhouse-solvd commented 3 years ago

@omgitsbillryan - before we close this, can we post a screenshot of the alert functioning as expected, and:

to discuss: what would it take to simulate the outage that originally caused this in DEV?

cc: @drorva

omgitsbillryan commented 3 years ago

An actual alert in slack from PagerDuty - link.

The alert could be simulated by modifying the Javascript in the canary. Something like,

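// Collect console messages that indicate a resource failed to load.
// forEach is used because the callback only has a side effect.
consoleAlerts.forEach((alert) => {
  if (alert.includes("Failed to load resource")) {
    criticalAlerts.push(alert);
  }
});

/* Add this line to simulate a failure */
criticalAlerts.push("Fake javascript error for testing");

// Throwing fails the canary run, which in turn triggers the alarm.
if (criticalAlerts.length > 0) {
  throw new Error("Critical page errors discovered!\n- " + criticalAlerts.join("\n- "));
}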

The AWS alarm was created manually by me. In the future, we'd like this alarm to be created & managed in Terraform - presumably by the canary module we created.

The canary is currently watching the staging environment, which worked out well since it caught the issue before it made its way into prod - #vfs-platform-support slack link.

mchelen-gov commented 3 years ago

@omgitsbillryan this is so great to see! such a perfect example of monitoring catching critical issues.

@jhouse-solvd is there a followup ticket to get these 2 initial monitors (this and https://github.com/department-of-veterans-affairs/va.gov-team/issues/17631) into terraform?

omgitsbillryan commented 3 years ago

I've been using https://github.com/department-of-veterans-affairs/va.gov-team/issues/17631 as a dumping ground for all work I've done in this space - probably not ideal 😬 .

All 3 canaries we currently have are (mostly) managed by Terraform. The only part that isn't is the AWS alarm, which is still created manually. I wrote up a ticket to have alarm creation managed by Terraform as well - link.

mchelen-gov commented 3 years ago

@omgitsbillryan ah ok got it, thanks for explaining!

drorva commented 3 years ago

@omgitsbillryan @jhouse-solvd not seeing any progress on this for the last week either in here or the linked tickets. Can you please update this ticket as you make progress?

omgitsbillryan commented 3 years ago

I think we can call this complete. I read through and checked off all the boxes in the OP. There's still more work to be done in this space that we could/should do such as :

implement the cloudwatch alarm into the terraform module so we don't need to create one manually every time we create a new canary
modify the javascript for this canary to also look at prod (right now it only looks at staging)

mchelen-gov commented 3 years ago

@omgitsbillryan we probably need this monitor on prod as well, chances are that any problem would occur on staging first but better safe than sorry

drorva commented 3 years ago

We for sure need to have this monitoring prod before we call it done. The cloudwatch alarm into terraform can wait till we create additional canaries.

jhouse-solvd commented 3 years ago

@drorva - we've added an additional task and clarified the definition of done to make sure this is working for prod.

@omgitsbillryan - please see updated ticket and comment here when this is working. thank you!

mchelen-gov commented 3 years ago

@omgitsbillryan @jhouse-solvd probably this needs to not alert on [Report Only] console messages like:

[Report Only] Refused to load the image 'https://s3-us-gov-west-1.amazonaws.com/content.www.va.gov/img/design/icons/apple-touch-icon.png' because it violates the following Content Security Policy directive: "img-src 'self' data: blob: https://*.gstatic.com https://api.mapbox.com https://www.google-analytics.com https://www.googletagmanager.com https://stats.g.doubleclick.net https://*.va.gov https://optimize.google.com https://gateway.foresee.com https://static.foresee.com https://cdn-prod.kampyle.com https://prod-va-gov-assets.s3-us-gov-west-1.amazonaws.com https://ok6static.oktacdn.com https://dvp-oauth-application-directory-logos.s3-us-gov-west-1.amazonaws.com".

omgitsbillryan commented 3 years ago

@mchelen the script already does not alert on that error message. We only alert on errors containing the text, Failed to load resource.

We've sporadically been seeing a lot of these errors that trigger a false positive :

CONSOLE ERROR Failed to load resource: net::ERR_CONNECTION_RESET

I've added a PR to exclude them. Fewer false positives will allow us to be more aggressive with alerting thresholds.

As a side note, we may want to investigate the root of these net::ERR_CONNECTION_RESET errors - I think it may be some slight misconfiguration of our nginx revproxy, but AFAICT it's benign from the perspective of users of va.gov.

drorva commented 3 years ago

@omgitsbillryan well, that's not reassuring. @meganhkelley can you please collaborate with @omgitsbillryan and create a ticket to investigate how often and why we're getting these errors? I'm not sure which of the two teams should own this, so maybe you and @jhouse-solvd can decide together which team should own it.

meganhkelley commented 3 years ago

Hey @omgitsbillryan @jhouse-solvd ! Could y'all jot down some steps to reproduce the error in question, and FE can take a look?

CONSOLE ERROR Failed to load resource: net::ERR_CONNECTION_RESET

jhouse-solvd commented 3 years ago

@omgitsbillryan - that error could be seen in the developer console sporadically, correct?

omgitsbillryan commented 3 years ago

That's correct that it's sporadic. I did some extra logging and found that this error always presents itself in conjunction w/

REQUEST FAILED net::ERR_CONNECTION_RESET https://resource.digital.voice.va.gov/wdcvoice/5/onsite/embed.js

Even when this error occurs, I can confirm by looking at canary run screenshots, that the page is still visibly loading properly.

drorva commented 3 years ago

@omgitsbillryan that's helpful. Looks like this is Medallia analytics. So let's change the ignore to specifically ignore this URL, rather than ERR_CONNECTION_RESET
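
One way the narrower ignore rule suggested above could look, as a sketch under stated assumptions rather than the change that was actually merged. `IGNORED_URLS` and `shouldAlertOnFailure` are hypothetical names, and the `{ url, errorText }` shape is an assumption about how failed requests are recorded (for example from Puppeteer's `requestfailed` event). The URL is the one reported earlier in this thread.

```javascript
// Sketch only: IGNORED_URLS and shouldAlertOnFailure are hypothetical names.
const IGNORED_URLS = new Set([
  "https://resource.digital.voice.va.gov/wdcvoice/5/onsite/embed.js",
]);

// failure: { url: string, errorText: string }
// Ignore by exact URL instead of by error code, so that a connection reset
// on any other asset still raises an alert.
function shouldAlertOnFailure(failure) {
  return !IGNORED_URLS.has(failure.url);
}

// Illustrative input: the second URL is made up for the example.
const failedRequests = [
  { url: "https://resource.digital.voice.va.gov/wdcvoice/5/onsite/embed.js", errorText: "net::ERR_CONNECTION_RESET" },
  { url: "https://www.va.gov/example-asset.js", errorText: "net::ERR_CONNECTION_RESET" },
];

// Only failures that are not on the ignore list remain.
const alertable = failedRequests.filter(shouldAlertOnFailure);
console.log(alertable.map((f) => f.url));
```

Matching on the full URL keeps the exclusion as narrow as possible; a broader rule (for example by hostname) would be a separate design decision.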

jhouse-solvd commented 3 years ago

@drorva - would it be okay for us to create that as a separate ticket and plan for an upcoming sprint?

ie. "Ignore Medallia analytics errors in Synthetic Monitoring script" or something to that effect?

I'm sensitive to scope creep and want to make sure that we prioritize accordingly.

drorva commented 3 years ago

I don't think so. @omgitsbillryan is already ignoring ERR_CONNECTION_RESET, so it's just a question of making it a bit more specific. If it's more complicated than that, then yes, we can create a different ticket.

jhouse-solvd commented 3 years ago

@omgitsbillryan - I've updated the AC based on Dror's comment above; please take a look as time allows. Let us know if there are any questions or concerns.

jhouse-solvd commented 3 years ago

@mchelen - do you mind mentioning/linking any PR that you might be working on? @omgitsbillryan mentioned that you and @empireofryan might be collaborating on a pr and just want to make sure we have it all tied together through this ticket.

jhouse-solvd commented 3 years ago

This task is done. If there is additional work to be done for the monitors that have been created, we will define those in new tasks.

Closing.
