department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Add synthetic monitoring of prod Facility Locator #9399

Open swirtSJW opened 2 years ago

swirtSJW commented 2 years ago

User Story or Problem Statement

As a maintainer of the Facility Locator, I want synthetic monitoring of the Facility Locator so that we are alerted if Veterans are not getting results for their queries.

Recommended as a result of Postmortem

Acceptance Criteria

zachclarity commented 2 years ago

Steps

1 - Go to Max.Gov Jira HERE

2 - Submit a request on Max.Gov for access to DataDog (instance name: vagov.ddog-gov.com)

3 - Check your email inbox for an invite to Okta

4 - Log in to Ablevets Okta HERE

5 - Launch DataDog from Ablevets Okta under “My Apps” (app name: vagov-va-datadog)

jtmst commented 2 years ago

Completed with the addition of this UX monitoring browser test in datadog:

https://vagov.ddog-gov.com/synthetics/details/5jw-dyv-4ey

jilladams commented 5 months ago

Noting:

This original monitor was among the first of its kind for Sitewide, created before Code Yellow / Watch Officer existed and before we had teams / dashboards in Facilities. As a result, it alarmed to the #oncall channel in DSVA Slack for Platform response, and the Facilities team was not responsible for triage, afaik.

The ticket here was to let us know if Veterans don't get Facility Locator response as expected from a browser test. We now have a few other monitors on Facility Locator API endpoints that will alert us to anomalies in traffic. We don't have any other browser synthetic test.
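To make "response as expected" concrete, here is a minimal Python sketch of the kind of assertion a synthetic check could apply to a known-good search. It is illustrative only: the function name `facility_search_ok` and the payload shape (a JSON object with a `data` list) are assumptions, not the actual Facility Locator API contract or the Datadog test definition.

```python
from typing import Any, Mapping


def facility_search_ok(status_code: int, payload: Mapping[str, Any]) -> bool:
    """Return True only if a known-good search looks healthy.

    Assumption (hypothetical): a healthy search returns HTTP 200 and a JSON
    body whose "data" key holds a non-empty list of facilities.
    """
    if status_code != 200:
        return False
    results = payload.get("data")
    # An empty or missing list means the Veteran would see no results.
    return isinstance(results, list) and len(results) > 0
```

This only makes sense for a query that is expected to always return facilities, since an empty result can be legitimate for other searches.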

It might be worth reviving this and creating a similar new monitor if Plat can't help us surface the old one / revise it. Plat thread opened here about the fact it's gone now: https://dsva.slack.com/archives/CBU0KDSB1/p1705532707740479

FYI @xiongjaneg

jilladams commented 5 months ago

From Plat:

the resources pointed to by this url do not exist. You can go ahead and recreate it.

Reopening to track in the backlog.

xiongjaneg commented 5 months ago

Noting other resources are available from the datadog channel, Adrian Rollett, etc.

xiongjaneg commented 5 months ago

Please add your planning poker estimate with Zenhub @eselkin

eselkin commented 5 months ago

I've tried creating a synthetic test for facility locator and it doesn't work. The browser synthetic test can no longer load WebGL which it needs for facility locator.

jilladams commented 5 months ago

Noted you'd flagged that concern before, when this came up in refinement today. @eselkin I'd love to get a clear sense of what is different now from when the original synthetic monitor was created / worked. It's not that I don't believe you; we just need to sort out how it worked before and why it doesn't now. If there's something Datadog owners could enable that would unblock it, or if WebGL is needed, we could potentially file this as a feature request with Datadog on behalf of VA as well. Any screenshots of what happens when we try would probably help with sorting that out.

mmiddaugh commented 5 months ago

@xiongjaneg two additional notes from refinement which may need to be reflected in AC

xiongjaneg commented 5 months ago

Please add your planning poker estimate with Zenhub @maxx1128

eselkin commented 5 months ago

@jilladams I created a synthetic test here when we were noticing issues: https://vagov.ddog-gov.com/synthetics/details/apx-2wv-n92?from_ts=1705615225518&to_ts=1706220025518&live=true

@mmiddaugh Name of the monitor I created is [Facilties] Facility Locator

You can see it says "PASSED" (I tried running the test with all browsers but all had the same issue)

Screenshot 2024-01-25 at 2 06 26 PM

but the screenshot and error messages at the bottom of the page tell everything. The screenshot shows no loaded Facility Locator because of the errors.

step-0__1706216412661

The errors show:

Screenshot 2024-01-25 at 2 03 12 PM
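
A run that reports "PASSED" while the page never renders suggests the verdict should also depend on what the page logged. The Python sketch below shows one way to express that rule. It is an assumption-laden illustration: `effective_verdict` is a hypothetical helper, not a Datadog feature, and the console errors are passed in as plain strings rather than read from any real API.

```python
from typing import Iterable


def effective_verdict(reported_status: str, console_errors: Iterable[str]) -> str:
    """Downgrade a reported pass to a failure when the page logged errors.

    Both arguments are hypothetical inputs: the status string reported by
    the test runner and the error-level console messages captured in the run.
    """
    # Ignore blank entries so empty log lines do not trigger a failure.
    errors = [message for message in console_errors if message.strip()]
    status = reported_status.upper()
    if status == "PASSED" and errors:
        return "FAILED"
    return status
```

With this rule, a run like the one above would count as a failure whenever any console error was captured, even though every step passed.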

davidmpickett commented 1 month ago

@jilladams @eselkin This doesn't have points or a sprint assigned. Is this actually something that should be considered for Sprint 4?

jilladams commented 1 month ago

Same here: weird workflow problem, not sure what happened. Moving to backlog.

jilladams commented 1 month ago

Noting: today the Facility Locator experienced a spike in 403 errors from the new Facilities-api v2 endpoint: https://vagov.ddog-gov.com/apm/services/facility-locator/operations/rack.request/resources?dependencyMap=qson%3A%28data%3A%28telemetrySelection%3Aall_sources%29%2Cversion%3A%210%29&env=eks-prod&fromUser=true&groupMapByOperation=null&panels=qson%3A%28data%3A%28%29%2Cversion%3A%210%29&resources=qson%3A%28data%3A%28visible%3A%21t%2Chits%3A%28selected%3Atotal%29%2Cerrors%3A%28selected%3Atotal%29%2Clatency%3A%28selected%3Ap95%29%2CtopN%3A%215%29%2Cversion%3A%211%29&summary=qson%3A%28data%3A%28visible%3A%21t%2Cerrors%3A%28selected%3Acount%29%2Chits%3A%28selected%3Acount%29%2Clatency%3A%28selected%3Alatency%2Cslot%3A%28agg%3A75%29%2Cdistribution%3A%28isLogScale%3A%21f%29%2CshowTraceOutliers%3A%21f%29%2Csublayer%3A%28slot%3A%28layers%3Aservice%29%2Cselected%3Apercentage%29%29%2Cversion%3A%211%29&view=spans&start=1716924577835&end=1716926350000&paused=true

We are still triaging, and it's not clear exactly what happened / is happening here, but depending on what we find today, we should try to prioritize Facility Locator monitoring.
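
As a rough illustration of what "a spike in 403 errors" could mean for a monitor, the Python sketch below computes the share of responses with a given status code and compares it to a threshold. The function names and the 5% default threshold are hypothetical placeholders, not values taken from any existing Datadog monitor.

```python
from typing import Sequence


def status_rate(status_codes: Sequence[int], code: int = 403) -> float:
    """Fraction of responses in the window that returned `code`."""
    if not status_codes:
        return 0.0  # No traffic in the window: nothing to measure.
    return sum(1 for status in status_codes if status == code) / len(status_codes)


def is_spike(status_codes: Sequence[int], code: int = 403, threshold: float = 0.05) -> bool:
    """True when the rate of `code` exceeds the (hypothetical) threshold."""
    return status_rate(status_codes, code) > threshold
```

A real monitor would more likely compare against a baseline than a fixed threshold, but the fixed version keeps the idea easy to read.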