department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Discovery: How might MHV on VA.gov alert users about EHR system downtime? #70798

Closed wesrowe closed 8 months ago

wesrowe commented 9 months ago

Description

User story

As an authenticated user who has access to health tools, I want to be notified in the case that a back-end system has a scheduled or unscheduled downtime window before I have a bad experience.

Notes

Right after we started this story we learned the tool teams had a PR up for adding downtime alerts to their apps:

Goals for the solution:

Acceptance criteria

Tasks

wesrowe commented 9 months ago

refinement notes:

radavis commented 9 months ago

There's an opportunity to refactor the DowntimeNotification component, removing unused code. The call to getGlobalDowntime() has been commented out for 3 years.

dcloud commented 9 months ago

The presence of the DowntimeNotification component and the externalServices object that ties into PagerDuty suggests to me that scheduled or expected downtime already has a solution. Next steps, based on Platform Downtime Notifications, are:

It's unclear what steps would be taken for unplanned downtime, aka an outage. PagerDuty certainly has mechanisms for reacting to outages, but I don't know if/how that is set up for VA PagerDuty. I'm not sure if an outage can or should be handled as a maintenance window.

dcloud commented 9 months ago

Began a doc for this at https://github.com/department-of-veterans-affairs/va.gov-team/pull/72520/files?short_path=db315a8#diff-db315a8adad6908236b3c8cd9f4d60ac6a40ac2dbdd57cd9048cdb50523450e5

wesrowe commented 9 months ago

Notes from sync with Sara, Keith, Jim, Wes:

dcloud commented 9 months ago

I've drafted a doc in va.gov-team for general findings. It needs work and updates, since the MHV app/services teams have merged their PR.

keithcheung828 commented 8 months ago

Updated Mural for downtime notifications design

dcloud commented 8 months ago

@keithcheung828 We can confirm that the DowntimeNotification component allows us to customize what gets rendered, with the minor caveat that we would need to handle logic around approaching vs current downtime in addition to the logic for what the combination of alerts would look like.
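
To make the "approaching vs current downtime" logic concrete, here is a minimal sketch of how a maintenance window could be classified. Everything in it is an assumption for illustration: the `MaintenanceWindow` shape, the `getDowntimeStatus` helper, the `mhv_sm` service name, and the one-hour warning lead time are hypothetical and are not taken from the DowntimeNotification component.

```typescript
// Hypothetical types for illustration; not the platform component's API.
type DowntimeStatus = "ok" | "approaching" | "down";

interface MaintenanceWindow {
  service: string; // illustrative service key, e.g. "mhv_sm"
  startTime: Date;
  endTime: Date;
}

// Assumed lead time: warn users one hour before a window starts.
const APPROACHING_MS = 60 * 60 * 1000;

function getDowntimeStatus(
  mw: MaintenanceWindow,
  now: Date,
  approachingMs: number = APPROACHING_MS,
): DowntimeStatus {
  const t = now.getTime();
  const start = mw.startTime.getTime();
  const end = mw.endTime.getTime();

  // Inside the window: the service is currently down.
  if (t >= start && t < end) return "down";
  // Before the window, but within the warning lead time.
  if (t < start && start - t <= approachingMs) return "approaching";
  // Either well before the window or after it has ended.
  return "ok";
}
```

A custom render function could then switch on the returned status to decide between a warning-style alert ("approaching") and an error-style alert ("down").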

radavis commented 8 months ago

When reading this user story, I began to wonder how we might handle a service outage and wrote out some thoughts.

User Story: As an authenticated user who has access to health tools, I want to be notified in the case that a back-end system has a scheduled or unscheduled downtime window before I have a bad experience.

Unscheduled downtime, as in "a supporting service will be taken down, soon/now" can be handled as a scheduled downtime event, and a corresponding maintenance window for the service can be created within PagerDuty, triggering the DowntimeNotification component.

Unscheduled downtime, as in "a supporting service has become unresponsive," aka an outage, should be handled:

  1. gracefully by FE code when CRUDing resources,
  2. by the breakers gem on the BE, preventing requests to the downstream service and responding with an appropriate HTTP status code,
  3. by ops teams, who would see that one or more services have become unresponsive,
  4. by team members, who would create a maintenance window within PagerDuty for the downed service as part of the incident response procedure.

Note that the breaker status of services can be found in Datadog.

Perhaps there is an opportunity here to have vets-api update PagerDuty when a breaker trips on a service integration, automatically creating a maintenance window.
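
As a rough sketch of that idea, the function below builds the request body for a new maintenance window when a breaker trips. It is illustrative only: vets-api is a Ruby app, so a real implementation would not be TypeScript, and the `BreakerTrip` shape, the `buildMaintenanceWindow` helper, the service ID, and the 30-minute default duration are all hypothetical. The payload shape reflects my understanding of PagerDuty's REST API for creating maintenance windows and should be checked against the PagerDuty docs before use.

```typescript
// Hypothetical description of a tripped breaker; not a vets-api type.
interface BreakerTrip {
  pagerDutyServiceId: string; // assumed mapping from a vets-api service to a PagerDuty service ID
  serviceName: string;
  trippedAt: Date;
}

// Builds a request body for creating a PagerDuty maintenance window.
// The default duration is an assumption; an outage has no known end time.
function buildMaintenanceWindow(trip: BreakerTrip, durationMinutes: number = 30) {
  const start = trip.trippedAt;
  const end = new Date(start.getTime() + durationMinutes * 60 * 1000);
  return {
    maintenance_window: {
      type: "maintenance_window",
      start_time: start.toISOString(),
      end_time: end.toISOString(),
      description: `Automatic window: breaker tripped for ${trip.serviceName}`,
      services: [{ id: trip.pagerDutyServiceId, type: "service_reference" }],
    },
  };
}
```

Because an outage has no scheduled end, whatever creates the window would also need a way to extend or close it once the breaker resets. That part is not sketched here.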

dcloud commented 8 months ago

FWIW, the maintenance banner says:

This component is ONLY for site-wide system status messages. There is no other appropriate use.

wesrowe commented 8 months ago

We need to carry this story into sprint 22. Engineering and Design still need to come together on an approach. The AC is "Drafted recommended approach across landing page and tools."

keithcheung828 commented 8 months ago

Consulted with the DS team and confirmed that the downtime alert design is based on the alert page in the design system.

wesrowe commented 8 months ago

Some interesting discoveries that will guide our recommendation:

wesrowe commented 8 months ago

recommendation scope from meeting 1/8 (Daniel, Keith, Wes):

dcloud commented 8 months ago

Also in our discussion, it seemed we were focused on timed maintenance windows, aka scheduled downtime. PagerDuty will provide start and end times regardless of whether the maintenance window was planned/scheduled or not.

The platform component doesn't mention start or end times, perhaps assuming that the unscheduled downtime case is more common than the scheduled maintenance window.

wesrowe commented 8 months ago

I told Janie (PO for SM) in this thread that we would share our recommendation there when it's ready. There's another thread where I should post it also when the time comes.

keithcheung828 commented 8 months ago

Mural updated to contain:

  1. MHV downtime

    Screenshot 2024-01-10 at 10.56.18 AM.png
  2. Single tool downtime

    Screenshot 2024-01-10 at 12.00.36 PM.png
  3. Multi tool downtime

    Screenshot 2024-01-10 at 12.10.18 PM.png
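
The three cases above could be selected with logic along these lines. This is a minimal sketch under stated assumptions: the `AlertVariant` type, the `chooseAlertVariant` helper, and service keys such as `mhv_platform` are hypothetical names used for illustration, not identifiers from the platform code or the Mural.

```typescript
// Hypothetical result type mirroring the three designs, plus "none".
type AlertVariant =
  | { kind: "none" }
  | { kind: "mhv" }
  | { kind: "single-tool"; tool: string }
  | { kind: "multi-tool"; tools: string[] };

// downServices: keys of services currently in a maintenance window.
// platformService: assumed key meaning "all of MHV is down".
function chooseAlertVariant(
  downServices: string[],
  platformService: string = "mhv_platform",
): AlertVariant {
  // 1. MHV downtime takes precedence over any individual tool.
  if (downServices.indexOf(platformService) !== -1) return { kind: "mhv" };

  // De-duplicate and sort so the rendered list is stable.
  const tools = downServices
    .filter((s, i) => downServices.indexOf(s) === i)
    .sort();
  const [first, ...rest] = tools;

  if (first === undefined) return { kind: "none" };
  // 2. Single tool downtime.
  if (rest.length === 0) return { kind: "single-tool", tool: first };
  // 3. Multi tool downtime.
  return { kind: "multi-tool", tools };
}
```

If alerts cannot name the specific tools that are down, the "single-tool" and "multi-tool" branches would collapse into one generic variant.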

keithcheung828 commented 8 months ago

  1. If alerts are not able to specify tools that are down, an alternative design is shown below: Screenshot 2024-01-10 at 12.14.49 PM.png

dcloud commented 8 months ago

I merged the discovery doc, so it's editable on the main branch now: https://github.com/department-of-veterans-affairs/va.gov-team/blob/master/products/health-care/digital-health-modernization/engineering/mhv-downtime-notification-discovery.md

keithcheung828 commented 8 months ago

Please see the updated mural link.

wesrowe commented 8 months ago

@keithcheung828, is this screenshot from the mural accurate? It's supposed to be the current state of the Inbox page, in the pre-maintenance timeframe. But it doesn't have an H1.

image

wesrowe commented 8 months ago

@keithcheung828 made final changes to mural. closing!