wesrowe closed 8 months ago
refinement notes:
There's an opportunity to refactor the DowntimeNotifications component, removing unused code. The call to getGlobalDowntime() has been commented out for 3 years.
The presence of the DowntimeNotification component and the externalServices object that ties into PagerDuty suggests to me that scheduled or expected downtime has an existing solution, and next steps, based on Platform Downtime Notifications, are:
DowntimeNotification and assign dependencies. The landing page would depend on the "all mhv" service. Apps like medications would depend on the "all mhv" service plus a medications-specific service.
It's unclear what steps would be taken for unplanned downtime, aka an outage. PagerDuty certainly has mechanisms for reacting to outages, but I don't know if/how that is set up for VA PagerDuty. I'm not sure if an outage can or should be handled as a maintenance window.
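To illustrate the dependency assignment described above, here is a minimal sketch. The service keys (`mhvPlatform`, `mhvMedications`, `mhvSecureMessaging`), the app names, and the `isAffected` helper are all hypothetical; the real keys would come from the platform's externalServices object.

```typescript
// Hypothetical service keys, standing in for entries in externalServices.
type ServiceKey = "mhvPlatform" | "mhvMedications" | "mhvSecureMessaging";

// Hypothetical app names for the landing page and two tools.
type AppName = "landingPage" | "medications" | "secureMessaging";

// The "all mhv" service that every MHV app depends on.
const ALL_MHV: ServiceKey = "mhvPlatform";

// Each app lists the services whose downtime should trigger its alert.
const appDependencies: Record<AppName, ServiceKey[]> = {
  landingPage: [ALL_MHV],
  medications: [ALL_MHV, "mhvMedications"],
  secureMessaging: [ALL_MHV, "mhvSecureMessaging"],
};

// True when at least one of the app's dependencies is currently down.
function isAffected(app: AppName, downServices: ServiceKey[]): boolean {
  return appDependencies[app].some((dep) => downServices.indexOf(dep) !== -1);
}
```

With this shape, a maintenance window on the "all mhv" service affects every app, while a medications-only window leaves the landing page untouched.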
Notes from sync with Sara, Keith, Jim, Wes:
I've drafted a doc in va.gov-team for general findings. It needs work and updates since the MHV app/services teams have merged their PR.
Updated Mural for downtime notifications design
@keithcheung828 We can confirm that the DowntimeNotification component allows us to customize what gets rendered, with the minor caveat that we would need to handle logic around approaching vs current downtime in addition to the logic for what the combination of alerts would look like.
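A rough sketch of the approaching vs current logic we would need to own, and of collapsing several services into one alert state. The status names, the `MaintenanceWindow` shape, and the one-hour warning threshold are assumptions for illustration, not the platform's actual values.

```typescript
// Assumed status names; the platform may use different ones.
type DowntimeStatus = "ok" | "downtimeApproaching" | "down";

// Assumed shape of a maintenance window as surfaced from PagerDuty.
interface MaintenanceWindow {
  startTime: Date;
  endTime: Date;
}

// Assumption: warn users starting one hour before the window opens.
const APPROACHING_MS = 60 * 60 * 1000;

// Classify a single window relative to the current time.
function classify(
  mw: MaintenanceWindow,
  now: Date,
  warnMs: number = APPROACHING_MS
): DowntimeStatus {
  const t = now.getTime();
  const start = mw.startTime.getTime();
  const end = mw.endTime.getTime();
  if (t >= start && t < end) return "down";
  if (t < start && start - t <= warnMs) return "downtimeApproaching";
  return "ok";
}

// Combine statuses for several services: the most severe one wins.
function combine(statuses: DowntimeStatus[]): DowntimeStatus {
  if (statuses.indexOf("down") !== -1) return "down";
  if (statuses.indexOf("downtimeApproaching") !== -1) return "downtimeApproaching";
  return "ok";
}
```

Whether "most severe wins" is the right rule for combined alerts is a design question for the Mural, so treat `combine` as a placeholder.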
When reading this user story, I began to wonder how we might handle a service outage and wrote out some thoughts.
User Story: As an authenticated user who has access to health tools, I want to be notified in the case that a back-end system has a scheduled or unscheduled downtime window before I have a bad experience.
Unscheduled downtime, as in "a supporting service will be taken down, soon/now" can be handled as a scheduled downtime event, and a corresponding maintenance window for the service can be created within PagerDuty, triggering the DowntimeNotification component.
Unscheduled downtime, as in "a supporting service has become unresponsive," aka an outage, should be handled 1) gracefully by FE code when CRUDing resources, 2) by the breakers gem on the BE preventing requests to the downstream service and responding with an appropriate HTTP status code, 3) by ops teams, who would see that one or more services have become unresponsive, 4) by team members, who would create a maintenance window within PagerDuty for the downed service as part of the incident response procedure. Note that the breaker status of services can be found in Datadog.
Perhaps there is an opportunity here to have vets-api update PagerDuty when a breaker trips on a service integration, automatically creating a maintenance window.
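To make that idea a bit more concrete, here is a sketch of turning a breaker-trip event into a maintenance window request body. It is written in TypeScript for illustration only (vets-api itself is Ruby), the `BreakerEvent` shape and 30-minute default are made up, and the payload fields are modeled on my understanding of PagerDuty's maintenance window API, so they should be checked against the PagerDuty docs before anyone relies on them.

```typescript
// Hypothetical event emitted when a breaker trips on a service integration.
interface BreakerEvent {
  serviceId: string; // assumed to map to a PagerDuty service ID
  serviceName: string;
  trippedAt: Date;
}

// Assumed request body shape; verify against PagerDuty's API reference.
interface MaintenanceWindowRequest {
  maintenance_window: {
    type: "maintenance_window";
    start_time: string;
    end_time: string;
    description: string;
    services: { id: string; type: "service_reference" }[];
  };
}

// Build a window that starts when the breaker tripped and lasts a fixed time.
// The default duration is an arbitrary placeholder.
function buildMaintenanceWindow(
  event: BreakerEvent,
  durationMinutes: number = 30
): MaintenanceWindowRequest {
  const end = new Date(event.trippedAt.getTime() + durationMinutes * 60 * 1000);
  return {
    maintenance_window: {
      type: "maintenance_window",
      start_time: event.trippedAt.toISOString(),
      end_time: end.toISOString(),
      description: `Automatic window: breaker tripped for ${event.serviceName}`,
      services: [{ id: event.serviceId, type: "service_reference" }],
    },
  };
}
```

This only builds the payload; sending it, authenticating, and ending the window early when the breaker closes are all left open.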
FWIW, the maintenance banner says:
This component is ONLY for site-wide system status messages. There is no other appropriate use.
We need to carry this story into sprint 22. Engineering and Design still need to come together on an approach. The AC is "Drafted recommended approach across landing page and tools."
Consulted with the DS team, who confirmed that the downtime alert design is based on the alert page in the design system.
Some interesting discoveries that will guide our recommendation:
recommendation scope from meeting 1/8 (Daniel, Keith, Wes):
Also in our discussion, it seemed we were focused on timed maintenance windows, aka scheduled downtime. PagerDuty will provide start and end times regardless of whether the maintenance window was planned/scheduled or not.
The platform component doesn't mention start or end times, perhaps assuming that the unscheduled downtime case is more common than the scheduled maintenance window.
I told Janie (PO for SM) in this thread that we would share our recommendation there when it's ready. There's another thread where I should post it also when the time comes.
Mural updated to contain:
MHV downtime
Single tool downtime
Multi tool downtime
I merged the discovery doc so it's editable on the main branch now: https://github.com/department-of-veterans-affairs/va.gov-team/blob/master/products/health-care/digital-health-modernization/engineering/mhv-downtime-notification-discovery.md
Please see the updated mural link.
@keithcheung828, is this screenshot from the mural accurate? It's supposed to be the current state of the Inbox page, in the pre-maintenance timeframe. But it doesn't have an H1.
@keithcheung828 made final changes to mural. closing!
Description
User story
As an authenticated user who has access to health tools, I want to be notified in the case that a back-end system has a scheduled or unscheduled downtime window before I have a bad experience.
Notes
va-alert component
yarn mock-api
Slack thread
Right after we started this story we learned the tool teams had a PR up for adding downtime alerts to their apps:
Goals for the solution:
Acceptance criteria
Tasks