Closed: patphongs closed this issue 4 years ago
Preliminary resources: Hosted system status page: https://status.io/ https://www.atlassian.com/software/statuspage
Usability: Visibility of system status: https://www.nngroup.com/articles/visibility-system-status/
Examples using statuspage.io: https://cloudgov.statuspage.io/ https://www.githubstatus.com/#past-incidents https://status.mural.co/ https://status.slack.com/
Cloud.gov has documentation on their process that we can review: https://cloud.gov/docs/ops/service-disruption-guide/
Who is the status page for? Our current understanding is that the status page is for a non-technical audience that needs quick, informative updates when something unexpected happens.
Which projects/products need to be included on this status page, and which are covered by a banner update instead? Do we need to change our banner processes?
Projects/Products:
Functionality:
How soon is a message posted?
Status page functionality that we want:
Next steps:
The Slack conversation about the status page starts here: https://fecgov.slack.com/archives/C3X3K6EVA/p1592416026408900
Spreadsheet for service comparison: https://docs.google.com/spreadsheets/d/1ykAuWjL65uZm-sLt7IcOEWmz6h6EAP4lUYCd-5QXW4Q/edit#gid=0
Questions:
Transparency around your incident history and reliability builds trust with new and existing customers
Make it easier for teams to communicate with their customers during incidents.
How "most people" do it: dedicate a team to manage their status page.
https://support.atlassian.com/statuspage/docs/know-when-to-automate-your-status-page/
Maybe just Pingdom for our API? Possibly cloud.gov, but we might want more control.
Define which team(s) and roles own the Statuspage. This is crucial for initial implementation and longevity. It is better to sort this out on day one, before the first live incident.
Document access/account management in fec-accounts
Follow cloud.gov practices, generally. Ask at office hours if they would be willing to share their templates.
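On the automation question above (whether monitoring such as Pingdom should feed the status page for the API), here is a minimal sketch of one possible rule for turning recent uptime checks into a suggested status. Everything in it is an assumption for illustration: the status labels, the `ProbeResult` shape, and the `slow_ms` and `outage_after` thresholds are hypothetical and do not come from Pingdom, Statuspage, or cloud.gov. It makes no network calls and only suggests a status; a person would still decide what to post.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical status labels, for illustration only.
# A real status page provider defines its own values.
OPERATIONAL = "operational"
DEGRADED = "degraded"
OUTAGE = "outage"


@dataclass
class ProbeResult:
    """One uptime check against the API, as a monitoring service might report it."""
    http_status: Optional[int]  # None means the request failed or timed out
    latency_ms: float


def _failed(result: ProbeResult) -> bool:
    # Treat timeouts and 5xx responses as failures (an assumed rule).
    return result.http_status is None or result.http_status >= 500


def suggest_status(
    results: List[ProbeResult],
    slow_ms: float = 2000.0,   # assumed threshold for "slow"
    outage_after: int = 3,     # assumed number of consecutive failures
) -> Optional[str]:
    """Suggest a status from recent probes, ordered oldest first.

    Returns None when there is no data, so the page is left unchanged.
    """
    if not results:
        return None

    # Count consecutive failures at the end of the window.
    consecutive = 0
    for result in reversed(results):
        if not _failed(result):
            break
        consecutive += 1

    if consecutive >= outage_after:
        return OUTAGE
    if consecutive > 0 or results[-1].latency_ms > slow_ms:
        return DEGRADED
    return OPERATIONAL
```

For example, `suggest_status([ProbeResult(200, 150.0), ProbeResult(None, 0.0)])` suggests a degraded status after a single failed check, which could prompt a person to investigate before anything is posted. Requiring several consecutive failures before suggesting an outage is one way to avoid posting on a single blip, which relates to the "How soon is a message posted?" question.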
References: https://www.donnfelker.com/you-need-a-status-page/ https://hackernoon.com/build-a-great-status-page-in-15-minutes-with-no-budget-98257f67aef1 https://www.atlassian.com/incident-management/handbook#what-is-an-incident https://cloud.gov/docs/ops/service-disruption-guide/
@PaulClark2 @AmyKort @patphongs I would like to schedule a discussion with you all to go over this proposed language. Some of the language included is dependent on which status page service provider we go with, which is something we should discuss as well.
Here's a comparison between top services: https://docs.google.com/spreadsheets/d/1ykAuWjL65uZm-sLt7IcOEWmz6h6EAP4lUYCd-5QXW4Q/edit#gid=0 Within each service there are different tiers depending on the number of team members and other features we would need.
Draft language for the status page processes based on cloud.gov documentation: https://docs.google.com/document/d/1vE3zgh2Mh5h8h07ob5EXi9capgB-e5Ic0nR_R9Imb8Y/edit#heading=h.hlgjk513nsep
Draft example messages for status page posting: https://docs.google.com/document/d/1DapPOsnSN7Q9P3E6OM2BHqYlHKj5ELHjt49BRpFFG5U/edit#heading=h.cvhs00beblfp
cc: @lbeaufort @dorothyyeager
Summary
What we're after: As FEC product managers and developers, we need a strategy for when and what to post on a website status page so that we can inform public website users when there is a website incident.
Need rules for:
Completion criteria