bcgov / cloud-pathfinder

This is the technology and UX backend repo for the cloud pathfinder ZenHub task board
https://app.zenhub.com/workspaces/cloud-pathfinder-5e4dbb426c3c6af8dcbf06a7/board?repos=241742911
Creative Commons Zero v1.0 Universal

Incident Management Document - near term implementation plan #2950

Closed ThibaultBC closed 1 week ago

ThibaultBC commented 2 months ago

Describe the Issue
Describe our incident management procedure with an initial diagram and a high-level document.

Additional Context
This is an initial step; we will iterate on this document over time.

Description of Incident/Situation Management: Incident (or situation) management establishes processes that quickly identify, prioritize, and resolve incidents as they occur, so as to mitigate potential disruption to Ministry client operations.

Incident Management diagram - business requirements
We need to create a diagram documenting incident management for the AWS landing zone. An incident is an event in which an existing service is not working or not performing as expected.

Audience: The diagram is intended to be shared with clients using the AWS landing zone who want to know that our team is prepared to deal with outages and has a formal operational process in place.

Process Scope: The incident management flow in the diagram should reflect the improvements to our current approach to outages that we can introduce "tomorrow", without hiring more Platform Admins or a vendor company to manage our service. Incident management should focus only on "landing zone"-level outages; client application outages are out of scope.

Areas of potential outages: Incidents can happen in 3 areas of our service infrastructure: within AWS managed infrastructure, within our custom-built landing zone, and within the third-party firewall service that we procured from Checkpoint. The diagram should include a path for dealing with an outage in each of these 3 areas.

Source of information about an incident: An incident can be reported by a client, reported by our own monitoring service, or detected by a Platform Admin.
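To make the requirements above concrete before drawing the diagram, here is a minimal Python sketch of the routing they imply: three incident sources, three outage areas, and client application outages treated as out of scope. The class names, the `route` function, and especially the escalation targets in `ESCALATION` are illustrative assumptions; the issue does not say who handles each path.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Area(Enum):
    """The 3 areas where a landing-zone-level incident can occur."""
    AWS_MANAGED = "AWS managed infrastructure"
    LANDING_ZONE = "custom-built landing zone"
    FIREWALL = "third-party firewall service"


class Source(Enum):
    """How the team learns about an incident."""
    CLIENT = "reported by a client"
    MONITORING = "reported by our monitoring service"
    PLATFORM_ADMIN = "detected by a Platform Admin"


# ASSUMPTION: these handling paths are placeholders for illustration only.
# The actual paths are what the diagram is meant to define.
ESCALATION = {
    Area.AWS_MANAGED: "escalate to AWS support",
    Area.LANDING_ZONE: "resolve within the platform team",
    Area.FIREWALL: "escalate to the firewall vendor",
}


@dataclass(frozen=True)
class Incident:
    summary: str
    source: Source
    # None means the outage is in a client application, not the landing zone.
    area: Optional[Area]


def route(incident: Incident) -> str:
    """Return the handling path for an incident, or flag it as out of scope."""
    if incident.area is None:
        return "out of scope: client application outage"
    return ESCALATION[incident.area]
```

For example, `route(Incident("Firewall dropping traffic", Source.CLIENT, Area.FIREWALL))` selects the firewall path regardless of how the incident was reported; the source only records where the report came from.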

Acceptance Criteria

ThibaultBC commented 1 week ago

An iteration of the document was presented on June 18th. It is in a good state, with some adjustments still to make. Closing this ticket as the initial part of this effort is done; I will create a follow-up ticket for next steps.