balajialg opened this issue 3 years ago
@ryanlovett You had ideas about combining this issue with pre-existing after-action reports. Do you think #3539 is a viable next step, or did you have something else in mind?
@balajialg I think every incident should be followed up with a blameless incident report; see https://github.com/berkeley-dsep-infra/datahub/blob/staging/docs/admins/incidents/index.rst. Perhaps when there is an outage, you can create a GitHub issue that tracks the creation of the incident report and assign it to the admin with the most insight into it.
The reports should follow a template with a summary, timeline, and action items to prevent the issue from recurring. They should be published in the docs.
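For illustration, a minimal sketch of what such a template could look like in the docs, assuming the incident reports stay in reStructuredText alongside docs/admins/incidents/index.rst; the file name, headings, and wording below are placeholders rather than an agreed-upon format:

```rst
.. Hypothetical path: docs/admins/incidents/YYYY-MM-DD-short-title.rst

================================
Incident: <short title> (<date>)
================================

Summary
=======
One or two sentences: what broke, which hubs were affected, the user
impact, and how long the outage lasted.

Timeline
========
All times in US/Pacific.

* HH:MM -- first report of the problem
* HH:MM -- mitigation applied
* HH:MM -- service restored

Action items
============
* Short-term fix to prevent immediate recurrence (owner, issue link)
* Longer-term change to remove the underlying cause (owner, issue link)
```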
@ryanlovett For future incidents, I will create an incident report template that any admin with insight into the outage can fill in. That would make it easy to start an AAR as soon as an outage happens.
However, what about the outages reported during Fall '22? Do we want to create one incident report that collectively summarizes the learnings and scopes the next steps? I am not sure whether writing an individual AAR for each outage is feasible given the scope of work required.
Possibly this can be a discussion item for the Monthly Sprint Planning meeting.
Ideally each incident would have a separate report since there are often different factors. This semester there were outages due to core nodes, image pulling delays, and the file server. The problem with creating reports too far after the fact is that our memories are hazy.
Are AARs and incident reports the same thing? Our previous incident reports contained an "action items" section, which sounds similar to "After Action" reports. Is an AAR part of an RTL protocol? Wherever the action items are placed, it'd be good if they could all be found in a single place.
@ryanlovett Apologies for using AAR and incident report interchangeably; I meant the same thing by both. There is an RTL protocol for sharing a detailed outage template with relevant information with leadership; however, that is more about the logistics of resolving the outage than the technical specifics covered in an incident report.
If @shaneknapp has the bandwidth and is interested, we can publish an incident report for Fall '22 that outlines the issues caused by a) core nodes, b) the file server, and c) image pulling delays, along with the near-term steps we took to resolve each outage and our long-term plans to eliminate the causes of such outages.
Review the data once again!
All,
I wanted to collate all the information about the outages across the various hubs in a single place. I see this as useful from multiple perspectives,