berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Collating information about outages for Incident Reports #2791

Open balajialg opened 2 years ago

balajialg commented 2 years ago

All,

I wanted to collate all the information about outages across the various hubs in a single place. I see this as useful from multiple perspectives:

  1. Help us write consolidated incident reports in the future
  2. Verify @felder's GCP queries
  3. Evaluate whether an outage was caused by an issue we have already fixed
Date | Hubs/Services Affected by the Outage | User Impact | Reasons
-- | -- | -- | --
August 20th, 2021 | Datahub RStudio, dlab.datahub RStudio | 300+ students as part of the R workshop | Due to this issue (https://github.com/berkeley-dsep-infra/datahub/issues/2585), this [PR](https://github.com/berkeley-dsep-infra/datahub/pull/2588) was created. Jupyter Client went through a major upgrade which broke the system.
August 26th, 2021 (first day of class) | R Hub, Datahub | Stats and Econ students were not able to log into their hubs | Due to this issue (https://github.com/berkeley-dsep-infra/datahub/issues/2628), this PR (https://github.com/berkeley-dsep-infra/datahub/pull/2629) was created. Related to the blocking request for course scope through Canvas.
September 2nd, 2021 | Data 100 | Around 10+ students | This issue (https://github.com/berkeley-dsep-infra/datahub/issues/2688) was due to the addition of the voila package.
September 13th, 2021 | Prob 140 | No data on the impact of the outage | The DB was full due to logs. See the PR that fixed this issue (https://github.com/berkeley-dsep-infra/datahub/pull/2749).
September 16th, 2021 | Data 100 | 50+ students reported issues with their hub instance | The hub restarted with a delay after a PR (https://github.com/berkeley-dsep-infra/datahub/pull/2768) got merged, resulting in an interim outage for users.
September 29th, 2021 | EECS Hub | All students in the EECS 16A lab reported memory-related issues with their hub instance | The NFS disk was full, resulting in this error. Issue description and solution can be found in #2808.
October 19th, 2021 | R Hub | All users of the R Hub | A storage problem with the hub resulted in this error. Issue description and solution can be found in #2902.
January 20th, 2022 | Many hubs | Most GSIs across multiple hubs | For more information, refer [here](https://hackmd.io/@S8IzWL8vR8OscKIEt7sXjA/HkMqNEvat/edit).
February 2nd, 2022 | Data 100 hub | Minor outage for a few students | A PR merged to prod triggered the pods to be knocked out of the hub.
August 8, 2022 | All hubs | Outage that affected all hubs, including Data 8 students | @yuvipanda fixed the core node issue by killing the core node, which resulted in the outage.
August 23, 2022 | Data 100 hub | Outage that affected some Data 100 instructors and students |
Sep 5, 2022 | Data 100 hub | Outage that affected a few Data 100 students |
Sep 11, 2022 | Data 100 hub, Biology Hub | Outage that affected all hubs |
Sep 12, 2022 | Stat 20 hub, R Hub | Outage that affected students using the R Hub | Issue details are in #3740.
Sep 14, 2022 | Data 102 hub | Outage that affected a few students |
Sep 15, 2022 | Stat 20 hub, Data 100 hub, R Hub | Outage that affected all the hubs |
Sep 18, 2022 | Data 100 hub | Outage that affected all the hubs due to an NFS server issue | An NFS restart brought the hubs back.
Oct 7, 2022 | All hubs | Hubs down due to an NFS server issue, which affected all users for a short period of time | Yuvi restarted the NFS server, which brought the hubs back.
Oct 9, 2022 | All hubs | Hubs down due to an NFS server issue, which affected all users for a short period of time | Yuvi restarted the NFS server, which brought the hubs back.
Oct 10, 2022 | All hubs | A PR moving all hubs from NFS v4.0 to v3.0 resulted in a crash that affected all users for a short period of time | Reverted back to the original state.
Oct 12, 2022 | Stat 20 hub | Start-up times went really high for the Stat 20 hub; https://github.com/berkeley-dsep-infra/datahub/issues/3836 is tracking this | Yuvi moved Stat 20 to a different node pool altogether.
Oct 15, 2022 | Data 101 hub | Users reported a 403 error | A process to delete inactive users resulted in a race condition. Yuvi deleted the process, which brought the hub back.
Oct 27, 2022 | Data 8 hub | Data 8 hub users were not able to access their pods | Not able to recollect the reasons for this outage.
Oct 30, 2022 | All hubs | Hubs were unusable for a short duration of time | Yuvi drained the nodes which had all the affected pods.
Nov 11, 2022 | Data 8, Data 100, and Data 101 hubs | Hubs were unusable for most users for a short duration of time | Outage due to a node autoscaler issue, which is highlighted in #3935.
Dec 2, 2022 | All hubs | Hubs were unavailable for all users for a period of 2 hours | Outage due to an nginx-related issue.
Feb 24, 2023 | All hubs | Hubs were unavailable for all users for a period of 30 mins; particularly disruptive for the Data 8 hub | Outage due to an nginx-related issue.
Sep 30, 2023 | All hubs | Hubs were unavailable for all users for a period of 40 mins | Outage due to a TCP OOM/nginx-related issue.
Dec 4, 2023 | All hubs (down for 10-12 mins) | Hubs were unavailable for all users for 15 mins | Outage due to a TCP memory-related issue.
Dec 5, 2023 | All hubs (down for 35 mins) | Hubs were unavailable for all users from 11:10 - 11:40 PM | Outage due to a TCP memory-related issue.
Feb 7 and 21, 2024 | Datahub, Data 100, Data 8, Prob 140 hubs | Users got a "white screen" when they tried to log into Datahub. Clearing the cache, restarting the server, or using an incognito window or another browser were the available workarounds | There is no clarity around the reason for the issue. Piloting a fork of CHP is considered a possibility, but there is no definitive evidence around the root cause.
Feb 23, 2024 | All hubs | The core node restart caused intermittent outages 5 times between 8:30 - 9 PM | The core node was being autoscaled down from 1 to 0, which had the effect of killing and restarting ALL hub pods. @shaneknapp disabled autoscaling in the `core` node pool and pinned the node pool size to 1. Since then, we haven't had this issue again.
April 5, 2024 | Data 8, Data 100, and Data 101 hubs, among others | Users were affected while pulling notebooks from GitHub repositories via nbgitpuller links | The JupyterHub upgrade to 4.1.4 and the nbgitpuller upgrade to the latest version (1.2.1) broke nbgitpuller functionality. Downgrading JupyterHub from 4.1.4 and nbgitpuller to 1.1.0 fixed the issue.
balajialg commented 1 year ago

@ryanlovett You had ideas about combining this issue with pre-existing after-action reports. Do you think #3539 seems like a viable next step, or did you have something else in mind?

ryanlovett commented 1 year ago

@balajialg I think every incident should be followed up by a blameless incident report, https://github.com/berkeley-dsep-infra/datahub/blob/staging/docs/admins/incidents/index.rst. Perhaps when there are outages, you can create a GitHub issue that tracks the creation of the incident report and assign it to the admin with the most insight into it.

The reports should follow a template with a summary, timeline, and action items to prevent the issue from recurring. They should be published in the docs.
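For illustration, here is a minimal sketch of what such a template could look like in the docs' reStructuredText format; the file name (`docs/admins/incidents/template.rst`) and the exact wording are placeholders, only the summary/timeline/action-items structure described above is the point:

```rst
.. Hypothetical skeleton for docs/admins/incidents/template.rst

Incident report: <short title>, <YYYY-MM-DD>
=============================================

Summary
-------
One or two sentences: which hubs were affected, roughly how many users,
how long the outage lasted, and how it was resolved.

Timeline
--------
All times in US/Pacific.

* HH:MM - first user report or alert
* HH:MM - root cause identified
* HH:MM - fix applied, service restored

Action items
------------
* Short-term mitigation (link to the issue/PR)
* Longer-term change to prevent recurrence (link to the issue/PR)
```

New reports could then be added under `docs/admins/incidents/` and linked from `index.rst` alongside the existing ones.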

balajialg commented 1 year ago

@ryanlovett For the future, I will create an incident report template that any of the admins with insight can fill in. That would make it easy to start an AAR when an outage happens.

However, what about the outages reported during Fall '22? Do we want to create one incident report that collectively summarizes the learnings and scopes next steps? I am not sure whether doing individual AARs is feasible given the scope of work required.

Possibly this can be a discussion item for the Monthly Sprint Planning meeting.

ryanlovett commented 1 year ago

Ideally each incident would have a separate report since there are often different factors. This semester there were outages due to core nodes, image pulling delays, and the file server. The problem with creating reports too far after the fact is that our memories are hazy.

Are AARs and incident reports the same thing? Our previous incident reports contained an "action items" section, which sounds similar to "After Action" reports. Is an AAR part of an RTL protocol? Wherever the action items are placed, it'd be good if they can all be found in a single place.

balajialg commented 1 year ago

@ryanlovett Apologies for using AAR and incident report interchangeably to mean the same thing. There is an RTL protocol for sharing a detailed outage template with relevant information with leadership. However, that is more about the logistics of resolving the outage; it doesn't focus on the technical specifics of the incident report.

If @shaneknapp has the bandwidth and is interested, then we can publish an incident report for Fall '22 that outlines the issues due to a) core nodes, b) the file server, and c) image pulling delays, the steps we took in the near term to resolve the outages, and our long-term plans to eliminate the causes of such outages.

balajialg commented 1 year ago

Review the data once again!