2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org

[Incident] OceanHackWeek hub cannot start server #1616

Closed: sgibson91 closed this issue 2 years ago

sgibson91 commented 2 years ago

Summary

OceanHackWeek hub cannot start a server. Reported in https://2i2c.freshdesk.com/a/tickets/172

Impact on users

Hub unusable as no one can start a server.

Important information

Tasks and updates

After-action report template:

```
# After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should take.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in {{ most convenient timezone }}.

- {{ yyyy-mm-dd }} - [Summary of first update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of another update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of final update](link to comment)

## What went wrong

_Things that could have gone better. Ideally these should result in concrete action items that have GitHub issues created for them and linked to under Action items._

- Thing one
- Thing two

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

- Thing one
- Thing two

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`, they can be in other repositories._

### Process improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Documentation improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Technical improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]
```
sgibson91 commented 2 years ago

Problem

Requesting a server only results in the "Server Requested" log line and does not proceed any further. Alex reported in the ticket that this eventually changes to "Your server is stopping, you will be able to start it again once it has finished stopping."

[Screenshot from 2022-08-10 at 10:25 showing the spawn page stuck at "Server Requested"]

Recreation

I logged in and tried to start a server myself, and also got no further than the "Server Requested" log line. I have not yet seen the "Your server is stopping..." message.

I activated kubectl access to the hub and looked at the events of my pod with kubectl describe pod jupyter-sgibson91 which revealed:

Events:
  Type    Reason     Age   From                                   Message
  ----    ------     ----  ----                                   -------
  Normal  Scheduled  17m   gke.io/optimize-utilization-scheduler  Successfully assigned ohw/jupyter-sgibson91 to gke-pilot-hubs-cluster-nb-ohw-e4ba9924-lj4s
  Normal  Pulled     17m   kubelet                                Container image "ghcr.io/oceanhackweek/python:d98b914" already present on machine
  Normal  Created    17m   kubelet                                Created container notebook
  Normal  Started    17m   kubelet                                Started container notebook

This indicates to me that the server started successfully, but none of the logs were streamed to the spawning page, and the redirection to the server did not occur.

I tried stopping my server from the UI and then deleting the pod; this is when I saw the "Your server is stopping..." message, and I appear to be stuck there.
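For the record, "deleting the pod" was along these lines (pod name taken from the events above):

```
# Delete my stuck user pod in the ohw namespace
kubectl -n ohw delete pod jupyter-sgibson91
```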

All other pods in the namespace are running:

NAME                                           READY   STATUS    RESTARTS   AGE
api-ohw-dask-gateway-b45486c7b-62m9j           1/1     Running   0          12d
continuous-image-puller-44pjd                  1/1     Running   0          47h
controller-ohw-dask-gateway-7f5dd9b6dd-ln9m7   1/1     Running   0          48d
hub-86b885f486-45pms                           2/2     Running   1          5d1h
jupyter-abkfenris                              1/1     Running   0          4h7m
jupyter-almacarolina                           1/1     Running   0          46h
jupyter-anujjain2579                           1/1     Running   0          10h
jupyter-clairedavies                           1/1     Running   0          8h
jupyter-gmanuch                                1/1     Running   0          17h
jupyter-jinjintwice                            1/1     Running   0          9h
jupyter-leonardolaiolo                         1/1     Running   0          8h
jupyter-noraloose                              1/1     Running   0          19m
proxy-5bc58b774d-nmx9j                         1/1     Running   0          34d
traefik-ohw-dask-gateway-694d9776f6-48hxg      1/1     Running   0          40d
sgibson91 commented 2 years ago

I tried restarting the hub pod (by deleting it) and that resolved the issue; I can now start a server. I'm not entirely sure what happened, though.
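For anyone hitting this later, the restart was essentially the following (a sketch; assumes the standard zero-to-jupyterhub `component=hub` label on the hub pod):

```
# Delete the hub pod and let the hub deployment recreate it automatically.
kubectl -n ohw delete pod -l component=hub
```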

abkfenris commented 2 years ago

Do the events from the hub pod show any issues from probe failures?

sgibson91 commented 2 years ago

Unfortunately I don't know how to retrieve logs from before the restart. I tried the following with no luck:

❯ k logs -c hub hub-86b885f486-mwnqt --previous
Error from server (BadRequest): previous terminated container "hub" in pod "hub-86b885f486-mwnqt" not found
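(Noting for next time, untested here: events can outlive a deleted pod for a while, so something like this might still have shown probe failures or kill events for the old hub pod:)

```
# Recent events often survive for roughly an hour after a pod is deleted.
kubectl -n ohw get events --sort-by=.lastTimestamp | grep hub-
```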
abkfenris commented 2 years ago

I believe k describe pod hub-86b885f486-mwnqt would show probe events, but it looks like the JupyterHub health check that the probes hit is pretty shallow, so they may not be the most effective at catching when a hub pod has an issue.

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/548111f1b7d716b744c3abaa414b76c03b7eeed9/jupyterhub/templates/hub/deployment.yaml#L222-L229

https://github.com/jupyterhub/jupyterhub/blob/3b59c4861f155f868bcf29c00dfa78034d289950/jupyterhub/handlers/pages.py#L584-L592
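For reference, a quick way to see roughly what those probes see (a sketch; assumes the standard z2jh hub service on port 8081 and the /hub/health endpoint, and only tells you the hub process is answering, not that spawns are progressing):

```
# Hit the hub's health endpoint from inside the cluster via a throwaway pod.
kubectl -n ohw run curl-test --rm -i --restart=Never --image=curlimages/curl -- \
  curl -s -o /dev/null -w '%{http_code}\n' http://hub:8081/hub/health
```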

Do you ship the logs anywhere?

sgibson91 commented 2 years ago

After-action report

These sections should be filled out once we've resolved the incident and know what happened. They should focus on the knowledge we've gained and any improvements we should take.

Timeline

A short list of dates / times and major updates, with links to relevant comments in the issue for more context.

All times in BST (UTC+1).

What went wrong

Things that could have gone better. Ideally these should result in concrete action items that have GitHub issues created for them and linked to under Action items.

Where we got lucky

These are good things that happened to us but not because we had planned for them.

Follow-up actions

Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in infrastructure/, they can be in other repositories.

Process improvements

  1. A checklist of the "top 5" things to try when faced with an incident would help a lot, but it also needs steps that ensure any logs are preserved in the issue before destructive actions are taken and the logs are lost

Documentation improvements

  1. {{ summary }} [link to github issue]
  2. {{ summary }} [link to github issue]

Technical improvements

  1. We need to make the 2i2c cluster regional to prevent these kinds of outages
sgibson91 commented 2 years ago

Do you ship the logs anywhere?

Not to my knowledge

abkfenris commented 2 years ago

Do you ship the logs anywhere?

Not to my knowledge

Dang. GCP or whoever may have caught the logs anyway if you poke around in the console. Loki also collects them; I'm not sure if you have that set up as part of your Grafana.

sgibson91 commented 2 years ago

Yuvi has just informed me that we do indeed have logs in the GCP console! They're a bit hard to read, but there are a lot of CancelledErrors stemming from asyncio from before the incident was reported.

abkfenris commented 2 years ago

Try looking back before 7 or so Eastern. That's when I first had issues.

abkfenris commented 2 years ago

Does the hub pod share the NFS mount with the users? Could it have been affected by the same space issue as we hit yesterday, and we just didn't have anyone else try to launch a server after that fix rolled out?

sgibson91 commented 2 years ago

@abkfenris We suspect this is an issue with the availability of the k8s master. We are seeing spawning processes being cancelled, so it should not be related to the NFS. The 2i2c cluster is not regional, so it does not have a highly available k8s master; we see issues like this occasionally, and the fix is to restart the hub pod. We have an issue to move the cluster to be a regional one, but that would be a destructive process and we need to coordinate appropriate downtime with everyone who has a hub running on this cluster: https://github.com/2i2c-org/infrastructure/issues/1102

yuvipanda commented 2 years ago

I found these logs in another hub that had the same symptoms at the same time:

client_session: <aiohttp.client.ClientSession object at 0x7f76cb1a4790>
[E 2022-08-10 04:55:25.413 JupyterHub reflector:351] Watching resources never recovered, giving up
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/kubespawner/reflector.py", line 285, in _watch_and_update
        resource_version = await self._list_and_update()
      File "/usr/local/lib/python3.9/site-packages/kubespawner/reflector.py", line 228, in _list_and_update
        initial_resources_raw = await list_method(**kwargs)
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
        response_data = await self.request(
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
        return (await self.request("GET", url,
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/rest.py", line 177, in request
        r = await self.pool_manager.request(**args)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/client.py", line 535, in _request
        conn = await self._connector.connect(
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 542, in connect
        proto = await self._create_connection(req, traces, timeout)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 907, in _create_connection
        _, proto = await self._create_direct_connection(req, traces, timeout)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
        raise last_exc
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
        transp, proto = await self._wrap_create_connection(
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
        raise client_error(req.connection_key, exc) from exc
    aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.3.240.1:443 ssl:default [Connect call failed ('10.3.240.1', 443)]

[C 2022-08-10 04:55:25.414 JupyterHub spawner:2326] Pods reflector failed, halting Hub.
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending name='Task-3' coro=<shared_client.<locals>.close_client_task() running at /usr/local/lib/python3.9/site-packages/kubespawner/clients.py:58> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f76cb113280>()]>>
Exception ignored in: <coroutine object shared_client.<locals>.close_client_task at 0x7f76d0662dc0>
RuntimeError: coroutine ignored GeneratorExit

This is definitely due to the GKE master having a hiccup. The master usually recovers shortly, but it looks like the jupyterhub process doesn't :( We should definitely report this upstream.
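For future incidents, one way to check whether the control plane was busy around that time (a sketch; the cluster name is inferred from the node names above and the filter may need adjusting):

```
# List recent GKE control-plane operations (upgrades, repairs, etc.) for the cluster.
gcloud container operations list \
  --project=two-eye-two-see \
  --filter='targetLink:pilot-hubs-cluster' \
  --sort-by=startTime
```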

yuvipanda commented 2 years ago

The hub also has a 'shut down' button in the admin panel that would fix this specific problem, where you just see 'server requested' and nothing happens.

abkfenris commented 2 years ago

The hub also has a 'shut down' button in the admin panel that would fix this specific problem, where you just see 'server requested' and nothing happens.

Do you mean 'Shutdown Hub' or 'Stop All' or the per-user 'Stop Server'? I had tried 'Stop Server' on my own server when it was in that state.

[Screenshot of the JupyterHub admin panel buttons]
sgibson91 commented 2 years ago

The "Shutdown hub" button will restart the hub

sgibson91 commented 2 years ago

FYI, location of the logs in GCP: https://console.cloud.google.com/logs/query;query=resource.type%3D%22k8s_container%22%0Aresource.labels.container_name%3D%22hub%22%0Aresource.labels.namespace_name%3D%22ohw%22;timeRange=2022-08-10T14:22:12.000Z%2F2022-08-10T15:30:12.000Z;cursorTimestamp=2022-08-10T14:55:01.930889041Z?authuser=1&project=two-eye-two-see
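The same query from the CLI, roughly (adjust the time window as needed):

```
# CLI equivalent of the console query above.
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.container_name="hub" AND resource.labels.namespace_name="ohw"' \
  --project=two-eye-two-see \
  --freshness=1d \
  --order=desc \
  --limit=200
```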

sgibson91 commented 2 years ago

I am going to close this issue now because:

  1. The incident is over
  2. The after-action report has been posted and updated
  3. All items in the after-action report have links to other issues for tracking
abkfenris commented 2 years ago

Just to clarify: the 'Shutdown Hub' button asks the hub pod to terminate itself, so that Kubernetes replaces it?

I ask because I caused an incident in high school when I found a shutdown & restart button on the compute cluster that I wasn't supposed to have access to. Unsurprisingly, I clicked it, and everything went down for hundreds of students and faculty. One broken screen later (a classmate put their fist through it), the cluster came back up because I had thankfully hit restart, but I spent the rest of the class period hanging out with the tech crew making sure no one else could find the same bug.

sgibson91 commented 2 years ago

Just to clarify: the 'Shutdown Hub' button asks the hub pod to terminate itself, so that Kubernetes replaces it?

Yes, that's correct!
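(As far as I know, the button is equivalent to calling the Hub's shutdown REST endpoint with an admin token; the URL below is just illustrative:)

```
# Roughly what the "Shutdown Hub" button does via the REST API.
# Requires an admin API token; replace the URL with your hub's address.
curl -X POST \
  -H "Authorization: token $ADMIN_TOKEN" \
  https://<your-hub-domain>/hub/api/shutdown
```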

abkfenris commented 2 years ago

We just (~12:26 Eastern) had the hub lock up again, but things seemed to recover immediately after hitting the 'Shutdown Hub' button.