Closed by sgibson91, 2 years ago
Requesting a server only produces the "Server Requested" log and does not proceed. Alex reported in the ticket that this eventually changes to "Your server is stopping, you will be able to start it again once it has finished stopping."
I logged in and tried to start a server myself and also got the "Server Requested" log and no further. I have not yet seen the "Your server is stopping..." message.
I activated kubectl access to the hub and looked at the events of my pod with `kubectl describe pod jupyter-sgibson91`, which revealed:
```
Events:
  Type    Reason     Age  From                                   Message
  ----    ------     ---- ----                                   -------
  Normal  Scheduled  17m  gke.io/optimize-utilization-scheduler  Successfully assigned ohw/jupyter-sgibson91 to gke-pilot-hubs-cluster-nb-ohw-e4ba9924-lj4s
  Normal  Pulled     17m  kubelet                                Container image "ghcr.io/oceanhackweek/python:d98b914" already present on machine
  Normal  Created    17m  kubelet                                Created container notebook
  Normal  Started    17m  kubelet                                Started container notebook
```
This indicates to me that the server started successfully, but none of the logs were streamed to the spawning page, and indeed the redirection to the server did not occur.
I tried stopping my server from the UI and then deleting the pod, and this is when I saw the "Your server is stopping..." message and appeared to be stuck there.
All other pods in the namespace are running:
```
NAME                                           READY   STATUS    RESTARTS   AGE
api-ohw-dask-gateway-b45486c7b-62m9j           1/1     Running   0          12d
continuous-image-puller-44pjd                  1/1     Running   0          47h
controller-ohw-dask-gateway-7f5dd9b6dd-ln9m7   1/1     Running   0          48d
hub-86b885f486-45pms                           2/2     Running   1          5d1h
jupyter-abkfenris                              1/1     Running   0          4h7m
jupyter-almacarolina                           1/1     Running   0          46h
jupyter-anujjain2579                           1/1     Running   0          10h
jupyter-clairedavies                           1/1     Running   0          8h
jupyter-gmanuch                                1/1     Running   0          17h
jupyter-jinjintwice                            1/1     Running   0          9h
jupyter-leonardolaiolo                         1/1     Running   0          8h
jupyter-noraloose                              1/1     Running   0          19m
proxy-5bc58b774d-nmx9j                         1/1     Running   0          34d
traefik-ohw-dask-gateway-694d9776f6-48hxg      1/1     Running   0          40d
```
I tried restarting the hub pod (by deleting it) and that has resolved the issue; I can now start a server. Not entirely sure what happened, though.
Do the events from the hub pod show any issues from probe failures?
Unfortunately I don't know how to retrieve logs from before the restart. I tried the following with no luck:
```
❯ k logs -c hub hub-86b885f486-mwnqt --previous
Error from server (BadRequest): previous terminated container "hub" in pod "hub-86b885f486-mwnqt" not found
```
I believe `k describe pod hub-86b885f486-mwnqt` would show probe events, but it looks like the JupyterHub health check that the probes hit is pretty shallow, so they may not be the most effective at catching when a hub pod has an issue.
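To illustrate why a shallow health check misses this failure mode: JupyterHub's health endpoint only tells the probes that the hub process is responding, not whether its connection to the Kubernetes API is alive. A rough sketch of what a deeper probe target could look like (every name below is hypothetical, not JupyterHub's actual API):

```python
# Hypothetical sketch of a "deeper" liveness check than JupyterHub's
# /hub/health endpoint, which only confirms the hub process responds.

def hub_responds() -> bool:
    # Placeholder: in practice this would GET http://hub:8081/hub/health.
    return True

def k8s_api_reachable() -> bool:
    # Placeholder: in practice this would make a cheap read against the
    # Kubernetes API server (e.g. list pods in the hub's namespace),
    # which is exactly the call that failed during this incident.
    return True

def deep_health_check(checks=(hub_responds, k8s_api_reachable)) -> bool:
    # Only report healthy when every dependency check passes, so that
    # losing the connection to the k8s master surfaces as a probe failure
    # and Kubernetes restarts the hub pod automatically.
    return all(check() for check in checks)

print(deep_health_check())  # True while both placeholder checks pass
```

With a check like this wired into the liveness probe, the "restart the hub pod" fix would have been applied by the kubelet rather than by hand.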
Do you ship the logs anywhere?
# After-action report

These sections should be filled out once we've resolved the incident and know what happened. They should focus on the knowledge we've gained and any improvements we should take.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in BST (UTC+1).

## What went wrong

_Things that could have gone better. Ideally these should result in concrete action items that have GitHub issues created for them and linked to under Action items._

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`, they can be in other repositories._
> Do you ship the logs anywhere?
Not to my knowledge
Dang. GCP or whoever may have caught the logs anyway if you poke around in the console. Loki also collects them; not sure if you have that set up as part of your Grafana.
Yuvi has just informed me that we do indeed have logs in the GCP console! They're a bit hard to read, but there are a lot of `CancelledError`s stemming from asyncio from before the incident report.
Try looking back before 7 or so Eastern. That's when I first had issues.
Does the hub pod share the NFS mount with the users? Could it have been affected by the same space issue as we hit yesterday, and we just didn't have anyone else try to launch a server after that fix rolled out?
@abkfenris We suspect this is an issue with the availability of the k8s master. We are seeing spawning processes being cancelled, which should not be related to the NFS. The 2i2c cluster is not regional, so it does not have a highly available k8s master, and we see issues like this occasionally; the fix is to restart the hub pod. We have an issue to move the cluster to be a regional one, but it would be a destructive process and we need to coordinate appropriate downtime with everyone who has a hub running on this cluster: https://github.com/2i2c-org/infrastructure/issues/1102
I found these logs in another hub that had the same symptoms at the same time:
```
client_session: <aiohttp.client.ClientSession object at 0x7f76cb1a4790>
[E 2022-08-10 04:55:25.413 JupyterHub reflector:351] Watching resources never recovered, giving up
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/kubespawner/reflector.py", line 285, in _watch_and_update
    resource_version = await self._list_and_update()
  File "/usr/local/lib/python3.9/site-packages/kubespawner/reflector.py", line 228, in _list_and_update
    initial_resources_raw = await list_method(**kwargs)
  File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
    response_data = await self.request(
  File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
    return (await self.request("GET", url,
  File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/rest.py", line 177, in request
    r = await self.pool_manager.request(**args)
  File "/usr/local/lib/python3.9/site-packages/aiohttp/client.py", line 535, in _request
    conn = await self._connector.connect(
  File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 542, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 907, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
  File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
    raise last_exc
  File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
  File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.3.240.1:443 ssl:default [Connect call failed ('10.3.240.1', 443)]
[C 2022-08-10 04:55:25.414 JupyterHub spawner:2326] Pods reflector failed, halting Hub.
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending name='Task-3' coro=<shared_client.<locals>.close_client_task() running at /usr/local/lib/python3.9/site-packages/kubespawner/clients.py:58> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f76cb113280>()]>>
Exception ignored in: <coroutine object shared_client.<locals>.close_client_task at 0x7f76d0662dc0>
RuntimeError: coroutine ignored GeneratorExit
```
This is definitely due to the GKE master having a hiccup. It usually recovers shortly, but it looks like the jupyterhub process isn't :( We should definitely report this upstream.
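The traceback above shows kubespawner's reflector retrying its watch against the API server and then giving up for good, at which point the hub halts itself. A minimal sketch of that retry-then-give-up pattern (heavily simplified, not kubespawner's actual implementation):

```python
import asyncio

async def watch_with_retries(list_resources, max_retries=3, base_delay=0.01):
    # Retry the list/watch call with exponential backoff. If the API
    # server never comes back within max_retries, give up permanently --
    # the real reflector logs "Watching resources never recovered,
    # giving up" and the hub halts itself at that point.
    for attempt in range(max_retries):
        try:
            return await list_resources()
        except ConnectionError:
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("Watching resources never recovered, giving up")

async def main():
    calls = 0

    async def flaky_list():
        # Simulates the GKE master being unreachable for the first two
        # calls, then recovering -- the "short hiccup" case.
        nonlocal calls
        calls += 1
        if calls < 3:
            raise ConnectionError("Cannot connect to host 10.3.240.1:443")
        return ["jupyter-sgibson91"]

    print(await watch_with_retries(flaky_list))  # ['jupyter-sgibson91']

asyncio.run(main())
```

A short hiccup is absorbed by the backoff, but an outage longer than the retry budget permanently kills the hub even though the master later recovers, which matches what was observed here.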
The hub also has a 'shut down' button in the admin panel that would also fix this specific problem, where you just see 'server requested' and nothing happens.
> The hub also has a 'shut down' button in the admin panel that would also fix this specific problem, where you just see 'server requested' and nothing happens.
Do you mean 'Shutdown Hub', 'Stop All', or the per-user 'Stop Server'? I had tried 'Stop Server' on my own server when it was in that state.
The "Shutdown hub" button will restart the hub
I am going to close this issue now because:
Just to clarify, the 'Shutdown Hub' button asks the hub pod to terminate itself, so that Kubernetes replaces it?
I ask because I caused an incident in high school when I found a shutdown & restart button on the compute cluster that I wasn't supposed to have access to. Unsurprisingly I clicked it, and everything went down for hundreds of students and faculty. One broken screen later (a classmate put their fist through it), it came back up as I thankfully hit restart, but I was hanging out with the tech crew for the rest of the class period making sure that no one else could find the same bug.
> Just to clarify, the 'Shutdown Hub' button asks the hub pod to terminate itself, so that Kubernetes replaces it?
Yes, that's correct!
We just (~12:26 Eastern) had the hub lock up again, but things seemed to recover immediately after hitting the 'Shutdown Hub' button.
## Summary

OceanHackWeek hub cannot start a server. Reported in https://2i2c.freshdesk.com/a/tickets/172

## Impact on users

Hub unusable as no one can start a server.

## Important information

## Tasks and updates
After-action report template

```
# After-action report

These sections should be filled out once we've resolved the incident and know what happened. They should focus on the knowledge we've gained and any improvements we should take.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in {{ most convenient timezone }}.

- {{ yyyy-mm-dd }} - [Summary of first update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of another update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of final update](link to comment)

## What went wrong

_Things that could have gone better. Ideally these should result in concrete action items that have GitHub issues created for them and linked to under Action items._

- Thing one
- Thing two

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

- Thing one
- Thing two

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`, they can be in other repositories._

### Process improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Documentation improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Technical improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]
```