Also just had my JupyterHub crash while running some cells at 11:21 PM Pacific on Monday 1/17. Getting a white screen and a "Service Unavailable" note.
Hey @lottspot, after you wrap up #815, could you look into this? ETA: James will prioritize investigating this week.
In addition to noticing the same unresponsiveness and 503 error that Eric reported on Friday 1/14, I'm seeing the same thing now, Wednesday 1/19, around 11:45 AM ET.
I can see in the inbound traffic logs that there were 503s for user charlie-costanzo around 10:17 AM EST, but I don't have insight into the user container logs around that time, as those errors seem to have coincided with a restart of the user container. I do have logged errors related to the reported 11:45 AM EST event, though:
[E 2022-01-19 16:45:25.539 SingleUserLabApp mixins:579] Error notifying Hub of activity
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/jupyterhub/singleuser/mixins.py", line 577, in notify
await client.fetch(req)
tornado.simple_httpclient.HTTPTimeoutError: Timeout during request
Not yet clear on the cause
I got that a bit backwards, actually: the 503s I was looking at were API calls dispatched from the user container to the hub, so the error message logged above is just another view of the same reported symptom.
I am getting a white screen with nothing loading right now, 3:08 pm Eastern 1/19/22.
I think I got the "Launch Server" prompt, clicked the button to launch, and then got the white screen with nothing loading. I did not have a chance to open or run any notebooks or anything.
@lauriemerrell I'm not sure your issue is related; I'm not seeing the same server-side errors for your user.
I can see that the hub restarted at 11:44; trying to track down cluster events related to this restart
{
"textPayload": "[I 2022-01-19 16:44:37.946 JupyterHub proxy:347] Checking routes",
"insertId": "3do3l97kr2ehd1dz",
"resource": {
"type": "k8s_container",
"labels": {
"project_id": "cal-itp-data-infra",
"pod_name": "hub-8458cb5678-vpjvq",
"cluster_name": "data-infra-apps",
"namespace_name": "jupyterhub",
"container_name": "hub",
"location": "us-west1"
}
},
"timestamp": "2022-01-19T16:44:38.226238312Z",
"severity": "ERROR",
"labels": {
"k8s-pod/hub_jupyter_org/network-access-proxy-http": "true",
"k8s-pod/hub_jupyter_org/network-access-proxy-api": "true",
"k8s-pod/hub_jupyter_org/network-access-singleuser": "true",
"k8s-pod/pod-template-hash": "8458cb5678",
"k8s-pod/release": "jupyterhub",
"compute.googleapis.com/resource_name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
"k8s-pod/app": "jupyterhub",
"k8s-pod/component": "hub"
},
"logName": "projects/cal-itp-data-infra/logs/stderr",
"receiveTimestamp": "2022-01-19T16:45:06.196846686Z"
}
{
"textPayload": "Loading /usr/local/etc/jupyterhub/secret/values.yaml",
"insertId": "gr4iccadm3007qbo",
"resource": {
"type": "k8s_container",
"labels": {
"location": "us-west1",
"project_id": "cal-itp-data-infra",
"container_name": "hub",
"pod_name": "hub-8458cb5678-lrsb7",
"namespace_name": "jupyterhub",
"cluster_name": "data-infra-apps"
}
},
"timestamp": "2022-01-19T17:01:42.207315119Z",
"severity": "INFO",
"labels": {
"k8s-pod/hub_jupyter_org/network-access-proxy-api": "true",
"k8s-pod/hub_jupyter_org/network-access-singleuser": "true",
"k8s-pod/component": "hub",
"k8s-pod/hub_jupyter_org/network-access-proxy-http": "true",
"k8s-pod/release": "jupyterhub",
"k8s-pod/pod-template-hash": "8458cb5678",
"k8s-pod/app": "jupyterhub",
"compute.googleapis.com/resource_name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p"
},
"logName": "projects/cal-itp-data-infra/logs/stdout",
"receiveTimestamp": "2022-01-19T17:01:50.065449382Z"
}
I usually get the 503 white screen after I've printed too many maps in a loop, kicking off a 413(?) file save error.
Yesterday, my JupyterHub stopped working around 2-3 with the message "503 Service Temporarily Unavailable". It began working again around 9 today, but stopped again around 12 today without an error message. Now I either see just the screen paused at the logo, or a completely white screen, or it loads far enough that I can create notebooks/open terminals/etc., but all of my files are missing and it won't let me change directories to data-analyses.
A bit more detail on the JupyterHub issues I had on Monday at 11:21 PM Pacific: the file save error came up first; after a few rounds of that, I got a prompt to reconnect to the hub in a new tab, which brought me to an all-white screen (similar to @lauriemerrell) with the "Service Unavailable" note in the top left corner. JupyterHub worked again after restarting and logging back into the hub.
Today my JupyterHub also stopped working with the 503 error around 9:00 AM Pacific, after initially working from 8:00 to 8:45 AM Pacific. Later in the day, around 12:00 PM Pacific, my cells were not producing output when I ran them, but the notebook was saving and my terminal was working.
It appears this morning's incident stems from the cluster losing connectivity to the node gke-data-infra-apps-apps-v1-f87e3b02-e65p.
The smoking gun:
{
"insertId": "uf7wfg29qsy8o",
"jsonPayload": {
"reportingInstance": "",
"message": "0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict.",
"reportingComponent": "",
"reason": "FailedScheduling",
"kind": "Event",
"involvedObject": {
"namespace": "jupyterhub",
"kind": "Pod",
"uid": "27ec1a2f-4cd2-4f33-9176-f34f6d010aab",
"name": "hub-8458cb5678-lrsb7",
"resourceVersion": "167872845",
"apiVersion": "v1"
},
"eventTime": null,
"source": {
"component": "default-scheduler"
},
"metadata": {
"uid": "17d925a3-1e02-47f6-8f7e-fa5136b06889",
"managedFields": [
{
"manager": "kube-scheduler",
"fieldsType": "FieldsV1",
"apiVersion": "v1",
"fieldsV1": {
"f:count": {},
"f:type": {},
"f:lastTimestamp": {},
"f:firstTimestamp": {},
"f:message": {},
"f:source": {
"f:component": {}
},
"f:reason": {},
"f:involvedObject": {
"f:kind": {},
"f:apiVersion": {},
"f:resourceVersion": {},
"f:namespace": {},
"f:uid": {},
"f:name": {}
}
},
"time": "2022-01-19T16:51:12Z",
"operation": "Update"
}
],
"namespace": "jupyterhub",
"resourceVersion": "215371",
"name": "hub-8458cb5678-lrsb7.16cbba1156d37847",
"creationTimestamp": "2022-01-19T16:51:12Z",
"selfLink": "/api/v1/namespaces/jupyterhub/events/hub-8458cb5678-lrsb7.16cbba1156d37847"
},
"apiVersion": "v1",
"type": "Warning"
},
"resource": {
"type": "k8s_pod",
"labels": {
"project_id": "cal-itp-data-infra",
"cluster_name": "data-infra-apps",
"pod_name": "hub-8458cb5678-lrsb7",
"location": "us-west1",
"namespace_name": "jupyterhub"
}
},
"timestamp": "2022-01-19T16:52:24Z",
"severity": "WARNING",
"logName": "projects/cal-itp-data-infra/logs/events",
"receiveTimestamp": "2022-01-19T16:52:29.622209076Z"
}
That error means that the hub pod is tightly bound to its node (a limitation identified in #448) and the one node on which it can be placed is unreachable/down. I was able to identify which node that is by simply checking which node it is currently running on, since it eventually came back to that node.
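For context on the "volume node affinity conflict" part of that message: a dynamically provisioned GCE persistent disk PV carries a node affinity rule roughly like the sketch below, so any pod mounting it can only be scheduled onto nodes that match the selector (typically nodes in the disk's zone). All names and values here are illustrative placeholders, not pulled from our cluster.

```yaml
# Illustrative only -- what the nodeAffinity stanza on a dynamically
# provisioned GCE PD PersistentVolume typically looks like. A pod that
# mounts this volume can only be scheduled onto nodes matching the
# selector below, which is why losing those nodes strands the pod.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example-hub-db            # hypothetical name
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: example-hub-db-disk       # hypothetical disk name
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-west1-a          # only nodes here are eligible
```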
About half an hour before this scheduling failure, we can see that a kernel parameter was changed on the node in question:
{
"insertId": "ozhkanfp5oa81",
"jsonPayload": {
"metadata": {
"creationTimestamp": "2022-01-19T16:23:47Z",
"managedFields": [
{
"manager": "node-problem-detector",
"time": "2022-01-19T16:23:47Z",
"fieldsV1": {
"f:type": {},
"f:reason": {},
"f:lastTimestamp": {},
"f:firstTimestamp": {},
"f:involvedObject": {
"f:uid": {},
"f:kind": {},
"f:name": {}
},
"f:count": {},
"f:source": {
"f:host": {},
"f:component": {}
}
},
"apiVersion": "v1",
"operation": "Update",
"fieldsType": "FieldsV1"
}
],
"selfLink": "/api/v1/namespaces/default/events/gke-data-infra-apps-apps-v1-f87e3b02-e65p.16cb7da2d16c6ff6",
"resourceVersion": "215211",
"namespace": "default",
"name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p.16cb7da2d16c6ff6",
"uid": "9214f827-b12e-4775-80b3-0fbab1604f2f"
},
"reportingComponent": "",
"apiVersion": "v1",
"type": "Warning",
"source": {
"component": "sysctl-monitor",
"host": "gke-data-infra-apps-apps-v1-f87e3b02-e65p"
},
"reportingInstance": "",
"involvedObject": {
"name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
"kind": "Node",
"uid": "gke-data-infra-apps-apps-v1-f87e3b02-e65p"
},
"reason": "NodeSysctlChange",
"eventTime": null,
"kind": "Event"
},
"resource": {
"type": "k8s_node",
"labels": {
"node_name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
"cluster_name": "data-infra-apps",
"location": "us-west1",
"project_id": "cal-itp-data-infra"
}
},
"timestamp": "2022-01-19T16:23:47Z",
"severity": "WARNING",
"logName": "projects/cal-itp-data-infra/logs/events",
"receiveTimestamp": "2022-01-19T16:23:52.472848734Z"
}
Based on this evidence, I think today's issue was at the GKE infrastructure level (i.e., transient). I'm not sure yet about the 1/17 and 1/14 instances of the problem, which I should still have logging for; looking into those.
These log entries don't line up time-wise with the 1/14 report, but they are almost certainly a hint about what's going on. On the night of 1/17, the node gke-data-infra-apps-apps-v1-f87e3b02-e65p OOM-killed what I can only assume was a JupyterHub-related process:
{
"insertId": "1e88lpdf9mgv7x",
"jsonPayload": {
"source": {
"host": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
"component": "kubelet"
},
"eventTime": null,
"involvedObject": {
"uid": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
"name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
"kind": "Node"
},
"reportingComponent": "",
"message": "System OOM encountered, victim process: python, pid: 585821",
"reason": "SystemOOM",
"reportingInstance": "",
"kind": "Event",
"metadata": {
"creationTimestamp": "2022-01-18T07:06:41Z",
"uid": "4bc1c3d6-61d8-4174-af94-8ec200cd8fbc",
"name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p.16cb4b97097f1efb",
"selfLink": "/api/v1/namespaces/default/events/gke-data-infra-apps-apps-v1-f87e3b02-e65p.16cb4b97097f1efb",
"namespace": "default",
"resourceVersion": "212927",
"managedFields": [
{
"operation": "Update",
"time": "2022-01-18T07:06:41Z",
"fieldsV1": {
"f:type": {},
"f:involvedObject": {
"f:name": {},
"f:uid": {},
"f:kind": {}
},
"f:reason": {},
"f:source": {
"f:host": {},
"f:component": {}
},
"f:count": {},
"f:lastTimestamp": {},
"f:message": {},
"f:firstTimestamp": {}
},
"apiVersion": "v1",
"manager": "kubelet",
"fieldsType": "FieldsV1"
}
]
},
"type": "Warning",
"apiVersion": "v1"
},
"resource": {
"type": "k8s_node",
"labels": {
"cluster_name": "data-infra-apps",
"node_name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
"location": "us-west1",
"project_id": "cal-itp-data-infra"
}
},
"timestamp": "2022-01-18T07:06:41Z",
"severity": "WARNING",
"logName": "projects/cal-itp-data-infra/logs/events",
"receiveTimestamp": "2022-01-18T07:06:46.181175696Z"
}
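For what it's worth, one mitigation for node-level OOM kills like this would be to bound per-user memory in the z2jh chart values so a runaway notebook hits its own container limit instead of exhausting the node. A minimal sketch, with placeholder numbers rather than a recommendation:

```yaml
# Sketch only -- numbers are placeholders. `guarantee` maps to the
# container's memory request and `limit` to its memory limit, so a
# runaway kernel is killed inside its own cgroup rather than
# destabilizing the whole node.
singleuser:
  memory:
    guarantee: 2G
    limit: 8G
```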
Taking all of this together, I think we have multiple tiers of issues in play here.
gke-data-infra-apps-apps-v1-cd9a2855-ujw3). Again, a direct derivative of an unsolved #448. There are two different holistic solutions that come to mind for me:
I actually favor the second solution here. Workstations seem to require the ability to burst their memory usage significantly, in ways that the other workloads currently running on the cluster do not. It makes sense both to run workstations in a separate compute pool that can provide adequate memory for those moments and to isolate them so that those moments don't adversely affect other workloads, particularly those with stringent uptime requirements like the gtfs-rt-archiver (I already have reason to believe the archiver may have been affected by some of these events). The big drawback to this approach is additional cost. As a next step, I will need to spec out a cost estimate for adding such a pool and get approval to incur it. I've opened #962 for that task and will be closing this issue, since the investigation is complete.
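For reference, steering user pods onto a dedicated, taint-isolated pool could look roughly like the following in the z2jh chart values. This is a sketch only; the label and taint key/value are hypothetical placeholders, not anything currently applied to the cluster.

```yaml
# Sketch only -- the label and taint below are hypothetical and would
# need to be applied to the new user-workstation node pool for this
# scheduling configuration to take effect.
singleuser:
  nodeSelector:
    hub.jupyter.org/node-purpose: user      # label on the dedicated pool
  extraTolerations:
    - key: hub.jupyter.org/dedicated        # taint on the dedicated pool
      operator: Equal
      value: user
      effect: NoSchedule
```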
Describe the bug
While running some (intensive) analysis code around 3:45 Pacific on Friday 1/14, my JupyterHub stopped responding and would not reload.
Attempting to refresh the page yielded a white screen and a "waiting for hubtest.k8s.calitp.jarv.us" message from my browser before eventually reloading after 10 minutes, but... upon reloading, notebooks don't seem to render properly (screenshot attached).
To Reproduce
Unclear how to reproduce.
Expected behavior
Normal hub functionality.
Additional context