cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Bug: JupyterHub Unstable? #956

Closed edasmalchi closed 2 years ago

edasmalchi commented 2 years ago

Describe the bug: While running some (intensive) analysis code around 3:45 Pacific on Friday 1/14, my JupyterHub stopped responding and would not reload.

Attempting to refresh the page yielded a white screen and a "waiting for hubtest.k8s.calitp.jarv.us" message from my browser before the page eventually reloaded after about 10 minutes, but...

Upon reloading, notebooks don't seem to render properly (screenshot attached).

To Reproduce: Unclear how to reproduce.

Expected behavior: Normal hub functionality.

Additional context:

[Screenshot: strange_nb]

natam1 commented 2 years ago

Also just had my JupyterHub crash while running some cells at 11:21 pm Pacific on Monday 1/17. I'm getting a white screen and a "Service Unavailable" note.

holly-g commented 2 years ago

Hey @lottspot, after you wrap up #815, could you look into this? ETA: James will prioritize investigating this week.

charlie-costanzo commented 2 years ago

In addition to noticing the unresponsiveness and 503 error at the same time Eric did on Friday 1/14, I'm seeing the same thing now on Wed 1/19 around 11:45 am ET.

[Screenshot: Screen Shot 2022-01-19 at 11.49.16 AM]

lottspot commented 2 years ago

I can see in the inbound traffic logs that there were 503s happening for user charlie-costanzo around 10:17 AM EST, but I don't have insight into the user container logs around that time, as those errors seem to have coincided with a restart of the user container. I do have logged errors related to the reported 11:45 AM EST event, though:

[E 2022-01-19 16:45:25.539 SingleUserLabApp mixins:579] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.9/site-packages/jupyterhub/singleuser/mixins.py", line 577, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout during request

Not yet clear on the cause

lottspot commented 2 years ago

I got that a bit backwards, actually: the 503s I was looking at are API calls which were dispatched to the hub from the user container. So the error message logged above is just another view of the same reported symptom.
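
For context (a simplified approximation, not the actual jupyterhub mixins code): the single-user server periodically POSTs an activity timestamp to the hub's REST API, so a timeout or 503 on that call points at the hub, or the route to it, being unreachable rather than at the user server itself. A minimal sketch, assuming the standard JUPYTERHUB_* environment variables injected into user containers:

```python
# Simplified sketch of the activity notification that timed out in the
# traceback above (an approximation, not the actual jupyterhub mixins code).
import json
import os
from datetime import datetime, timezone

from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado.ioloop import IOLoop


async def notify_hub_activity():
    # JUPYTERHUB_* variables are set by the hub in every user container.
    user = os.environ["JUPYTERHUB_USER"]
    url = f"{os.environ['JUPYTERHUB_API_URL']}/users/{user}/activity"
    now = datetime.now(timezone.utc).isoformat()
    req = HTTPRequest(
        url,
        method="POST",
        headers={"Authorization": f"token {os.environ['JUPYTERHUB_API_TOKEN']}"},
        body=json.dumps({"last_activity": now, "servers": {"": {"last_activity": now}}}),
        request_timeout=60,  # when the hub is unreachable, this is the timeout that fires
    )
    # If the hub pod is down or unroutable, fetch() raises
    # tornado.simple_httpclient.HTTPTimeoutError, as in the log above.
    await AsyncHTTPClient().fetch(req)


if __name__ == "__main__":
    IOLoop.current().run_sync(notify_hub_activity)
```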

lauriemerrell commented 2 years ago

I am getting a white screen with nothing loading right now, 3:08 pm Eastern 1/19/22.

I think I got the "Launch Server" prompt, clicked the button to launch, and then got the white screen with nothing loading. I did not have a chance to open or run any notebooks or anything.

[Screenshot]

lottspot commented 2 years ago

@lauriemerrell I'm not sure your issue is related; I'm not seeing the same server-side errors for your user.

lottspot commented 2 years ago

I can see that the hub restarted at 11:44; I'm trying to track down cluster events related to this restart:

{
  "textPayload": "[I 2022-01-19 16:44:37.946 JupyterHub proxy:347] Checking routes",
  "insertId": "3do3l97kr2ehd1dz",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "cal-itp-data-infra",
      "pod_name": "hub-8458cb5678-vpjvq",
      "cluster_name": "data-infra-apps",
      "namespace_name": "jupyterhub",
      "container_name": "hub",
      "location": "us-west1"
    }
  },
  "timestamp": "2022-01-19T16:44:38.226238312Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/hub_jupyter_org/network-access-proxy-http": "true",
    "k8s-pod/hub_jupyter_org/network-access-proxy-api": "true",
    "k8s-pod/hub_jupyter_org/network-access-singleuser": "true",
    "k8s-pod/pod-template-hash": "8458cb5678",
    "k8s-pod/release": "jupyterhub",
    "compute.googleapis.com/resource_name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
    "k8s-pod/app": "jupyterhub",
    "k8s-pod/component": "hub"
  },
  "logName": "projects/cal-itp-data-infra/logs/stderr",
  "receiveTimestamp": "2022-01-19T16:45:06.196846686Z"
}
{
  "textPayload": "Loading /usr/local/etc/jupyterhub/secret/values.yaml",
  "insertId": "gr4iccadm3007qbo",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "location": "us-west1",
      "project_id": "cal-itp-data-infra",
      "container_name": "hub",
      "pod_name": "hub-8458cb5678-lrsb7",
      "namespace_name": "jupyterhub",
      "cluster_name": "data-infra-apps"
    }
  },
  "timestamp": "2022-01-19T17:01:42.207315119Z",
  "severity": "INFO",
  "labels": {
    "k8s-pod/hub_jupyter_org/network-access-proxy-api": "true",
    "k8s-pod/hub_jupyter_org/network-access-singleuser": "true",
    "k8s-pod/component": "hub",
    "k8s-pod/hub_jupyter_org/network-access-proxy-http": "true",
    "k8s-pod/release": "jupyterhub",
    "k8s-pod/pod-template-hash": "8458cb5678",
    "k8s-pod/app": "jupyterhub",
    "compute.googleapis.com/resource_name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p"
  },
  "logName": "projects/cal-itp-data-infra/logs/stdout",
  "receiveTimestamp": "2022-01-19T17:01:50.065449382Z"
}
tiffanychu90 commented 2 years ago

I usually get the 503 white screen after I've printed too many maps in a loop, kicking off a 413(?) file save error.

amandaha8 commented 2 years ago

Yesterday, my JupyterHub stopped working around 2-3 with the message "503 Service Temporarily Unavailable". It began working again around 9 today, but stopped again around 12 today without any error message. Now I either see the screen paused at the logo, or a completely white screen, or it loads far enough that I can create notebooks/open terminals/etc., but all of my files are missing and it won't let me change directories to data-analyses.

natam1 commented 2 years ago

A bit more detail on the JupyterHub issues I hit on Monday at 11:21 pm Pacific: the file save error came up first; after a few rounds of that, I got a prompt to reconnect to the hub in a new tab, which brought me to an all-white screen (similar to @lauriemerrell) with the "Service Unavailable" note in the top left corner. JupyterHub worked again after restarting and reconnecting/logging back into the hub.

Today my JupyterHub also stopped working with the 503 error around 9:00 am Pacific, after initially working from 8:00-8:45 am Pacific. Later in the day, around 12:00 pm Pacific, my cells were not producing output when I ran them, but the notebook was saving and my terminal was working.

lottspot commented 2 years ago

It appears this morning's incident stems from the cluster losing connectivity to the node gke-data-infra-apps-apps-v1-f87e3b02-e65p.

The smoking gun:

{
  "insertId": "uf7wfg29qsy8o",
  "jsonPayload": {
    "reportingInstance": "",
    "message": "0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict.",
    "reportingComponent": "",
    "reason": "FailedScheduling",
    "kind": "Event",
    "involvedObject": {
      "namespace": "jupyterhub",
      "kind": "Pod",
      "uid": "27ec1a2f-4cd2-4f33-9176-f34f6d010aab",
      "name": "hub-8458cb5678-lrsb7",
      "resourceVersion": "167872845",
      "apiVersion": "v1"
    },
    "eventTime": null,
    "source": {
      "component": "default-scheduler"
    },
    "metadata": {
      "uid": "17d925a3-1e02-47f6-8f7e-fa5136b06889",
      "managedFields": [
        {
          "manager": "kube-scheduler",
          "fieldsType": "FieldsV1",
          "apiVersion": "v1",
          "fieldsV1": {
            "f:count": {},
            "f:type": {},
            "f:lastTimestamp": {},
            "f:firstTimestamp": {},
            "f:message": {},
            "f:source": {
              "f:component": {}
            },
            "f:reason": {},
            "f:involvedObject": {
              "f:kind": {},
              "f:apiVersion": {},
              "f:resourceVersion": {},
              "f:namespace": {},
              "f:uid": {},
              "f:name": {}
            }
          },
          "time": "2022-01-19T16:51:12Z",
          "operation": "Update"
        }
      ],
      "namespace": "jupyterhub",
      "resourceVersion": "215371",
      "name": "hub-8458cb5678-lrsb7.16cbba1156d37847",
      "creationTimestamp": "2022-01-19T16:51:12Z",
      "selfLink": "/api/v1/namespaces/jupyterhub/events/hub-8458cb5678-lrsb7.16cbba1156d37847"
    },
    "apiVersion": "v1",
    "type": "Warning"
  },
  "resource": {
    "type": "k8s_pod",
    "labels": {
      "project_id": "cal-itp-data-infra",
      "cluster_name": "data-infra-apps",
      "pod_name": "hub-8458cb5678-lrsb7",
      "location": "us-west1",
      "namespace_name": "jupyterhub"
    }
  },
  "timestamp": "2022-01-19T16:52:24Z",
  "severity": "WARNING",
  "logName": "projects/cal-itp-data-infra/logs/events",
  "receiveTimestamp": "2022-01-19T16:52:29.622209076Z"
}

That error means that the hub pod is tightly bound to its node (a limitation identified in #448) and the one node on which it can be placed was unreachable/down. I was able to identify which node that is simply by checking which node the pod is currently running on, since it eventually came back to that node.
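
As an aside (not part of the original investigation), the two conditions named in that FailedScheduling message, an unreachable-node taint and volume node affinity conflicts, can be checked directly. A minimal sketch, assuming the official `kubernetes` Python client and kubeconfig access to the cluster:

```python
# Minimal sketch: surface the two conditions from the FailedScheduling event
# above (which nodes are tainted, and which node each persistent volume is
# pinned to). Assumes the `kubernetes` client and kubeconfig access.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# 1. Nodes carrying taints (e.g. node.kubernetes.io/unreachable, applied when
#    the control plane loses contact with a node).
for node in v1.list_node().items:
    for taint in node.spec.taints or []:
        print(f"{node.metadata.name}: {taint.key}={taint.value} ({taint.effect})")

# 2. Node affinity on persistent volumes; a PV pinned to a single node is what
#    binds the hub pod to that node (the limitation tracked in #448).
for pv in v1.list_persistent_volume().items:
    affinity = pv.spec.node_affinity
    if affinity and affinity.required:
        for term in affinity.required.node_selector_terms:
            for expr in term.match_expressions or []:
                print(f"{pv.metadata.name}: {expr.key} {expr.operator} {expr.values}")
```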

A few minutes before this, we can see that a kernel parameter was changed/updated on the node in question:

{
  "insertId": "ozhkanfp5oa81",
  "jsonPayload": {
    "metadata": {
      "creationTimestamp": "2022-01-19T16:23:47Z",
      "managedFields": [
        {
          "manager": "node-problem-detector",
          "time": "2022-01-19T16:23:47Z",
          "fieldsV1": {
            "f:type": {},
            "f:reason": {},
            "f:lastTimestamp": {},
            "f:firstTimestamp": {},
            "f:involvedObject": {
              "f:uid": {},
              "f:kind": {},
              "f:name": {}
            },
            "f:count": {},
            "f:source": {
              "f:host": {},
              "f:component": {}
            }
          },
          "apiVersion": "v1",
          "operation": "Update",
          "fieldsType": "FieldsV1"
        }
      ],
      "selfLink": "/api/v1/namespaces/default/events/gke-data-infra-apps-apps-v1-f87e3b02-e65p.16cb7da2d16c6ff6",
      "resourceVersion": "215211",
      "namespace": "default",
      "name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p.16cb7da2d16c6ff6",
      "uid": "9214f827-b12e-4775-80b3-0fbab1604f2f"
    },
    "reportingComponent": "",
    "apiVersion": "v1",
    "type": "Warning",
    "source": {
      "component": "sysctl-monitor",
      "host": "gke-data-infra-apps-apps-v1-f87e3b02-e65p"
    },
    "reportingInstance": "",
    "involvedObject": {
      "name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
      "kind": "Node",
      "uid": "gke-data-infra-apps-apps-v1-f87e3b02-e65p"
    },
    "reason": "NodeSysctlChange",
    "eventTime": null,
    "kind": "Event"
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "node_name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
      "cluster_name": "data-infra-apps",
      "location": "us-west1",
      "project_id": "cal-itp-data-infra"
    }
  },
  "timestamp": "2022-01-19T16:23:47Z",
  "severity": "WARNING",
  "logName": "projects/cal-itp-data-infra/logs/events",
  "receiveTimestamp": "2022-01-19T16:23:52.472848734Z"
}

Based on this evidence, I think today's issue was at the GKE infrastructure level (i.e., transient). I'm not sure yet about the 1/17 and 1/14 instances of the problem, which I should still have logging for; I'm looking into those.

lottspot commented 2 years ago

These log entries don't line up time-wise with the 1/14 report, but they are almost certainly a hint about what's going on.

[Screenshot: Screenshot from 2022-01-19 17-41-05]

lottspot commented 2 years ago

On the night of 1/17, the node gke-data-infra-apps-apps-v1-f87e3b02-e65p OOM-killed what I can only assume was a JupyterHub operation:

{
  "insertId": "1e88lpdf9mgv7x",
  "jsonPayload": {
    "source": {
      "host": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
      "component": "kubelet"
    },
    "eventTime": null,
    "involvedObject": {
      "uid": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
      "name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
      "kind": "Node"
    },
    "reportingComponent": "",
    "message": "System OOM encountered, victim process: python, pid: 585821",
    "reason": "SystemOOM",
    "reportingInstance": "",
    "kind": "Event",
    "metadata": {
      "creationTimestamp": "2022-01-18T07:06:41Z",
      "uid": "4bc1c3d6-61d8-4174-af94-8ec200cd8fbc",
      "name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p.16cb4b97097f1efb",
      "selfLink": "/api/v1/namespaces/default/events/gke-data-infra-apps-apps-v1-f87e3b02-e65p.16cb4b97097f1efb",
      "namespace": "default",
      "resourceVersion": "212927",
      "managedFields": [
        {
          "operation": "Update",
          "time": "2022-01-18T07:06:41Z",
          "fieldsV1": {
            "f:type": {},
            "f:involvedObject": {
              "f:name": {},
              "f:uid": {},
              "f:kind": {}
            },
            "f:reason": {},
            "f:source": {
              "f:host": {},
              "f:component": {}
            },
            "f:count": {},
            "f:lastTimestamp": {},
            "f:message": {},
            "f:firstTimestamp": {}
          },
          "apiVersion": "v1",
          "manager": "kubelet",
          "fieldsType": "FieldsV1"
        }
      ]
    },
    "type": "Warning",
    "apiVersion": "v1"
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "cluster_name": "data-infra-apps",
      "node_name": "gke-data-infra-apps-apps-v1-f87e3b02-e65p",
      "location": "us-west1",
      "project_id": "cal-itp-data-infra"
    }
  },
  "timestamp": "2022-01-18T07:06:41Z",
  "severity": "WARNING",
  "logName": "projects/cal-itp-data-infra/logs/events",
  "receiveTimestamp": "2022-01-18T07:06:46.181175696Z"
}
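
For reference (not part of the original comment), SystemOOM node events like this one can also be pulled straight from the Kubernetes API rather than Cloud Logging. A minimal sketch, again assuming the `kubernetes` Python client and kubeconfig access:

```python
# Minimal sketch: list recent SystemOOM node events, oldest first.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

events = v1.list_event_for_all_namespaces(field_selector="reason=SystemOOM")
for event in sorted(
    events.items,
    key=lambda e: e.last_timestamp or e.metadata.creation_timestamp,
):
    print(event.last_timestamp, event.involved_object.name, event.message)
```
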
lottspot commented 2 years ago

Taking everything in totality, I think we have multiple tiers of issues in play here.

  1. Certain operations cause workstations to consume more memory than they are allowed, causing the Kubernetes system to kill the workstation. This would be addressed by work on #592 (see the sketch after this list for how to check the limits currently in place).
  2. Under certain conditions, these operations may cause the underlying Linux system to run out of memory first, resulting in the Linux system killing the workstation.
  3. The problem caused by the outcomes of 1 and 2 (namely, the workstation becoming unavailable) is compounded by the fact that workstations currently cannot freely float between nodes. This would be addressed by work on #448.
  4. Problem 2 is made possible by the fact that workstations cannot freely float between nodes and are mostly concentrated together on the same node (gke-data-infra-apps-apps-v1-cd9a2855-ujw3). Again, a direct derivative of an unsolved #448.
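
To make the distinction between 1 and 2 concrete: 1 is about the memory limit on the singleuser container itself, while 2 is about the node as a whole running out of memory. A minimal sketch for checking what memory requests/limits the workstation pods currently carry, assuming the `kubernetes` Python client and the zero-to-jupyterhub convention of labeling workstation pods with component=singleuser-server:

```python
# Minimal sketch: print the memory request/limit on each workstation container.
# Assumes kubeconfig access and the label component=singleuser-server, which is
# the zero-to-jupyterhub convention for user (workstation) pods.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("jupyterhub", label_selector="component=singleuser-server")
for pod in pods.items:
    for container in pod.spec.containers:
        resources = container.resources
        print(
            pod.metadata.name,
            container.name,
            "memory request:", (resources.requests or {}).get("memory"),
            "memory limit:", (resources.limits or {}).get("memory"),
        )
```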

There are two different holistic solutions that come to mind for me:

  1. Immediately prioritize work on #448 and #592, followed by a cluster node upsize
  2. Deploy a separate node pool dedicated to JupyterHub workstations, followed by less critical work on #592

I actually favor the second solution here. Workstations seem to require the ability to significantly burst their memory usage in ways that the other workloads running on the cluster right now do not. It makes sense both to run workstations in a separate compute pool that can provide adequate memory for those moments and to isolate them so that those moments don't adversely affect other workloads, particularly those with stringent uptime requirements like the gtfs-rt-archiver (I already have reason to believe that the archiver may have been affected by some of these events).

The big drawback to this approach is additional cost. As a next step, I will need to spec out a cost estimate for adding such a pool and receive approval to incur it. I've opened #962 for this task and will be closing this issue, since the investigation has been completed.