simon-mo opened this issue 5 years ago
Do you recall the node?
Yes!
k describe pod user-scheduler-598c5566b-q9c6p -n data100-prod

Name:               user-scheduler-598c5566b-q9c6p
Namespace:          data100-prod
Priority:           0
PriorityClassName:  <none>
Node:               aks-nodepool1-24491010-4/10.240.0.10
Start Time:         Wed, 13 Mar 2019 17:44:54 -0700
Labels:             app=jupyterhub
                    component=user-scheduler
                    pod-template-hash=154711226
                    release=data100-prod
Annotations:        checksum/config-map=38e3f1e34962404b361b1fe328e81426c0d615fd85d3b70ad0192b2cee7aa321
Status:             Running
IP:                 10.244.5.239
Controlled By:      ReplicaSet/user-scheduler-598c5566b
Containers:
  user-scheduler:
    Container ID:  docker://9acdf464edbb6b8cab96fff14e1bcfbb92c60a641c9eca4dccdd72ec9b0273c5
    Image:         gcr.io/google_containers/kube-scheduler-amd64:v1.11.2
    Image ID:      docker-pullable://gcr.io/google_containers/kube-scheduler-amd64@sha256:3497015ad3c422f7cdd49d7157d7155c176ec34b9d692ef1ea00c5c6804f1816
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/kube-scheduler
      --scheduler-name=data100-prod-user-scheduler
      --policy-configmap=user-scheduler
      --policy-configmap-namespace=data100-prod
      --lock-object-name=user-scheduler
      --lock-object-namespace=data100-prod
      --leader-elect-resource-lock=configmaps
      --v=4
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 18 Mar 2019 18:18:05 -0700
      Finished:     Mon, 18 Mar 2019 18:18:15 -0700
    Ready:          False
    Restart Count:  10
    Requests:
      cpu:     50m
      memory:  256Mi
    Liveness:   http-get http://:10251/healthz delay=15s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:10251/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      KUBERNETES_PORT_443_TCP_ADDR:  data100-19-data100-19s-3f9d81-62c7e3ef.hcp.westus2.azmk8s.io
      KUBERNETES_PORT:               tcp://data100-19-data100-19s-3f9d81-62c7e3ef.hcp.westus2.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:       tcp://data100-19-data100-19s-3f9d81-62c7e3ef.hcp.westus2.azmk8s.io:443
      KUBERNETES_SERVICE_HOST:       data100-19-data100-19s-3f9d81-62c7e3ef.hcp.westus2.azmk8s.io
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from user-scheduler-token-h2fhp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  user-scheduler-token-h2fhp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  user-scheduler-token-h2fhp
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age               From                               Message
  ----     ------     ----              ----                               -------
  Normal   Created    19m (x7 over 5d)  kubelet, aks-nodepool1-24491010-4  Created container
  Normal   Started    19m (x7 over 5d)  kubelet, aks-nodepool1-24491010-4  Started container
  Normal   Pulled     19m (x6 over 4d)  kubelet, aks-nodepool1-24491010-4  Container image "gcr.io/google_containers/kube-scheduler-amd64:v1.11.2" already present on machine
  Warning  Unhealthy  19m (x5 over 4d)  kubelet, aks-nodepool1-24491010-4  Readiness probe failed: Get http://10.244.5.239:10251/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff    1m (x93 over 4d)  kubelet, aks-nodepool1-24491010-4  Back-off restarting failed container
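The Unhealthy events above are the kubelet timing out against http://10.244.5.239:10251/healthz. As a sketch of how one might double-check that endpoint directly (the IP, port, and 1s timeout are taken from the describe output; run it from the node the pod is on):

# hit the probe URL with the same 1s timeout the kubelet uses
curl --max-time 1 -v http://10.244.5.239:10251/healthz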
Thanks! The health check responds fine when queried from the node, and the node itself also looks healthy. The oom-killer ran a bunch of times:
# dmesg | grep -c mem.cgroup.out.of.memory
15
But those were mostly ZMQ processes, i.e. students' notebook kernels hitting their memory limits.
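(As a sketch of how to confirm that, rather than output from the incident: the same dmesg pattern with a few lines of context shows the process name and cgroup path for each kill.)

# print surrounding lines for each cgroup OOM kill: process name, cgroup path, memory usage
dmesg | grep -B 3 -A 10 mem.cgroup.out.of.memory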
The incident was reported at 6:19: the prod datahub was returning "Service Unavailable". Querying with kubectl showed that user-scheduler was in CrashLoopBackOff.
Service was restored after 8 restart attempts.
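As a general sketch (not commands run during the incident): the previous container's logs usually say why kube-scheduler exited with code 1, and deleting the pod makes the ReplicaSet recreate it immediately instead of waiting out the back-off, assuming the underlying cause has cleared.

# why did the last run exit with code 1?
kubectl -n data100-prod logs user-scheduler-598c5566b-q9c6p --previous

# recreate the pod rather than waiting for the CrashLoopBackOff timer, then watch it come up
kubectl -n data100-prod delete pod user-scheduler-598c5566b-q9c6p
kubectl -n data100-prod get pods -l component=user-scheduler -w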