jhamman opened this issue 5 years ago:

In one of our jupyterhub deployments (running on EKS), we've been getting semi-regular `helm upgrade` timeouts when using hubploy. Would it be possible to add the `--timeout` option to hubploy's upgrade call? I believe the default is 300s.
@jhamman I guess this is because the `--wait` makes it wait until the loadbalancer IP has become available. Makes sense to bump it - perhaps as an option in `hubploy.yaml`?
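Purely as an illustration (the `helm.timeout` key below is hypothetical, not an existing hubploy option), such a knob in `hubploy.yaml` could look something like this:

```yaml
# hypothetical hubploy.yaml fragment; the helm.timeout key is illustrative only
helm:
  # would be passed through as `helm upgrade --timeout <seconds>`; helm's default is 300
  timeout: 600
```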
@scottyhq - have you still been getting timeouts? I think things have actually stabilized...
we ran into this same error again with a new deployment https://circleci.com/gh/pangeo-data/pangeo-cloud-federation/480
our manual work-around is documented here https://github.com/pangeo-data/pangeo-cloud-federation/pull/207
digging in a bit further this second time, it seems like the hub pod is stuck in Pending because of the resources available (we are running two hubs, staging and prod, on a single m5.large instance). Here are some relevant commands and output:
```
$ kubectl get pods --namespace nasa-prod -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP              NODE                           NOMINATED NODE
hub-7bcfb9b656-nd4qt     0/1     Pending   0          13m   <none>          <none>                         <none>
proxy-549fc7cb4d-qd9fm   1/1     Running   0          13m   192.168.26.96   ip-192-168-14-7.ec2.internal   <none>
```
```
$ kubectl describe pod hub-7bcfb9b656-nd4qt --namespace nasa-prod
Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Warning  FailedScheduling   12m (x5 over 12m)      default-scheduler   pod has unbound immediate PersistentVolumeClaims
  Normal   NotTriggerScaleUp  2m18s (x60 over 12m)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) didn't match node selector
  Warning  FailedScheduling   20s (x22 over 12m)     default-scheduler   0/3 nodes are available: 1 node(s) had volume node affinity conflict, 2 node(s) didn't match node selector.
```
```
$ kubectl describe node ip-192-168-42-166.ec2.internal
Kubelet Version:     v1.12.7
Kube-Proxy Version:  v1.12.7
ProviderID:          aws:///us-east-1d/i-0141530ac9c89f6b9
Non-terminated Pods: (9 in total)
  Namespace     Name                                  CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------     ----                                  ------------  ----------   ---------------  -------------  ---
  kube-system   aws-node-pl5tz                        10m (0%)      0 (0%)       0 (0%)           0 (0%)         28h
  kube-system   cluster-autoscaler-85896b958b-w6ctt   100m (5%)     100m (5%)    300Mi (3%)       300Mi (3%)     24h
  kube-system   coredns-66bb8d6fdc-tlz8s              100m (5%)     0 (0%)       70Mi (0%)        170Mi (2%)     28h
  kube-system   coredns-66bb8d6fdc-wznkm              100m (5%)     0 (0%)       70Mi (0%)        170Mi (2%)     28h
  kube-system   kube-proxy-vgq6n                      100m (5%)     0 (0%)       0 (0%)           0 (0%)         28h
  kube-system   tiller-deploy-7b4c999868-tgxzk        0 (0%)        0 (0%)       0 (0%)           0 (0%)         25h
  nasa-staging  autohttps-5c855b856b-566mt            0 (0%)        0 (0%)       0 (0%)           0 (0%)         25h
  nasa-staging  hub-6d9f5c9bf8-kmj4b                  500m (25%)    1250m (62%)  1Gi (13%)        1Gi (13%)      87m
  nasa-staging  proxy-754bb98c67-49bvv                200m (10%)    0 (0%)       512Mi (6%)       0 (0%)         25h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1110m (55%)   1350m (67%)
  memory                      1976Mi (26%)  1664Mi (21%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:              <none>
```
although it seems like our hub should fit on there based on the requested resources... https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/nasa/config/common.yaml#L44
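For reference, requests and limits like the ones shown in the node output above are typically expressed in a values file along these lines (a sketch of a zero-to-jupyterhub style `hub.resources` block; depending on the chart layout it may need to sit under a parent key such as `jupyterhub:`):

```yaml
# sketch: hub resource requests/limits matching the values in the node output above
hub:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1250m
      memory: 1Gi
```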
also, the nasa-prod proxy pod (proxy-549fc7cb4d-qd9fm) went onto a user-notebook node...
```
  Namespace     Name                     CPU Requests  CPU Limits  Memory Requests     Memory Limits       AGE
  ---------     ----                     ------------  ----------  ---------------     -------------       ---
  kube-system   aws-node-k5w82           10m (0%)      0 (0%)      0 (0%)              0 (0%)              5h10m
  kube-system   kube-proxy-mjdn7         100m (1%)     0 (0%)      0 (0%)              0 (0%)              5h10m
  nasa-prod     proxy-549fc7cb4d-qd9fm   200m (2%)     0 (0%)      512Mi (1%)          0 (0%)              25m
  nasa-staging  jupyter-scottyhq         3 (37%)       4 (50%)     15032385536 (45%)   17179869184 (52%)   99m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         3310m (41%)    4 (50%)
  memory                      14848Mi (47%)  17179869184 (52%)
  ephemeral-storage           0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0
Events:              <none>
```
ok, so the situation is that the hub tries to go onto the existing instance in us-east-1d, but tries to attach `nasa-prod/hub-db-dir`, which was created in us-east-1f :( so with the current setup we are just lucky when these things end up in the same availability zone. here is a potential solution: https://stackoverflow.com/questions/51946393/kubernetes-pod-warning-1-nodes-had-volume-node-affinity-conflict. or perhaps there is a way to put `hub-db-dir` on the EFS drive?
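One possible stopgap (a sketch, assuming the zero-to-jupyterhub chart's `hub.nodeSelector` setting and the standard zone label; untested here) is to pin the hub pod to the zone the existing volume lives in:

```yaml
# sketch: schedule the hub pod only onto nodes in the volume's zone (us-east-1f)
hub:
  nodeSelector:
    failure-domain.beta.kubernetes.io/zone: us-east-1f
```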
```
$ kubectl describe pv pvc-3b17f1f2-6623-11e9-aa8d-024800b3463a
Name:            pvc-3b17f1f2-6623-11e9-aa8d-024800b3463a
Labels:          failure-domain.beta.kubernetes.io/region=us-east-1
                 failure-domain.beta.kubernetes.io/zone=us-east-1f
Annotations:     kubernetes.io/createdby: aws-ebs-dynamic-provisioner
                 pv.kubernetes.io/bound-by-controller: yes
                 pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    gp2
Status:          Bound
Claim:           nasa-prod/hub-db-dir
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        1Gi
Node Affinity:
  Required Terms:
    Term 0:      failure-domain.beta.kubernetes.io/zone in [us-east-1f]
                 failure-domain.beta.kubernetes.io/region in [us-east-1]
Message:
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-east-1f/vol-08117fc164f9ee9ca
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:          <none>
```
I'd recommend not putting the hub database on the EFS drive, since sqlite and NFS don't mix very well.
Not sure why EKS / kops is putting EBS volumes in a different zone than the k8s cluster :(
Thanks for the feedback @yuvipanda! We are currently running staging.nasa.pangeo.io entirely on EFS, so we'll see if any issues with the hub database come up... I ended up setting our 'default' provisioner to https://github.com/kubernetes-incubator/external-storage/tree/master/aws/efs (instead of EBS/gp2).
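For reference, a minimal sketch of what marking the efs-provisioner's StorageClass as the cluster default could look like (the `example.com/aws-efs` provisioner name is a placeholder; it has to match whatever name the efs-provisioner deployment registers):

```yaml
# sketch: make the efs-provisioner's StorageClass the cluster default
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: aws-efs
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
# placeholder provisioner name; must match the efs-provisioner's configuration
provisioner: example.com/aws-efs
```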
It seems like EBS volumes being provisioned in different zones is a well-known issue. Since EKS recently added Kubernetes >= 1.12, we can now use "topology-aware volume provisioning" if need be.
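The key piece of topology-aware provisioning is a StorageClass that delays volume binding until the consuming pod has been scheduled, so the EBS volume gets created in that pod's availability zone. A minimal sketch, assuming the in-tree EBS provisioner and Kubernetes 1.12+ (the class name here is made up):

```yaml
# sketch: delay volume binding until the pod is scheduled, so the EBS volume
# is provisioned in the same availability zone as the node running the pod
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-topology-aware
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
```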
@scottyhq cool! With sqlite on NFS, the thing to watch out for is times when hub response times spike, often to something like 1-5 seconds per request, which cascades pretty badly pretty quickly. This hit us at Berkeley earlier, and the 'fix' was to move off shared storage. Otherwise, for the kind of workload we have, it works fine.
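If that ever becomes a problem here, one escape hatch (a sketch, assuming the zero-to-jupyterhub chart's `hub.db` settings; the hostname and credentials are placeholders) is to point the hub at an external database instead of sqlite on shared storage:

```yaml
# sketch: keep hub state in an external PostgreSQL database instead of sqlite on EFS/NFS
# the endpoint, user, and password below are placeholders
hub:
  db:
    type: postgres
    url: postgresql://jupyterhub:<password>@example-hub-db.us-east-1.rds.amazonaws.com:5432/jupyterhub
```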