berkeley-dsep-infra / hubploy

Toolkit to deploy many z2jh based JupyterHubs
BSD 3-Clause "New" or "Revised" License

increase default timeout in helm upgrade command #17

Open jhamman opened 5 years ago

jhamman commented 5 years ago

In one of our JupyterHub deployments (running on EKS), we've been getting semi-regular helm upgrade timeouts when using hubploy:

Downloading pangeo from repo https://pangeo-data.github.io/helm-chart/
Deleting outdated charts
Release "icesat2-prod" does not exist. Installing it now.
Error: release icesat2-prod failed: timed out waiting for the condition
Traceback (most recent call last):
  File "/home/circleci/repo/venv/bin/hubploy", line 11, in <module>
    load_entry_point('hubploy==0.1.0', 'console_scripts', 'hubploy')()
  File "/home/circleci/repo/venv/lib/python3.7/site-packages/hubploy/__main__.py", line 65, in main
    helm.deploy(args.deployment, args.chart, args.environment, args.namespace, args.set, args.version)
  File "/home/circleci/repo/venv/lib/python3.7/site-packages/hubploy/helm.py", line 128, in deploy
    version
  File "/home/circleci/repo/venv/lib/python3.7/site-packages/hubploy/helm.py", line 58, in helm_upgrade
    subprocess.check_call(cmd)
  File "/usr/local/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['helm', 'upgrade', '--wait', '--install', '--namespace', 'icesat2-prod', 'icesat2-prod', 'pangeo-deploy', '-f', 'deployments/icesat2/config/common.yaml', '-f', 'deployments/icesat2/config/prod.yaml', '-f', 'deployments/icesat2/secrets/prod.yaml', '--set', 'pangeo.jupyterhub.singleuser.image.tag=2035c99', '--set', 'pangeo.jupyterhub.singleuser.image.name=783380859522.dkr.ecr.us-west-2.amazonaws.com/pangeo']' returned non-zero exit status 1.
Exited with code 1

Would it be possible to add the --timeout option to hubploy's upgrade call? I believe the default is 300s.

yuvipanda commented 5 years ago

@jhamman I guess this is because the --wait makes helm wait until the load balancer IP has become available. Makes sense to bump it - perhaps as an option in hubploy.yaml?
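
For reference, here is a rough sketch of what passing a configurable timeout through to helm upgrade could look like; the timeout argument and any hubploy.yaml wiring here are hypothetical, not what hubploy does today:

# Sketch only: pass a configurable timeout through to `helm upgrade`.
# The `timeout` argument (and reading it from hubploy.yaml) is hypothetical.
import subprocess

def helm_upgrade(name, namespace, chart, config_files, timeout=300):
    # Same shape of command hubploy runs today, plus an explicit --timeout.
    cmd = [
        'helm', 'upgrade', '--wait', '--install',
        '--namespace', namespace,
        name, chart,
        # helm 2 takes --timeout in seconds; the default is 300
        '--timeout', str(timeout),
    ]
    for config_file in config_files:
        cmd += ['-f', config_file]
    subprocess.check_call(cmd)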

jhamman commented 5 years ago

@scottyhq - have you still been getting timeouts? I think things have actually stabilized...

scottyhq commented 5 years ago

We ran into this same error again with a new deployment: https://circleci.com/gh/pangeo-data/pangeo-cloud-federation/480

Our manual workaround is documented here: https://github.com/pangeo-data/pangeo-cloud-federation/pull/207

scottyhq commented 5 years ago

Digging in a bit further this second time, it seems like the hub pod is stuck in Pending because of the resources available (we are running two hubs, staging and prod, on a single m5.large instance). Here are some relevant commands and output:

kubectl get pods --namespace nasa-prod -o wide

(aws) scott@pangeo-cloud-federation:kubectl get pods --namespace nasa-prod -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP              NODE                           NOMINATED NODE
hub-7bcfb9b656-nd4qt     0/1     Pending   0          13m   <none>          <none>                         <none>
proxy-549fc7cb4d-qd9fm   1/1     Running   0          13m   192.168.26.96   ip-192-168-14-7.ec2.internal   <none>

kubectl describe pod hub-7bcfb9b656-nd4qt --namespace nasa-prod

Events:
  Type     Reason             Age                   From                Message
  ----     ------             ----                  ----                -------
  Warning  FailedScheduling   12m (x5 over 12m)     default-scheduler   pod has unbound immediate PersistentVolumeClaims
  Normal   NotTriggerScaleUp  2m18s (x60 over 12m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) didn't match node selector
  Warning  FailedScheduling   20s (x22 over 12m)    default-scheduler   0/3 nodes are available: 1 node(s) had volume node affinity conflict, 2 node(s) didn't match node selector.

kubectl describe node ip-192-168-42-166.ec2.internal

 Kubelet Version:            v1.12.7
 Kube-Proxy Version:         v1.12.7
ProviderID:                  aws:///us-east-1d/i-0141530ac9c89f6b9
Non-terminated Pods:         (9 in total)
  Namespace                  Name                                   CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------                  ----                                   ------------  ----------   ---------------  -------------  ---
  kube-system                aws-node-pl5tz                         10m (0%)      0 (0%)       0 (0%)           0 (0%)         28h
  kube-system                cluster-autoscaler-85896b958b-w6ctt    100m (5%)     100m (5%)    300Mi (3%)       300Mi (3%)     24h
  kube-system                coredns-66bb8d6fdc-tlz8s               100m (5%)     0 (0%)       70Mi (0%)        170Mi (2%)     28h
  kube-system                coredns-66bb8d6fdc-wznkm               100m (5%)     0 (0%)       70Mi (0%)        170Mi (2%)     28h
  kube-system                kube-proxy-vgq6n                       100m (5%)     0 (0%)       0 (0%)           0 (0%)         28h
  kube-system                tiller-deploy-7b4c999868-tgxzk         0 (0%)        0 (0%)       0 (0%)           0 (0%)         25h
  nasa-staging               autohttps-5c855b856b-566mt             0 (0%)        0 (0%)       0 (0%)           0 (0%)         25h
  nasa-staging               hub-6d9f5c9bf8-kmj4b                   500m (25%)    1250m (62%)  1Gi (13%)        1Gi (13%)      87m
  nasa-staging               proxy-754bb98c67-49bvv                 200m (10%)    0 (0%)       512Mi (6%)       0 (0%)         25h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1110m (55%)   1350m (67%)
  memory                      1976Mi (26%)  1664Mi (21%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:                       <none>

Although it seems like our hub should fit there based on the requested resources... https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/nasa/config/common.yaml#L44

scottyhq commented 5 years ago

Also, the nasa-prod proxy pod (proxy-549fc7cb4d-qd9fm) landed on a user-notebook node...

  Namespace                  Name                      CPU Requests  CPU Limits  Memory Requests    Memory Limits      AGE
  ---------                  ----                      ------------  ----------  ---------------    -------------      ---
  kube-system                aws-node-k5w82            10m (0%)      0 (0%)      0 (0%)             0 (0%)             5h10m
  kube-system                kube-proxy-mjdn7          100m (1%)     0 (0%)      0 (0%)             0 (0%)             5h10m
  nasa-prod                  proxy-549fc7cb4d-qd9fm    200m (2%)     0 (0%)      512Mi (1%)         0 (0%)             25m
  nasa-staging               jupyter-scottyhq          3 (37%)       4 (50%)     15032385536 (45%)  17179869184 (52%)  99m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         3310m (41%)    4 (50%)
  memory                      14848Mi (47%)  17179869184 (52%)
  ephemeral-storage           0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0
Events:                       <none>

scottyhq commented 5 years ago

OK, so the situation is that the hub tries to go onto the existing instance, which is in us-east-1d, but tries to attach nasa-prod/hub-db-dir, which was created in us-east-1f :( So with the current setup we are just lucky when these things end up in the same zone. Here is a potential solution: https://stackoverflow.com/questions/51946393/kubernetes-pod-warning-1-nodes-had-volume-node-affinity-conflict. Or perhaps there is a way to put hub-db-dir on the EFS drive?

kubectl describe pv pvc-3b17f1f2-6623-11e9-aa8d-024800b3463a

Name:              pvc-3b17f1f2-6623-11e9-aa8d-024800b3463a
Labels:            failure-domain.beta.kubernetes.io/region=us-east-1
                   failure-domain.beta.kubernetes.io/zone=us-east-1f
Annotations:       kubernetes.io/createdby: aws-ebs-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      gp2
Status:            Bound
Claim:             nasa-prod/hub-db-dir
Reclaim Policy:    Delete
Access Modes:      RWO
Capacity:          1Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [us-east-1f]
                   failure-domain.beta.kubernetes.io/region in [us-east-1]
Message:           
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-east-1f/vol-08117fc164f9ee9ca
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

yuvipanda commented 5 years ago

I'd recommend not putting the hub database on the EFS drive, since SQLite and NFS don't mix very well.

Not sure why EKS / kops is putting EBS volumes in a different zone than the k8s cluster :(

scottyhq commented 5 years ago

Thanks for the feedback @yuvipanda! We are currently running staging.nasa.pangeo.io entirely on EFS, so we'll see if any issues with the hub database come up... I ended up setting our 'default' provisioner to https://github.com/kubernetes-incubator/external-storage/tree/master/aws/efs (instead of EBS/gp2).

It seems like EBS volumes going into different zones is a well-known issue. Since EKS recently added Kubernetes >1.12, we can now use "topology-aware volume provisioning" if need be.
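
For reference, a rough sketch of that idea via the Kubernetes Python client, equivalent to creating a StorageClass manifest with volumeBindingMode: WaitForFirstConsumer (the class name gp2-topology-aware is just illustrative):

# Sketch only: a topology-aware StorageClass, so EBS volumes get provisioned
# in whatever zone the pod is actually scheduled to instead of up front.
from kubernetes import client, config

config.load_kube_config()

sc = client.V1StorageClass(
    api_version="storage.k8s.io/v1",
    kind="StorageClass",
    metadata=client.V1ObjectMeta(name="gp2-topology-aware"),  # illustrative name
    provisioner="kubernetes.io/aws-ebs",
    parameters={"type": "gp2"},
    # Delay volume binding until a pod using the PVC is scheduled, so the EBS
    # volume is created in the same zone as that node (needs Kubernetes >= 1.12).
    volume_binding_mode="WaitForFirstConsumer",
)
client.StorageV1Api().create_storage_class(sc)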

yuvipanda commented 5 years ago

@scottyhq cool! With SQLite on NFS, the thing to watch out for is when hub response times spike, often to something like 1-5 seconds per request, which cascades pretty badly pretty quickly. This hit us at Berkeley earlier, and the 'fix' was to move off shared storage. Otherwise, for the kind of workload we have, it works fine.