dandi / dandi-hub

Infrastructure and code for the dandihub
https://hub.dandiarchive.org

Termination handler fails to install #52

Open portega-inbrain opened 1 year ago

portega-inbrain commented 1 year ago

When deploying the DandiHub from scratch in a new AWS account following the instructions in the repo, I get the following error:

Release "aws-node-termination-handler" does not exist. Installing it now.
Error: unable to build kubernetes objects from release manifest: resource mapping not found
for name: "aws-node-termination-handler" namespace: "" from "":
no matches for kind "PodSecurityPolicy" in version "policy/v1beta1"
ensure CRDs are installed first

This error traces to the installation of aws-node-termination-handler in z2jh.yml, i.e. this task:

  - name: Add termination handler
    shell: helm upgrade --install aws-node-termination-handler \
      --namespace kube-system \
      eks/aws-node-termination-handler

Digging a little into the error message, in particular into no matches for kind "PodSecurityPolicy" in version "policy/v1beta1", I found that Kubernetes has deprecated these v1beta1 policy APIs:

The policy/v1beta1 API version of PodDisruptionBudget will no longer be served in v1.25.

    Migrate manifests and API clients to use the policy/v1 API version, available since v1.21.
    All existing persisted objects are accessible via the new API.
    Notable changes in policy/v1:
        an empty spec.selector ({}) written to a policy/v1 PodDisruptionBudget selects all pods in the namespace (in policy/v1beta1 an empty spec.selector selected no pods). An unset spec.selector selects no pods in either API version.

PodSecurityPolicy

PodSecurityPolicy in the policy/v1beta1 API version will no longer be served in v1.25, and the PodSecurityPolicy admission controller will be removed.

Migrate to Pod Security Admission or a 3rd party admission webhook. For a migration guide, see Migrate from PodSecurityPolicy to the Built-In PodSecurity Admission Controller. For more information on the deprecation, see PodSecurityPolicy Deprecation: Past, Present, and Future.

Also see: PodDisruptionBudget, PodSecurityPolicy in https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25.
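
One quick way to confirm what the cluster actually serves (assuming kubectl points at the affected cluster):

# list the policy API versions the server serves; on a v1.25 cluster this
# returns policy/v1 only, i.e. policy/v1beta1 is gone
kubectl api-versions | grep '^policy'

# and list which kinds are still available under the policy group
kubectl api-resources --api-group=policy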

Apparently the kubectl version installed by the z2jh.yml instruction for downloading kubectl is v1.25, i.e. this instruction:

wget -O kubectl https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl

leads to the installation of kubectl v1.25, which, per the links above, no longer supports the policy/v1beta1 API.
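
If a specific client version is needed, the same download URL accepts a pinned release instead of stable.txt; a sketch (version number illustrative):

# pin kubectl to a pre-1.25 release rather than whatever stable.txt points at
wget -O kubectl https://storage.googleapis.com/kubernetes-release/release/v1.24.9/bin/linux/amd64/kubectl
chmod +x kubectl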

On the cluster end, if I run all the commands and get it running without installing aws-node-termination-handler, I still get some errors due to the deprecated policy API. In particular, this is what I get in the user-scheduler, which might just be a consequence of not having the helm chart for aws-node-termination-handler:

E1206 10:07:45.986565       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1beta1.PodDisruptionBudget: failed to list *v1beta1.PodDisruptionBudget: the server could not find the requested resource
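
For reference, that log line can be pulled with something like the following (the deployment name comes from the z2jh chart, the namespace from the pod description further down):

# tail the scheduler logs; z2jh deploys it as a Deployment named user-scheduler
kubectl -n coddhub logs deploy/user-scheduler --tail=50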

UPDATE 1: This issue can be narrowed down to the rbac.pspEnabled option in the helm chart, since running

helm install --namespace kube-system aws-node-termination-handler eks/aws-node-termination-handler --set rbac.pspEnabled=false

leads to a successful installation. Reviewing the chart config, I can see that there is still a dependency on policy/v1beta1 in https://github.com/aws/eks-charts/blob/8e82f74d75221964d604d3c7b8c70da10160b88e/stable/aws-node-termination-handler/templates/psp.yaml#L2.
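
After that command, the installation can be verified with, e.g. (daemonset name assumed from the chart's defaults):

# confirm the release landed and the daemonset is up
helm status aws-node-termination-handler -n kube-system
kubectl -n kube-system get daemonset aws-node-termination-handler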

UPDATE 2: Changing https://github.com/aws/eks-charts/blob/8e82f74d75221964d604d3c7b8c70da10160b88e/stable/aws-node-termination-handler/templates/psp.yaml#L2 from policy/v1beta1 to policy/v1 still leads to the same error.
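
This is probably expected: the PodSecurityPolicy kind was removed from Kubernetes entirely in v1.25, so there is no policy/v1 version of it to migrate to. Rendering the chart locally shows what helm would try to apply (a sketch; assumes the eks repo has been added):

# render the chart without installing, to inspect the PSP manifest it ships
helm template aws-node-termination-handler eks/aws-node-termination-handler \
  --set rbac.pspEnabled=true | grep -B 1 'kind: PodSecurityPolicy'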

satra commented 1 year ago

thanks so much for the sleuthing. much appreciated. to confirm, you have a running system with node termination?

we can temporarily add that --set rbac.pspEnabled=false to our helm call with a note to remove it perhaps once aws adjusts theirs. would you want to send a PR?
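
something along these lines, perhaps (a sketch of the amended call in z2jh.yml):

# temporary workaround for the PSP template in the upstream eks chart
# TODO: drop --set rbac.pspEnabled=false once aws/eks-charts removes the PSP
helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set rbac.pspEnabled=false \
  eks/aws-node-termination-handler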

also, feel free to send any PRs in general if it helps improve deployment. this started as a hack and turned out to be a relatively simple way to control the setup. we still need to tie it in to github actions, so that we can deploy more automatically. as you may have seen there are a few manual pieces to do.

portega-inbrain commented 1 year ago

I created a PR for this (linked above). However, I'd consider it a hack more than anything else. I'm still having problems reproducing the architecture, since I am unable to spawn pods. I can log in and select a pod, but even the tiny pod times out during creation. I think this is related to the default notebook selected, since for a brief time while the pod is trying to spawn I can read the log of the jupyter-user pod and see:

Defaulted container "notebook" out of: notebook, nfs-fixer (init), block-cloud-metadata (init)

Unfortunately, there are no events in the pod description that could help me debug this:

Name:                 jupyter-portega
Namespace:            coddhub
Priority:             0
Priority Class Name:  coddhub-jupyterhub-default-priority
Runtime Class Name:   nvidia
Service Account:      default
Node:                 <none>
Labels:               app=jupyterhub
                      chart=jupyterhub-1.2.0
                      component=singleuser-server
                      heritage=jupyterhub
                      hub.jupyter.org/network-access-hub=true
                      hub.jupyter.org/servername=
                      hub.jupyter.org/username=portega
                      release=coddhub-jupyterhub
Annotations:          hub.jupyter.org/username: portega
Status:               Pending
IP:
IPs:                  <none>
Init Containers:
  nfs-fixer:
    Image:      alpine
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      chmod 0775 /nfs; chown 1000:100 /nfs; chmod 0775 /shared; chown 1000:100 /shared

    Environment:  <none>
    Mounts:
      /nfs from persistent-storage (rw,path="home/portega")
      /shared from persistent-storage (rw,path="shared")
  block-cloud-metadata:
    Image:      jupyterhub/k8s-network-tools:1.2.0
    Port:       <none>
    Host Port:  <none>
    Command:
      iptables
      -A
      OUTPUT
      -d
      169.254.169.254
      -j
      DROP
    Environment:  <none>
    Mounts:       <none>
Containers:
  notebook:
    Image:      dandiarchive/dandihub:latest-gpu
    Port:       8888/TCP
    Host Port:  0/TCP
    Args:
      start-singleuser.sh
      --ip=0.0.0.0
      --port=8888
      --SingleUserNotebookApp.default_url=/lab
      --debug
    Limits:
      cpu:             4
      memory:          16106127360
      nvidia.com/gpu:  1
    Requests:
      cpu:             2
      memory:          10737418240
      nvidia.com/gpu:  1
    Environment:
      CPU_GUARANTEE:                  2.0
      CPU_LIMIT:                      4.0
      JPY_API_TOKEN:                  f28470a075e145688b53e684d3c0d1c1
      JUPYTERHUB_ACTIVITY_URL:        http://hub:8081/hub/api/users/portega/activity
      JUPYTERHUB_API_TOKEN:           f28470a075e145688b53e684d3c0d1c1
      JUPYTERHUB_API_URL:             http://hub:8081/hub/api
      JUPYTERHUB_BASE_URL:            /
      JUPYTERHUB_CLIENT_ID:           jupyterhub-user-portega
      JUPYTERHUB_HOST:
      JUPYTERHUB_OAUTH_CALLBACK_URL:  /user/portega/oauth_callback
      JUPYTERHUB_SERVER_NAME:
      JUPYTERHUB_SERVICE_PREFIX:      /user/portega/
      JUPYTERHUB_USER:                portega
      JUPYTER_IMAGE:                  dandiarchive/dandihub:latest-gpu
      JUPYTER_IMAGE_SPEC:             dandiarchive/dandihub:latest-gpu
      MEM_GUARANTEE:                  10737418240
      MEM_LIMIT:                      16106127360
    Mounts:
      /dev/fuse from fuse (rw)
      /dev/shm from shm-volume (rw)
      /home/portega from persistent-storage (rw,path="home/portega")
      /shared from persistent-storage (rw,path="shared")
Volumes:
  fuse:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/fuse
    HostPathType:
  shm-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  persistent-storage:
    Type:        PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:   efs-claim
    ReadOnly:    false
QoS Class:       Burstable
Node-Selectors:  kops.k8s.io/gpu=1
Tolerations:     hub.jupyter.org/dedicated=user:NoSchedule
                 hub.jupyter.org_dedicated=user:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists
Events:          <none>
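
With no events on the pod, a few other places one could check (pod name, namespace, and node selector taken from the description above):

# cluster-level events, in case the pod description dropped them
kubectl -n coddhub get events --sort-by=.lastTimestamp | tail -20

# logs of the init containers and of the notebook container itself
kubectl -n coddhub logs jupyter-portega -c nfs-fixer
kubectl -n coddhub logs jupyter-portega -c notebook

# is any node matching the selector actually available?
kubectl get nodes -l kops.k8s.io/gpu=1
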
portega-inbrain commented 1 year ago

Sorry for the delay in my reply! I was finally able to spawn pods by updating the jupyterhub chart version here:

https://github.com/dandi/dandi-hub/blob/205417f3706d08fb2bb1261efd477a19d536bd5a/dandi-info/group_vars/all#L12

I used jupyterhub_chart_version: 2.0.0 and it works perfectly.
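
In case it helps anyone reproducing this, the published chart versions can be listed with helm (assuming the jupyterhub repo is added under that name):

# list available jupyterhub chart versions to pick a current one
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
helm search repo jupyterhub/jupyterhub --versions | head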

I then had problems spawning particular instances, but this was related to an AWS limit on the number of EC2 instances you can launch by default. I requested a quota increase and it all worked.
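
For reference, the quota can be checked and an increase requested from the AWS CLI; a sketch (the quota code below is the one for running on-demand standard instances, worth double-checking for your case):

# check the current vCPU limit for running on-demand standard instances
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-1216C47A

# request an increase (desired value illustrative)
aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code L-1216C47A --desired-value 64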

I think the default parameters in the ansible configuration are meant for a pretty big cluster. My team is smaller, so I also decreased these parameters. It is probably worth updating the docs regarding this.

Leaving this open @satra. Let me know if you want me to create a PR with changes on that line.