portega-inbrain opened this issue 1 year ago
thanks so much for the sleuthing. much appreciated. to confirm, you have a running system with node termination?

we can temporarily add `--set rbac.pspEnabled=false` to our helm call, with a note to remove it once aws adjusts their chart. would you want to send a PR?
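a minimal sketch of what that temporary override could look like (the chart repo url is the standard eks-charts one; the release name and namespace here are placeholders, not necessarily what our setup uses):

```bash
# hypothetical invocation; release name and namespace are placeholders
helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set rbac.pspEnabled=false  # TODO: remove once the chart drops its policy/v1beta1 PSP
```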
also, feel free to send any PRs in general if it helps improve deployment. this started as a hack and turned out to be a relatively simple way to control the setup. we still need to tie it in to github actions, so that we can deploy more automatically. as you may have seen there are a few manual pieces to do.
I created a PR for this (linked above). However, I'd consider this a hack more than anything else. I'm still having problems reproducing the architecture, since I am unable to spawn pods. I can log in and select the pods, but even the tiny pod times out during creation. I think this is related to the default notebook selected, since for a brief time while the pod is trying to spawn I can read the log of the jupyter-user pod created and see:
```
Defaulted container "notebook" out of: notebook, nfs-fixer (init), block-cloud-metadata (init)
```
Unfortunately, there aren't any events in the pod description that could help me debug this. For reference, here is the full pod description:
```
Name:                 jupyter-portega
Namespace:            coddhub
Priority:             0
Priority Class Name:  coddhub-jupyterhub-default-priority
Runtime Class Name:   nvidia
Service Account:      default
Node:                 <none>
Labels:               app=jupyterhub
                      chart=jupyterhub-1.2.0
                      component=singleuser-server
                      heritage=jupyterhub
                      hub.jupyter.org/network-access-hub=true
                      hub.jupyter.org/servername=
                      hub.jupyter.org/username=portega
                      release=coddhub-jupyterhub
Annotations:          hub.jupyter.org/username: portega
Status:               Pending
IP:
IPs:                  <none>
Init Containers:
  nfs-fixer:
    Image:         alpine
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      chmod 0775 /nfs; chown 1000:100 /nfs; chmod 0775 /shared; chown 1000:100 /shared
    Environment:   <none>
    Mounts:
      /nfs from persistent-storage (rw,path="home/portega")
      /shared from persistent-storage (rw,path="shared")
  block-cloud-metadata:
    Image:         jupyterhub/k8s-network-tools:1.2.0
    Port:          <none>
    Host Port:     <none>
    Command:
      iptables
      -A
      OUTPUT
      -d
      169.254.169.254
      -j
      DROP
    Environment:   <none>
    Mounts:        <none>
Containers:
  notebook:
    Image:      dandiarchive/dandihub:latest-gpu
    Port:       8888/TCP
    Host Port:  0/TCP
    Args:
      start-singleuser.sh
      --ip=0.0.0.0
      --port=8888
      --SingleUserNotebookApp.default_url=/lab
      --debug
    Limits:
      cpu:             4
      memory:          16106127360
      nvidia.com/gpu:  1
    Requests:
      cpu:             2
      memory:          10737418240
      nvidia.com/gpu:  1
    Environment:
      CPU_GUARANTEE:                  2.0
      CPU_LIMIT:                      4.0
      JPY_API_TOKEN:                  f28470a075e145688b53e684d3c0d1c1
      JUPYTERHUB_ACTIVITY_URL:        http://hub:8081/hub/api/users/portega/activity
      JUPYTERHUB_API_TOKEN:           f28470a075e145688b53e684d3c0d1c1
      JUPYTERHUB_API_URL:             http://hub:8081/hub/api
      JUPYTERHUB_BASE_URL:            /
      JUPYTERHUB_CLIENT_ID:           jupyterhub-user-portega
      JUPYTERHUB_HOST:
      JUPYTERHUB_OAUTH_CALLBACK_URL:  /user/portega/oauth_callback
      JUPYTERHUB_SERVER_NAME:
      JUPYTERHUB_SERVICE_PREFIX:      /user/portega/
      JUPYTERHUB_USER:                portega
      JUPYTER_IMAGE:                  dandiarchive/dandihub:latest-gpu
      JUPYTER_IMAGE_SPEC:             dandiarchive/dandihub:latest-gpu
      MEM_GUARANTEE:                  10737418240
      MEM_LIMIT:                      16106127360
    Mounts:
      /dev/fuse from fuse (rw)
      /dev/shm from shm-volume (rw)
      /home/portega from persistent-storage (rw,path="home/portega")
      /shared from persistent-storage (rw,path="shared")
Volumes:
  fuse:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/fuse
    HostPathType:
  shm-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  persistent-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  efs-claim
    ReadOnly:   false
QoS Class:       Burstable
Node-Selectors:  kops.k8s.io/gpu=1
Tolerations:     hub.jupyter.org/dedicated=user:NoSchedule
                 hub.jupyter.org_dedicated=user:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists
Events:          <none>
```
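For anyone debugging the same symptom (`Node: <none>` with no events), a few commands that might surface the cause; the namespace, pod, container, and node-selector names below are taken from the description above:

```bash
# cluster-level events sometimes record scheduling failures that the
# pod description omits
kubectl -n coddhub get events --sort-by=.lastTimestamp | grep jupyter-portega

# the pod selects GPU nodes; verify that a matching node exists and is Ready
kubectl get nodes -l kops.k8s.io/gpu=1

# read a specific container's log instead of the defaulted "notebook"
kubectl -n coddhub logs jupyter-portega -c nfs-fixer
```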
Sorry for the delay in my reply! I was finally able to spawn pods by updating the jupyterhub chart here. I used `jupyterhub_chart_version: 2.0.0` and it works perfectly.
I then had problems spawning particular instances, but this was related to the default AWS limit on the number of EC2 instances you can run. I requested a quota increase and everything worked.
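In case it helps others, the quota can be inspected and raised from the CLI. A sketch, assuming the standard on-demand instance quota is the one that applies (GPU families such as g4/g5 have a separate quota code, I believe L-DB2E81BA):

```bash
# check the current "Running On-Demand Standard instances" quota
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-1216C47A

# request an increase (the desired value here is just an example)
aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code L-1216C47A --desired-value 256
```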
I think the default parameters in the ansible configuration are intended for a pretty big cluster. My team is smaller, so I also decreased these parameters (roughly as sketched below). It is probably worth updating the docs regarding this.
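A sketch of the kind of override I mean; only `jupyterhub_chart_version` is a variable confirmed above, the two sizing keys are hypothetical placeholders for whatever the playbook actually exposes:

```bash
# hypothetical invocation; max_worker_nodes and worker_instance_type are
# illustrative placeholders, not the playbook's real variable names
ansible-playbook z2jh.yml \
  -e jupyterhub_chart_version=2.0.0 \
  -e max_worker_nodes=4 \
  -e worker_instance_type=t3.xlarge
```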
Leaving this open, @satra. Let me know if you want me to create a PR with changes along those lines.
When deploying the DandiHub from scratch in a new AWS account following the instructions in the repo, I get an error. The error traces to the installation of `aws-node-termination-handler` in `z2jh.yml`. Digging a little bit into the error message, in particular into

```
no matches for kind "PodSecurityPolicy" in version "policy/v1beta1"
```

I found that Kubernetes has deprecated `v1beta1`. Also see PodDisruptionBudget and PodSecurityPolicy in https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25.
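A quick way to check what a given cluster actually serves (both commands query the API server, not the local client):

```bash
# which policy API versions the server serves: on v1.25+, policy/v1
# remains (for PodDisruptionBudget) but policy/v1beta1 is gone
kubectl api-versions | grep '^policy/'

# PodSecurityPolicy itself was removed rather than moved, so on v1.25+
# this finds nothing
kubectl api-resources | grep -i podsecuritypolicy
```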
Apparently, the `kubectl` version installed when running the `z2jh.yml` instruction for downloading `kubectl` is `v1.25`, i.e. that instruction leads to the installation of `kubectl v1.25`, which, per the links above, discontinued support for `v1beta1`.

On the cluster end, if I run all the commands and get the cluster running without installing `aws-node-termination-handler`, I still get some errors due to the deprecation of the policy. In particular, I see them in the `user-scheduler`, which might just be a consequence of not having the helm chart for `aws-node-termination-handler`.
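For reference, the errors show up in the scheduler logs; assuming the z2jh chart's usual `user-scheduler` deployment name and the namespace used above, something like:

```bash
# tail the user-scheduler logs in the hub's namespace (deployment name
# assumed from z2jh chart defaults)
kubectl -n coddhub logs deploy/user-scheduler --tail=50
```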
UPDATE 1: This issue can be narrowed down to the `rbac.pspEnabled` option in the helm chart, since running the installation with that option disabled leads to a successful installation. Reviewing the chart config, I can see that there is still a dependency on `policy/v1beta1` in https://github.com/aws/eks-charts/blob/8e82f74d75221964d604d3c7b8c70da10160b88e/stable/aws-node-termination-handler/templates/psp.yaml#L2.

UPDATE 2: Changing the `apiVersion` at that same line from `policy/v1beta1` to `policy/v1` still leads to the same error, which makes sense in hindsight: PodSecurityPolicy was removed in v1.25 rather than promoted to `policy/v1`, so no `apiVersion` edit can bring it back.