Closed — satra closed this 7 months ago
Here's what happens during spin-up: the hub creates a Pod with a nodeSelector of `gpu` or `default`.
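As a rough sketch, the hub-created singleuser Pod might look like the following. The label key, values, and image here are assumptions for illustration, not our exact configuration (in Zero to JupyterHub the selector is typically set via `singleuser.nodeSelector` in the Helm chart values):

```yaml
# Hypothetical sketch of the singleuser Pod the hub creates.
# Label key/values and image are assumptions, not our actual config.
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-asmacdo
  namespace: jupyterhub
spec:
  nodeSelector:
    nodepool: default   # or "gpu" for GPU profiles
  containers:
    - name: notebook
      image: jupyter/base-notebook
```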
# We don't have enough nodes already spun up, and there's nothing available to boot either.
Warning FailedScheduling 55s jupyterhub-user-scheduler 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
# Karpenter magic: under the hood a NodeClaim (a Karpenter CRD) is created.
# In response, Karpenter calls AWS APIs to launch a new instance and registers it as a Kubernetes Node.
2024-03-26T15:15:17Z [Normal] Pod should schedule on: nodeclaim/default-55cwr
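For context, Karpenter decides what to launch based on NodePool resources. A minimal, hedged sketch of what ours might look like (the name matches the `default-55cwr` NodeClaim above, but the requirements and limits are assumptions):

```yaml
# Hypothetical Karpenter NodePool sketch; requirements and limits are assumptions.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      # References an EC2NodeClass that holds the AWS-specific settings (AMI, subnets, etc.)
      nodeClassRef:
        name: default
  limits:
    cpu: "100"
```

When a Pod is unschedulable, Karpenter matches it against NodePools like this one, creates a NodeClaim, and provisions a matching EC2 instance directly, without an autoscaling group.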
# Don't be fooled by this message from the cluster-autoscaler. We don't want the pod to trigger a cluster-autoscaler scale-up; we are using Karpenter.
2024-03-26T15:15:24Z [Normal] pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
# Now k8s behaves normally: the pod has been assigned to a Node.
2024-03-26T15:16:03.407813Z [Normal] Successfully assigned jupyterhub/jupyter-asmacdo to ip-100-64-16-100.us-west-1.compute.internal
Everything appears to work as expected, though we should observe behavior at scale as part of the testing issues. Closing as completed.
We still need to verify the timing characteristics of spin-up and spin-down. The do_eks implementation doesn't appear to use autoscaling groups, so it would be good to figure out how autoscaling happens in practice (i.e., what the Karpenter "magic" actually does).
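For the spin-down side specifically, Karpenter's behavior is governed by the NodePool's `disruption` settings rather than by an autoscaling group. A hedged sketch of the relevant fragment (the policy and timeout values are assumptions, not our measured configuration):

```yaml
# Hypothetical disruption settings on a Karpenter NodePool;
# values are assumptions for illustration.
spec:
  disruption:
    # Remove a node once no non-daemon pods remain on it...
    consolidationPolicy: WhenEmpty
    # ...after it has been empty for this long.
    consolidateAfter: 30s
```

Checking what these are actually set to would be a good first step in characterizing spin-down time.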