dandi / dandi-hub

Infrastructure and code for the dandihub
https://hub.dandiarchive.org

check autoscaling both increase and decrease #129

Closed: satra closed this issue 7 months ago

satra commented 8 months ago

We need to verify the time characteristics of spinning up and spinning down. The do_eks implementation doesn't appear to be using autoscaling groups, so it would be good to figure out how autoscaling happens in practice (i.e., what the magic of Karpenter actually is).

asmacdo commented 8 months ago

Here's what happens during spin-up:

The hub creates a Pod with a nodeSelector of gpu or default.
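
For context, that selector is just a label match on the spawned pod spec. A minimal sketch of the relevant part (the label key, values, and image below are placeholders; the real ones come from the hub's profile configuration):

apiVersion: v1
kind: Pod
metadata:
  name: jupyter-asmacdo
  namespace: jupyterhub
spec:
  nodeSelector:
    nodepool: default      # hypothetical label; a "gpu" profile would select the GPU pool instead
  containers:
    - name: notebook
      image: example/notebook:latest   # placeholder image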

# We don't have enough nodes already spun up, and there's nothing we can boot either.
Warning  FailedScheduling   55s   jupyterhub-user-scheduler  0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..

# Karpenter magic: under the hood a NodeClaim (a Karpenter CRD) is created; see the NodePool sketch after this log.
# In response, Karpenter talks to AWS, provisions a new machine, and registers it as a K8s Node.
2024-03-26T15:15:17Z [Normal] Pod should schedule on: nodeclaim/default-55cwr

# Don't be fooled by this message from the cluster-autoscaler. We don't want the pod to trigger cluster-autoscaler scale-up; we are using Karpenter.
2024-03-26T15:15:24Z [Normal] pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector

# Now k8s behaves normally: the pod has been assigned to a Node.
2024-03-26T15:16:03.407813Z [Normal] Successfully assigned jupyterhub/jupyter-asmacdo to ip-100-64-16-100.us-west-1.compute.internal
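
To unpack the Karpenter part a bit: Karpenter watches for pods that fail to schedule, creates a NodeClaim that satisfies their constraints against a matching NodePool, launches the corresponding EC2 instance, and registers it as a Node. Scale-down is driven by the same NodePool's disruption settings. A minimal sketch of the CRD shape, not the actual NodePools defined for dandi-hub:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # Constraints on the instances Karpenter may launch for pods matching this pool
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        name: default      # EC2NodeClass holding AMI/subnet/security-group details
  limits:
    cpu: "1000"            # cap on total CPU this pool may provision
  disruption:
    # Scale-down: once a node is empty, Karpenter terminates it after this delay
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s

The disruption/consolidation block is what covers the "decrease" half of this issue: when the user pod is culled, the node goes empty and Karpenter removes it after the configured delay.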
asmacdo commented 7 months ago

[Attached diagrams: 2024 DandiHub Refactor-Scale Up (drawio), 2024 DandiHub Refactor-Scale Down (drawio)]

Everything appears to work as expected, but as part of the testing issues we should observe behavior at scale to be sure. Closing as completed.