dandi / dandi-hub

Infrastructure and code for the dandihub
https://hub.dandiarchive.org

Unpin Karpenter controller pod #207

Open asmacdo opened 1 day ago

asmacdo commented 1 day ago

Investigate Karpenter 0.37

When redeploying LINC, Karpenter failed to bring up a new node, which presents as a user-pod launch that hangs.

Useful debug information was scarce, but describing the nodeclaim and getting the logs of the Karpenter pod did indicate that something was going wrong. (The exact error messages weren't recorded, but they weren't very helpful.)

Currently, on Dandihub we are using Karpenter controller image 0.35.0, but LINC was using 0.37.

We patched the Karpenter config in addons.tf in https://github.com/dandi/dandi-hub/pull/205/, which has temporarily mitigated the problem, but we are now pinned and should investigate what is necessary to bring this up to 0.37 (and possibly 1.0+?).

  #---------------------------------------
  # Karpenter Autoscaler for EKS Cluster
  #---------------------------------------
  enable_karpenter                  = true
  karpenter_enable_spot_termination = true
  karpenter = {
    timeout             = "300"
    repository_username = data.aws_ecrpublic_authorization_token.token.user_name
    repository_password = data.aws_ecrpublic_authorization_token.token.password
    values = [<<EOT
        controller:
          image: public.ecr.aws/karpenter/controller:0.35.0@sha256:48d1246f6b2066404e300cbf3e26d0bcdc57a76531dcb634d571f4f0e050cb57
    EOT
    ]
  }
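
One possible way to make the pin easier to revisit is to move the image reference into a variable, so that testing 0.37 is a single-value change. This is only a sketch: the variable `karpenter_controller_image` and the local `karpenter_values` are hypothetical names that do not exist in the repo, and whether changing the image alone is enough for 0.37 is exactly what this issue needs to investigate.

```hcl
# Hypothetical variable (not in the current repo): the full controller image
# reference. The default is the value currently pinned in addons.tf.
variable "karpenter_controller_image" {
  description = "Karpenter controller image reference, including tag and digest"
  type        = string
  default     = "public.ecr.aws/karpenter/controller:0.35.0@sha256:48d1246f6b2066404e300cbf3e26d0bcdc57a76531dcb634d571f4f0e050cb57"
}

locals {
  # Hypothetical local: renders the same Helm values as the heredoc above,
  # using Terraform's built-in yamlencode() instead of hand-written YAML.
  karpenter_values = yamlencode({
    controller = {
      image = var.karpenter_controller_image
    }
  })
}

# In addons.tf, the existing `values` argument would then become:
#   values = [local.karpenter_values]
```

With this in place, a different image could be tried by overriding the variable (for example in a tfvars file) without editing addons.tf. Note that the digest is tied to the tag, so both have to be updated together.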