When redeploying LINC, Karpenter failed to bring up a new node, which presents as a user-pod launch hanging.
Useful debug information was scarce, but describing the `nodeclaim` and getting the logs of the Karpenter pod did indicate that something was going wrong. (The exact error messages weren't recorded, and they weren't very helpful.)
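For next time, here is a minimal sketch of the two debugging steps above (describing the nodeclaims and pulling the Karpenter pod logs), so the error messages can be captured and pasted here. The `karpenter` namespace and the `app.kubernetes.io/name=karpenter` label selector are assumptions and may need adjusting for our deployment; `debug_commands` and `run_debug` are hypothetical helper names.

```python
import shlex
import subprocess

# Assumptions (not confirmed for this deployment): the namespace Karpenter
# runs in and the label selector that matches its controller pods.
KARPENTER_NAMESPACE = "karpenter"
KARPENTER_SELECTOR = "app.kubernetes.io/name=karpenter"


def debug_commands(namespace=KARPENTER_NAMESPACE,
                   selector=KARPENTER_SELECTOR,
                   tail=200):
    """Build the kubectl commands for inspecting a stuck node launch."""
    return [
        # List nodeclaims to see which ones never became ready.
        ["kubectl", "get", "nodeclaims"],
        # Describe them to see status conditions and events.
        ["kubectl", "describe", "nodeclaims"],
        # Fetch recent logs from the Karpenter controller pods.
        ["kubectl", "logs",
         f"--namespace={namespace}",
         f"--selector={selector}",
         f"--tail={tail}"],
    ]


def run_debug(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$ " + shlex.join(cmd))
        if not dry_run:
            # check=False: keep going even if one command fails.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    # Dry run by default: only prints the commands.
    run_debug(debug_commands())
```

Calling `run_debug(debug_commands(), dry_run=False)` against the affected cluster would actually execute the commands; redirecting the output to a file keeps the exact error messages for the record.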
Currently, Dandihub uses Karpenter controller image 0.35.0, but LINC was using 0.37.
We patched the Karpenter config in `addons.tf` in https://github.com/dandi/dandi-hub/pull/205/, which has temporarily mitigated the problem. However, we are now pinned and should investigate what is necessary to bring this up to 0.37 (and possibly 1.0+?).
Investigate Karpenter 0.37