aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

error - missing required startup taint, karpenter.sh/unregistered #6821

Open ArieLevs opened 3 weeks ago

ArieLevs commented 3 weeks ago

Description

Observed Behavior: The Karpenter NodeClaim READY state is always False when it should be True. With Karpenter 0.37.1 everything works perfectly well; after migrating to 1.0.0 (or installing 1.0.0 from scratch), the controller reports this error:

{
  "level": "ERROR",
  "time": "2024-08-21T12:03:10.958Z",
  "logger": "controller",
  "caller": "controller/controller.go:261",
  "message": "Reconciler error",
  "commit": "5bdf9c3",
  "controller": "nodeclaim.lifecycle",
  "controllerGroup": "karpenter.sh",
  "controllerKind": "NodeClaim",
  "NodeClaim": {
    "name": "default-v64qz"
  },
  "namespace": "",
  "name": "default-v64qz",
  "reconcileID": "30d5bee9-399d-42fa-87a0-533677cbb908",
  "error": "missing required startup taint, karpenter.sh/unregistered"
}

This is the only error I get from the controller.

On the NodeClaim I see these conditions:

Conditions:
  ...
  ...
  Last Transition Time:  2024-08-21T12:13:15Z
  Message:               Registered=False
  Reason:                UnhealthyDependents
  Status:                False
  Type:                  Ready
  Last Transition Time:  2024-08-21T12:13:01Z
  Message:               Invariant violated, karpenter.sh/unregistered taint must be present on Karpenter-managed nodes
  Reason:                UnregisteredTaintNotFound
  Status:                False
  Type:                  Registered

Expected Behavior: The Karpenter NodeClaim READY state should be True. Reverting to 0.37.1 makes everything work normally again.

This happens to me even with a from-scratch setup. I assume Karpenter 1.0.0 works for other users?

Reproduction Steps (Please include YAML): Install Karpenter v1.0.0 from scratch, or migrate from 0.37.1 to 1.0.0.

Versions:

jmdeal commented 3 weeks ago

What does your EC2NodeClass look like? Are you using the Custom AMI family? If so you need to ensure your userdata configures the kubelet to register your nodes with the karpenter.sh/unregistered:true taint (v1 migration guide). If you're not using Custom this taint should be added automatically.

ArieLevs commented 3 weeks ago

thanks @jmdeal

I'm not using the Custom family, I'm using AL2. But I do set the userData block, since historically I had to update the 99-kubernetes-cri.conf file. After adding the taint, or actually removing the userData block entirely, everything worked as expected 🥳

I found the migration document a bit misleading, as it states that the taint should be added only when the Custom family is used, not whenever the userData block is used in general. I've prepared a change for the documentation: https://github.com/aws/karpenter-provider-aws/commit/f92c3167da5a1435af67585aff19bc22dc434627.
While testing this, though, I had to add the taint in my userData block when using the AL2 type but not when using AL2023, so I'm a bit confused here.

jmdeal commented 3 weeks ago

You shouldn't need to add the taint via UserData if you're using any of the managed AMI families, including AL2. Are you able to share what your UserData looks like?

ArieLevs commented 2 weeks ago

sure, the userdata.sh file content is

#!/bin/bash -e

sysctl -p /etc/sysctl.d/99-kubernetes-cri.conf

# Bootstrap and join the cluster
/etc/eks/bootstrap.sh --b64-cluster-ca '${cluster_auth_base64}' --apiserver-endpoint '${endpoint}' ${bootstrap_extra_args} --kubelet-extra-args "${kubelet_extra_args}" '${cluster_name}'

and the node class is

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
  labels:
    ...
spec:
  amiFamily: AL2
  ...
  userData: |
    ${templated_user_data}

jmdeal commented 2 weeks ago

If you're calling the bootstrap script directly you should be using the custom AMI family, not AL2. Is there any reason you need to call bootstrap yourself, rather than leveraging Karpenter's generated UserData? You would still be able to include your cri file update in your UserData, just not the bootstrap script. More details on how AL2 UserData is merged can be found in the docs.
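
For illustration, the suggestion above might look roughly like the sketch below: the node class stays on amiFamily: AL2 and the userData carries only the sysctl step, with no call to /etc/eks/bootstrap.sh, so Karpenter's generated bootstrap invocation (and the taint it adds) is left intact. This assumes a plain bash script is an accepted userData form for AL2; the elided fields are left elided as in the thread, so this is a fragment, not a complete manifest.

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  # role, selector terms, etc. elided, as in the original node class
  userData: |
    #!/bin/bash -e
    # Only the custom step from the thread; no bootstrap.sh call here.
    # Karpenter is expected to merge its own bootstrap step after this script.
    sysctl -p /etc/sysctl.d/99-kubernetes-cri.conf
```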

ArieLevs commented 2 weeks ago

Originally we used AL2 without any special overrides. Following the docs back in the day, we had to add dedicated user data (for all kinds of extra security executions). Now we've ended up using just AL2 without any custom userData block.

I mean it's all working: using AL2/AL2023 without a userData block, or Custom with userData and the karpenter.sh/unregistered taint, is all OK. It's just that the documentation states to add it only when using the Custom family, but this seems to be true for AL2 as well.

jmdeal commented 2 weeks ago

It's only true if you're calling the bootstrap script yourself in the UserData. I think the action we should take here is to be explicit that if you need to call the bootstrap script yourself, you need to use the custom AMI family rather than AL2.
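
A minimal sketch of the other working combination described in this thread, for anyone who does need to call the bootstrap script themselves: switch to the Custom AMI family and register the node with the taint through the kubelet's --register-with-taints flag. The template variables are the ones from the userdata.sh shared above. The NoExecute effect is an assumption on my part (the thread only names the taint key), so check the v1 migration guide for the exact form. The elided fields stay elided, and any kubelet arguments you already pass would need to be combined with the flag shown here.

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Custom
  # role, selector terms, etc. elided, as in the original node class
  userData: |
    #!/bin/bash -e
    sysctl -p /etc/sysctl.d/99-kubernetes-cri.conf

    # Bootstrap and join the cluster, registering with the startup taint.
    # Taint effect (NoExecute) is assumed; verify against the migration guide.
    /etc/eks/bootstrap.sh \
      --b64-cluster-ca '${cluster_auth_base64}' \
      --apiserver-endpoint '${endpoint}' \
      --kubelet-extra-args '--register-with-taints=karpenter.sh/unregistered:NoExecute' \
      '${cluster_name}'
```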

flavono123 commented 3 days ago

In my case I set both amiFamily and amiSelectorTerms to use the FULL Windows AMI:

spec:
  amiFamily: Windows2019
  amiSelectorTerms:
    - name: Windows_Server-2019-English-Full-EKS_Optimized-*

Do I need to add the taint via user data? (I guessed the above type of AMI is not "Custom".)

a dup inquiry in slack

flavono123 commented 3 days ago

https://github.com/aws/karpenter-provider-aws/issues/6821#issuecomment-2337202135

Setting register-with-taint in user data resolves this issue, btw.

jmdeal commented 2 days ago

You shouldn't need to, as long as you're relying on Karpenter's call to the bootstrap script. It will automatically add the register-with-taint flag to its invocation.