aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter nodes go to NotReady state without AWS nodegroup #6627

Open mohammad-mahmoudian-dynata opened 1 month ago

mohammad-mahmoudian-dynata commented 1 month ago

Description

Observed Behavior: I have Karpenter running on EKS. If I don't have a node group, the nodes become NotReady in Kubernetes and their status becomes Unknown in the EKS AWS console.

Expected Behavior: Karpenter should not need a node group present; I don't have one because I am running Karpenter and CoreDNS on Fargate.
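
For reference, the Fargate side of this setup can be confirmed with the commands below; the profile name karpenter is only an example and may differ in your cluster:

# List the Fargate profiles on the cluster
aws eks list-fargate-profiles --cluster-name eks-qa --profile voice_dev

# Inspect a profile's pod execution role and namespace selectors (e.g. karpenter, kube-system)
aws eks describe-fargate-profile --cluster-name eks-qa --fargate-profile-name karpenter --profile voice_dev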

Reproduction Steps (Please include YAML): Install Karpenter using the method in this doc and run it without a node group.

Steps to install/upgrade karpenter

# Set Cluster Variables
export KARPENTER_NAMESPACE=karpenter
export KARPENTER_VERSION=v0.37.0
export AWS_PARTITION="aws"
export CLUSTER_NAME="eks-qa"
export AWS_REGION="us-east-1"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text --profile voice_dev)"
export KARPENTER_IAM_ROLE_ARN="arn:aws:iam::12222222222:role/eks-qa-karpenter-controller"
export ROLE_NAME="eks-qa-karpenter-controller"
export CLUSTER_ENDPOINT="$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.endpoint" --output text --profile voice_dev)"
export AWS_REGION=${AWS_REGION:=$AWS_DEFAULT_REGION}
export TEMPOUT=$(mktemp)
export INSTANCE_PROFILE="eks-qa-worker-iam-instance"

# check the instance profile 
aws iam get-instance-profile --instance-profile-name eks-qa-worker-iam-instance  --profile voice_dev

echo $AWS_ACCOUNT_ID
echo $KARPENTER_IAM_ROLE_ARN
echo $CLUSTER_ENDPOINT
echo $POLICY_NAME

curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/v0.37.0/website/content/en/preview/upgrading/v1beta1-controller-policy.json > "${TEMPOUT}"

POLICY_DOCUMENT=$(<"${TEMPOUT}")
POLICY_NAME="KarpenterControllerPolicy-${CLUSTER_NAME}-v1beta1"

cat << EOF > controller-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowScopedEC2InstanceActions",
      "Effect": "Allow",
      "Resource": [
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}::image/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}::snapshot/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:spot-instances-request/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:security-group/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:subnet/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:launch-template/*"
      ],
      "Action": [
        "ec2:RunInstances",
        "ec2:CreateFleet"
      ]
    },
    {
      "Sid": "AllowScopedEC2InstanceActionsWithTags",
      "Effect": "Allow",
      "Resource": [
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:fleet/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:instance/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:volume/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:network-interface/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:launch-template/*"
      ],
      "Action": [
        "ec2:RunInstances",
        "ec2:CreateFleet",
        "ec2:CreateLaunchTemplate"
      ],
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned"
        },
        "StringLike": {
          "aws:RequestTag/karpenter.sh/nodepool": "*"
        }
      }
    },
    {
      "Sid": "AllowScopedResourceCreationTagging",
      "Effect": "Allow",
      "Resource": [
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:fleet/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:instance/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:volume/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:network-interface/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:launch-template/*"
      ],
      "Action": "ec2:CreateTags",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
          "ec2:CreateAction": [
            "RunInstances",
            "CreateFleet",
            "CreateLaunchTemplate"
          ]
        },
        "StringLike": {
          "aws:RequestTag/karpenter.sh/nodepool": "*"
        }
      }
    },
    {
      "Sid": "AllowScopedResourceTagging",
      "Effect": "Allow",
      "Resource": "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:instance/*",
      "Action": "ec2:CreateTags",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned"
        },
        "StringLike": {
          "aws:ResourceTag/karpenter.sh/nodepool": "*"
        },
        "ForAllValues:StringEquals": {
          "aws:TagKeys": [
            "karpenter.sh/nodeclaim",
            "Name"
          ]
        }
      }
    },
    {
      "Sid": "AllowScopedDeletion",
      "Effect": "Allow",
      "Resource": [
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:instance/*",
        "arn:${AWS_PARTITION}:ec2:${AWS_REGION}:*:launch-template/*"
      ],
      "Action": [
        "ec2:TerminateInstances",
        "ec2:DeleteLaunchTemplate"
      ],
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned"
        },
        "StringLike": {
          "aws:ResourceTag/karpenter.sh/nodepool": "*"
        }
      }
    },
    {
      "Sid": "AllowRegionalReadActions",
      "Effect": "Allow",
      "Resource": "*",
      "Action": [
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeImages",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceTypeOfferings",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplates",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSpotPriceHistory",
        "ec2:DescribeSubnets"
      ],
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "${AWS_REGION}"
        }
      }
    },
    {
      "Sid": "AllowSSMReadActions",
      "Effect": "Allow",
      "Resource": "arn:${AWS_PARTITION}:ssm:${AWS_REGION}::parameter/aws/service/*",
      "Action": "ssm:GetParameter"
    },
    {
      "Sid": "AllowPricingReadActions",
      "Effect": "Allow",
      "Resource": "*",
      "Action": "pricing:GetProducts"
    },
    {
      "Sid": "AllowInterruptionQueueActions",
      "Effect": "Allow",
      "Resource": "arn:${AWS_PARTITION}:sqs:${AWS_REGION}:${AWS_ACCOUNT_ID}:${CLUSTER_NAME}",
      "Action": [
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes",
        "sqs:GetQueueUrl",
        "sqs:ReceiveMessage"
      ]
    },
    {
      "Sid": "AllowPassingInstanceRole",
      "Effect": "Allow",
      "Resource": "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/${INSTANCE_PROFILE}",
      "Action": "iam:PassRole",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "ec2.amazonaws.com"
        }
      }
    },
    {
      "Sid": "AllowScopedInstanceProfileCreationActions",
      "Effect": "Allow",
      "Resource": "*",
      "Action": "iam:CreateInstanceProfile",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
          "aws:RequestTag/topology.kubernetes.io/region": "${AWS_REGION}"
        },
        "StringLike": {
          "aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*"
        }
      }
    },
    {
      "Sid": "AllowScopedInstanceProfileTagActions",
      "Effect": "Allow",
      "Resource": "*",
      "Action": "iam:TagInstanceProfile",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
          "aws:ResourceTag/topology.kubernetes.io/region": "${AWS_REGION}",
          "aws:RequestTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
          "aws:RequestTag/topology.kubernetes.io/region": "${AWS_REGION}"
        },
        "StringLike": {
          "aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*",
          "aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*"
        }
      }
    },
    {
      "Sid": "AllowScopedInstanceProfileActions",
      "Effect": "Allow",
      "Resource": "*",
      "Action": [
        "iam:AddRoleToInstanceProfile",
        "iam:RemoveRoleFromInstanceProfile",
        "iam:DeleteInstanceProfile"
      ],
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
          "aws:ResourceTag/topology.kubernetes.io/region": "${AWS_REGION}"
        },
        "StringLike": {
          "aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*"
        }
      }
    },
    {
      "Sid": "AllowInstanceProfileReadActions",
      "Effect": "Allow",
      "Resource": "*",
      "Action": "iam:GetInstanceProfile"
    },
    {
      "Sid": "AllowAPIServerEndpointDiscovery",
      "Effect": "Allow",
      "Resource": "arn:${AWS_PARTITION}:eks:${AWS_REGION}:${AWS_ACCOUNT_ID}:cluster/${CLUSTER_NAME}",
      "Action": "eks:DescribeCluster"
    }
  ]
}
EOF
aws iam create-policy --policy-name "${POLICY_NAME}" --policy-document file://controller-policy.json  --profile voice_dev

POLICY_ARN="arn:aws:iam::123445566676:policy/KarpenterControllerPolicy-eks-qa-v1beta1"
aws iam attach-role-policy --role-name "${ROLE_NAME}" --policy-arn "${POLICY_ARN}"  --profile voice_dev
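
As a quick sanity check (not part of the original steps), the attachment can be verified; the output should list the KarpenterControllerPolicy ARN:

# Verify the controller policy is attached to the controller role
aws iam list-attached-role-policies --role-name "${ROLE_NAME}" --profile voice_dev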

export KARPENTER_VERSION=v0.37.0

helm  upgrade  karpenter-crd  oci://public.ecr.aws/karpenter/karpenter-crd  --version ${KARPENTER_VERSION} --namespace karpenter \
    --kube-context arn:aws:eks:us-east-1:12222222222:cluster/eks-qa

helm  upgrade  karpenter oci://public.ecr.aws/karpenter/karpenter --version ${KARPENTER_VERSION} --namespace karpenter \
    --set settings.aws.defaultInstanceProfile=${INSTANCE_PROFILE} \
    --set settings.aws.clusterName=${CLUSTER_NAME} \
    --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="arn:aws:iam::12222222222:role/eks-qa-karpenter-controller" \
    --set controller.resources.requests.cpu=1 \
    --set controller.resources.requests.memory=1Gi \
    --set controller.resources.limits.cpu=1 \
    --set controller.resources.limits.memory=1Gi \
    --kube-context arn:aws:eks:us-east-1:12222222222:cluster/eks-qa
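
After the upgrade, a rough health check of the deployment (these verification commands are mine, not from the install doc): the controller pods should be Running on Fargate and the logs free of credential errors.

# Confirm chart version, pod placement, and recent controller logs
helm list -n karpenter --kube-context arn:aws:eks:us-east-1:12222222222:cluster/eks-qa
kubectl get pods -n karpenter -o wide
kubectl logs -n karpenter deployment/karpenter --tail=50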

Node events and kubelet logs


  Normal  Starting                 10m                  kube-proxy
  Normal  Starting                 10m                  kubelet                Starting kubelet.
  Normal  NodeHasSufficientMemory  10m (x2 over 10m)    kubelet                Node ip-10-211-186-14.ec2.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    10m (x2 over 10m)    kubelet                Node ip-10-211-186-14.ec2.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     10m (x2 over 10m)    kubelet                Node ip-10-211-186-14.ec2.internal status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  10m                  kubelet                Updated Node Allocatable limit across pods
  Normal  Synced                   10m                  cloud-node-controller  Node synced successfully
  Normal  RegisteredNode           10m                  node-controller        Node ip-10-211-186-14.ec2.internal event: Registered Node ip-10-211-186-14.ec2.internal in Controller
  Normal  NodeReady                10m                  kubelet                Node ip-10-211-186-14.ec2.internal status is now: NodeReady
  Normal  DisruptionBlocked        9m54s                karpenter              Cannot disrupt Node: Nominated for a pending pod
  Normal  NodeNotReady             2m38s                node-controller        Node ip-10-211-186-14.ec2.internal status is now: NodeNotReady
  Normal  DisruptionBlocked        30s (x3 over 7m53s)  karpenter              Cannot disrupt Node: PDB "sre-services/sre-hazelcast-pdb" prevents pod evictions

  Jul 31 20:12:32 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:32.269089    3344 webhook.go:154] Failed to make webhook authenticator request: Unauthorized
Jul 31 20:12:32 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:32.269160    3344 server.go:310] "Unable to authenticate the request due to an error" err="Unauthorized"
Jul 31 20:12:32 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:32.542093    3344 controller.go:145] "Failed to ensure lease exists, will retry" err="Unauthorized" interval="7s"
Jul 31 20:12:34 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:34.108095    3344 webhook.go:154] Failed to make webhook authenticator request: Unauthorized
Jul 31 20:12:34 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:34.108860    3344 server.go:310] "Unable to authenticate the request due to an error" err="Unauthorized"
Jul 31 20:12:34 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:34.109089    3344 server.go:310] "Unable to authenticate the request due to an error" err="Unauthorized"
Jul 31 20:12:34 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:34.327773    3344 kubelet_node_status.go:549] "Error updating node status, will retry" err="error getting node \"ip-10-211-182-17.ec2.internal\": Unauthorized"
Jul 31 20:12:34 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:34.375546    3344 kubelet_node_status.go:549] "Error updating node status, will retry" err="error getting node \"ip-10-211-182-17.ec2.internal\": Unauthorized"
Jul 31 20:12:34 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:34.427436    3344 kubelet_node_status.go:549] "Error updating node status, will retry" err="error getting node \"ip-10-211-182-17.ec2.internal\": Unauthorized"
Jul 31 20:12:34 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:34.478114    3344 kubelet_node_status.go:549] "Error updating node status, will retry" err="error getting node \"ip-10-211-182-17.ec2.internal\": Unauthorized"
Jul 31 20:12:34 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:34.526616    3344 kubelet_node_status.go:549] "Error updating node status, will retry" err="error getting node \"ip-10-211-182-17.ec2.internal\": Unauthorized"
Jul 31 20:12:34 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:34.526651    3344 kubelet_node_status.go:536] "Unable to update node status" err="update node status exceeds retry count"

Jul 31 20:12:29 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:29.100768    3344 webhook.go:154] Failed to make webhook authenticator request: Unauthorized
Jul 31 20:12:29 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:29.100830    3344 server.go:310] "Unable to authenticate the request due to an error" err="Unauthorized"
Jul 31 20:12:29 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:29.101729    3344 server.go:310] "Unable to authenticate the request due to an error" err="Unauthorized"
Jul 31 20:12:29 ip-10-211-182-17.ec2.internal kubelet[3344]: W0731 20:12:29.145815    3344 reflector.go:539] object-"kubecost"/"kube-root-ca.crt": failed to list *v1.ConfigMap: Unauthorized
Jul 31 20:12:29 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:29.145857    3344 reflector.go:147] object-"kubecost"/"kube-root-ca.crt": Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Unauthorized
Jul 31 20:12:29 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:29.631723    3344 webhook.go:154] Failed to make webhook authenticator request: Unauthorized
Jul 31 20:12:29 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:29.631837    3344 server.go:310] "Unable to authenticate the request due to an error" err="Unauthorized"
Jul 31 20:12:29 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:29.631996    3344 server.go:310] "Unable to authenticate the request due to an error" err="Unauthorized"
Jul 31 20:12:29 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:29.862233    3344 webhook.go:253] Failed to make webhook authorizer request: Unauthorized
Jul 31 20:12:29 ip-10-211-182-17.ec2.internal kubelet[3344]: E0731 20:12:29.862274    3344 server.go:325] "Authorization error" err="Unauthorized" user="kube-apiserver-kubelet-client" verb="get" resource="nodes" subresource="metrics"
Jul 31 20:12:3
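
Repeated Unauthorized errors from the kubelet typically mean the node's IAM role is not mapped to a Kubernetes identity. One way to check (assuming the node role from the instance profile above) is to look for it in the aws-auth ConfigMap, or in the cluster's access entries if those are used instead:

# Check whether the Karpenter node role is mapped for kubelet authentication
kubectl get configmap aws-auth -n kube-system -o yaml

# Or, if the cluster uses EKS access entries
aws eks list-access-entries --cluster-name eks-qa --profile voice_dev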

Versions:

njtran commented 1 month ago

Looks like you might have RBAC issues: err="error getting node \"ip-10-211-182-17.ec2.internal\": Unauthorized"

Can you check that you have the karpenter roles, clusterroles, and their bindings?

mohammad-mahmoudian-dynata commented 1 month ago

Looks like you might have RBAC issues: err="error getting node "ip-10-211-182-17.ec2.internal": Unauthorized

Can you check that you have the karpenter roles, clusterroles, and their bindings?

Yes, I checked I have those in place. Please let me know if you need more information

% kubectl get roles -n karpenter
NAME        CREATED AT
karpenter   2024-07-31T06:04:01Z
$ eks-voice-qa % kubectl get rolebindings -n karpenter
NAME        ROLE             AGE
karpenter   Role/karpenter   2d11h
$ eks-voice-qa % kubectl get clusterroles | grep karpenter
karpenter                                                              2024-07-31T06:04:01Z
karpenter-admin                                                        2024-07-31T06:04:01Z
karpenter-core                                                         2024-07-31T06:04:01Z
$ eks-voice-qa % kubectl get clusterrolebindings | grep karpenter
karpenter                                                ClusterRole/karpenter                                                2d11h
karpenter-core                                           ClusterRole/karpenter-core                                           2d11h
njtran commented 1 month ago
--set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="arn:aws:iam::12222222222:role/eks-qa-karpenter-controller" \

I see you're using IRSA. Can you make sure IRSA is set up properly in your account?
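
A rough way to verify that, assuming the role ARN used in the Helm values: the cluster's OIDC issuer must exist as an IAM OIDC provider, and the controller role's trust policy must allow sts:AssumeRoleWithWebIdentity for the karpenter service account through it.

# Cluster OIDC issuer and registered IAM OIDC providers
aws eks describe-cluster --name eks-qa --query "cluster.identity.oidc.issuer" --output text --profile voice_dev
aws iam list-open-id-connect-providers --profile voice_dev

# Trust policy of the controller role
aws iam get-role --role-name eks-qa-karpenter-controller --query "Role.AssumeRolePolicyDocument" --profile voice_dev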

njtran commented 1 month ago

It might be worth asking in our Karpenter Slack channel. Some users might be running into issues like yours.