aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Nodes are created but are not being registered to the cluster #6303

Open nalshamaajc opened 1 month ago

nalshamaajc commented 1 month ago

Description

Observed Behavior: Nodes are created but are not being registered to the cluster

Expected Behavior: Nodes are created and are registered to the cluster

Reproduction Steps (Please include YAML): Try upgrading from v0.31.5 to v0.32.0 and follow the upgrade guide. I skipped some steps because I use a Helm chart.

I was also using the role parameter rather than instanceProfile in the EC2NodeClass.

The NodePool creates a NodeClaim, but the Node is never created.

The NodeClaim status section shows the following:

status:
  allocatable:
    cpu: 7910m
    ephemeral-storage: 107Gi
    memory: 14162Mi
    pods: '58'
    vpc.amazonaws.com/pod-eni: '38'
  capacity:
    cpu: '8'
    ephemeral-storage: 120Gi
    memory: 15155Mi
    pods: '58'
    vpc.amazonaws.com/pod-eni: '38'
  conditions:
    - lastTransitionTime: '2024-05-30T13:31:00Z'
      message: Node not registered with cluster
      reason: NodeNotFound
      status: 'False'
      type: Initialized
    - lastTransitionTime: '2024-05-30T13:31:00Z'
      status: 'True'
      type: Launched
    - lastTransitionTime: '2024-05-30T13:31:00Z'
      message: Node not registered with cluster
      reason: NodeNotFound
      status: 'False'
      type: Ready
    - lastTransitionTime: '2024-05-30T13:31:00Z'
      message: Node not registered with cluster
      reason: NodeNotFound
      status: 'False'
      type: Registered
  imageID: ami-0fe93db2c62573b1a
  providerID: 'aws:///us-west-2c/i-0f84d19de2821fd6c'

In the logs I noticed the following entry:

"level":"DEBUG","time":"2024-05-30T13:46:01.646Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"f0eb822","nodeclaim":"default-hbgnd","nodepool":"default","ttl":"15m0s"}

The docs state that when using a role, Karpenter should be able to create and manage the instance profile itself, but it seems that it does not add the node role mapping to the aws-auth ConfigMap when using IRSA, so the new nodes cannot authenticate to the API server and never register.
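
For reference, this is a sketch of the mapRoles entry the node role needs in the aws-auth ConfigMap; the account ID and role name are placeholders for illustration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # Placeholder ARN; substitute the role referenced by the EC2NodeClass.
    - rolearn: arn:aws:iam::111122223333:role/KarpenterNodeRole-CLUSTER
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes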

Migration guide

For each EC2NodeClass, specify the $KARPENTER_NODE_ROLE you will use for nodes launched with this node class. Karpenter v1beta1 drops the need for managing your own instance profile and uses node roles directly. The example below shows how to migrate your AWSNodeTemplate to an EC2NodeClass if your node role is the same role that was used when creating your cluster with the Getting Started Guide.
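
A minimal EC2NodeClass along those lines might look like the sketch below, assuming the node role from the Getting Started Guide; the role name and discovery tags are illustrative:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  # Karpenter generates and manages the instance profile from this role.
  role: KarpenterNodeRole-CLUSTER
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: CLUSTER
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: CLUSTER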

Changelog

Karpenter will now auto-generate the instance profile in your EC2NodeClass, given the role that you specify.

Solution

The problem was solved when I used instanceProfile instead of role in the EC2NodeClass. The role behind that instance profile was already mapped in the aws-auth ConfigMap, which is why the nodes could join.
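
For comparison, the working variant is sketched below with a placeholder profile name; as I understand it, spec.instanceProfile and spec.role are mutually exclusive:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  # Pre-existing profile; its role was already mapped in aws-auth.
  instanceProfile: KarpenterNodeInstanceProfile-CLUSTER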

Versions: v0.32.10

engedaam commented 1 month ago

Can you share the role that was being added to the EC2NodeClass?

nalshamaajc commented 1 month ago

@engedaam I assume you're interested in the policy attached to this role.

{
    "Statement": [
        {
            "Action": [
                "ec2:RunInstances",
                "ec2:CreateFleet"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:ec2:REGION::snapshot/*",
                "arn:aws:ec2:REGION::image/*",
                "arn:aws:ec2:REGION:*:subnet/*",
                "arn:aws:ec2:REGION:*:spot-instances-request/*",
                "arn:aws:ec2:REGION:*:security-group/*",
                "arn:aws:ec2:REGION:*:launch-template/*"
            ],
            "Sid": "AllowScopedEC2InstanceActions"
        },
        {
            "Action": [
                "ec2:RunInstances",
                "ec2:CreateLaunchTemplate",
                "ec2:CreateFleet"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/kubernetes.io/cluster/CLUSTER": "owned"
                },
                "StringLike": {
                    "aws:RequestTag/karpenter.sh/nodepool": "*"
                }
            },
            "Effect": "Allow",
            "Resource": [
                "arn:aws:ec2:REGION:*:volume/*",
                "arn:aws:ec2:REGION:*:network-interface/*",
                "arn:aws:ec2:REGION:*:launch-template/*",
                "arn:aws:ec2:REGION:*:instance/*",
                "arn:aws:ec2:REGION:*:fleet/*"
            ],
            "Sid": "AllowScopedEC2InstanceActionsWithTags"
        },
        {
            "Action": "ec2:CreateTags",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/kubernetes.io/cluster/CLUSTER": "owned",
                    "ec2:CreateAction": [
                        "RunInstances",
                        "CreateFleet",
                        "CreateLaunchTemplate"
                    ]
                },
                "StringLike": {
                    "aws:RequestTag/karpenter.sh/nodepool": "*"
                }
            },
            "Effect": "Allow",
            "Resource": [
                "arn:aws:ec2:REGION:*:volume/*",
                "arn:aws:ec2:REGION:*:network-interface/*",
                "arn:aws:ec2:REGION:*:launch-template/*",
                "arn:aws:ec2:REGION:*:instance/*",
                "arn:aws:ec2:REGION:*:fleet/*"
            ],
            "Sid": "AllowScopedResourceCreationTagging"
        },
        {
            "Action": "ec2:CreateTags",
            "Condition": {
                "ForAllValues:StringEquals": {
                    "aws:TagKeys": [
                        "karpenter.sh/nodeclaim",
                        "Name"
                    ]
                },
                "StringEquals": {
                    "aws:ResourceTag/kubernetes.io/cluster/CLUSTER": "owned"
                },
                "StringLike": {
                    "aws:ResourceTag/karpenter.sh/nodepool": "*"
                }
            },
            "Effect": "Allow",
            "Resource": "arn:aws:ec2:REGION:*:instance/*",
            "Sid": "AllowScopedResourceTagging"
        },
        {
            "Action": [
                "ec2:TerminateInstances",
                "ec2:DeleteLaunchTemplate"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/kubernetes.io/cluster/CLUSTER": "owned"
                },
                "StringLike": {
                    "aws:ResourceTag/karpenter.sh/nodepool": "*"
                }
            },
            "Effect": "Allow",
            "Resource": [
                "arn:aws:ec2:REGION:*:launch-template/*",
                "arn:aws:ec2:REGION:*:instance/*"
            ],
            "Sid": "AllowScopedDeletion"
        },
        {
            "Action": [
                "ec2:DescribeSubnets",
                "ec2:DescribeSpotPriceHistory",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DescribeImages",
                "ec2:DescribeAvailabilityZones"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "REGION"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "AllowRegionalReadActions"
        },
        {
            "Action": "ssm:GetParameter",
            "Effect": "Allow",
            "Resource": "arn:aws:ssm:REGION::parameter/aws/service/*",
            "Sid": "AllowSSMReadActions"
        },
        {
            "Action": "pricing:GetProducts",
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "AllowPricingReadActions"
        },
        {
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:GetQueueUrl",
                "sqs:GetQueueAttributes",
                "sqs:DeleteMessage"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:sqs:REGION:xxxxxxxxxxxxxx:CLUSTER",
            "Sid": "AllowInterruptionQueueActions"
        },
        {
            "Action": "iam:PassRole",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "ec2.amazonaws.com"
                }
            },
            "Effect": "Allow",
            "Resource": "arn:aws:iam::xxxxxxxxxxxxxx:role/CLUSTER",
            "Sid": "AllowPassingInstanceRole"
        },
        {
            "Action": "iam:CreateInstanceProfile",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/kubernetes.io/cluster/CLUSTER": "owned",
                    "aws:RequestTag/topology.kubernetes.io/region": "REGION"
                },
                "StringLike": {
                    "aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "AllowScopedInstanceProfileCreationActions"
        },
        {
            "Action": "iam:TagInstanceProfile",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/kubernetes.io/cluster/CLUSTER": "owned",
                    "aws:RequestTag/topology.kubernetes.io/region": "REGION",
                    "aws:ResourceTag/kubernetes.io/cluster/CLUSTER": "owned",
                    "aws:ResourceTag/topology.kubernetes.io/region": "REGION"
                },
                "StringLike": {
                    "aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*",
                    "aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "AllowScopedInstanceProfileTagActions"
        },
        {
            "Action": [
                "iam:RemoveRoleFromInstanceProfile",
                "iam:DeleteInstanceProfile",
                "iam:AddRoleToInstanceProfile"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/kubernetes.io/cluster/CLUSTER": "owned",
                    "aws:ResourceTag/topology.kubernetes.io/region": "REGION"
                },
                "StringLike": {
                    "aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "AllowScopedInstanceProfileActions"
        },
        {
            "Action": "iam:GetInstanceProfile",
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "AllowInstanceProfileReadActions"
        },
        {
            "Action": "eks:DescribeCluster",
            "Effect": "Allow",
            "Resource": "arn:aws:eks:REGION:xxxxxxxxxxxxxx:cluster/CLUSTER",
            "Sid": "AllowAPIServerEndpointDiscovery"
        }
    ],
    "Version": "2012-10-17"
}

engedaam commented 1 month ago

I'm interested in the KARPENTER_NODE_ROLE here. This would be the role that was specified in the EC2NodeClass. Also, have you looked at the troubleshooting guide for guidance on this issue? https://karpenter.sh/docs/troubleshooting/#node-notready

nalshamaajc commented 1 month ago

@engedaam I cannot expose the role details here; are there specific details you are interested in? Also, the nodes weren't showing in the node list, but checking under the EC2 instances tab showed that the instances were created and running.

engedaam commented 1 month ago

I understand. I was mainly looking to make sure the node role contains the permissions needed to join the cluster. By default, here is the node role the team recommends: https://karpenter.sh/v0.32/reference/cloudformation/#node-authorization. Also, the troubleshooting guide gives steps for understanding why a node might not be joining the cluster and how to fix the issue.
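
The recommended node role from that CloudFormation reference looks roughly like the sketch below; resource and parameter names here are illustrative, not copied from the template:

KarpenterNodeRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: !Sub "KarpenterNodeRole-${ClusterName}"
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: ec2.amazonaws.com
          Action: sts:AssumeRole
    ManagedPolicyArns:
      # Standard EKS worker-node policies the kubelet needs to join and run.
      - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonEKSWorkerNodePolicy"
      - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonEKS_CNI_Policy"
      - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
      - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonSSMManagedInstanceCore"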

engedaam commented 1 month ago

@nalshamaajc Any progress in getting nodes to join?

solidaeon commented 1 month ago

Is there a way to automatically patch the aws-auth ConfigMap to add the Karpenter node role?
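
One option, if the cluster is managed with eksctl, is to declare the mapping in the ClusterConfig; this is a sketch with placeholder names, assuming a recent eksctl that reads iamIdentityMappings from the config file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: CLUSTER
  region: REGION
iamIdentityMappings:
  # Placeholder ARN; use the Karpenter node role here.
  - arn: arn:aws:iam::111122223333:role/KarpenterNodeRole-CLUSTER
    username: system:node:{{EC2PrivateDNSName}}
    groups:
      - system:bootstrappers
      - system:nodes

With that in place, something like eksctl create iamidentitymapping -f cluster.yaml should apply the mapping without hand-editing aws-auth.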

github-actions[bot] commented 2 weeks ago

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

gagbo commented 4 days ago

I'm also trying to find something for that, but I haven't found anything. Should I use the aws-auth module with output values from the karpenter module? Or should I use the IAM role for IRSA module? Or both, with a specific argument configuration?