aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

WebIdentityErr: failed to retrieve credentials #1313

Closed Izvi-digibank closed 2 years ago

Izvi-digibank commented 2 years ago

I am using Karpenter for two clusters in the same AWS account. Both clusters use the same roles, and the provisioners are identical (private subnet). aws-auth is configured with the KarpenternodeRole-cluster. Cluster A works perfectly, but in cluster B I get the following error:

2022-02-10T10:33:35.421Z ERROR controller.controller.provisioning Reconciler error {"commit": "2346ed5", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "name": "workflows-provisioner", "namespace": "", "error": "fetching instance types using ec2.DescribeInstanceTypes, WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.eu-west-1.amazonaws.com/\": dial tcp: i/o timeout"}

Here are some details:

KarpenterController role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:CreateLaunchTemplate",
                "ec2:CreateFleet",
                "ec2:RunInstances",
                "ec2:CreateTags",
                "iam:PassRole",
                "ec2:TerminateInstances",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeInstances",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DescribeAvailabilityZones",
                "ssm:GetParameter"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

KarpenterController trust relationships:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": [
                    "arn:aws:iam::***:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/clusterA",
                    "arn:aws:iam::***:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/clusterB"
                ]
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringLike": {
                    "oidc.eks.eu-west-1.amazonaws.com/id/clusterA": "system:serviceaccount:karpenter:karpenter",
                    "oidc.eks.eu-west-1.amazonaws.com/id/clusterB": "system:serviceaccount:karpenter:karpenter"
                }
            }
        }
    ]
}

KarpenterNodeInstanceProfile-clusterB has the following policies:

AmazonEKSWorkerNodePolicy
AmazonEC2ContainerRegistryReadOnly
AmazonSSMManagedInstanceCore
AmazonEKS_CNI_Policy

KarpenterNodeInstanceProfile-clusterB trust relationships:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

When trying to retrieve the node ID in clusterB I get null:

➜  ~ kubectl get node -l karpenter.sh/provisioner-name -ojson | jq -r ".items[0].spec.providerID" | cut -d \/ -f5
null
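
A `null` here does not necessarily point at IAM. jq prints `null` whenever `.items[0]` does not exist, which is what happens when no node carries the `karpenter.sh/provisioner-name` label yet. The sketch below does the same extraction in Python and makes the empty case explicit. The function name `first_instance_id` is hypothetical, and the `aws:///<zone>/<instance-id>` layout of `providerID` is assumed from the `cut -d / -f5` in the command above.

```python
import json


def first_instance_id(node_list_json: str):
    """Return the EC2 instance ID of the first node, or None if there is none.

    Expects the JSON printed by `kubectl get node -l <label> -ojson`.
    """
    items = json.loads(node_list_json).get("items", [])
    if not items:
        # No node matches the label selector, e.g. nothing was provisioned yet.
        return None
    provider_id = items[0].get("spec", {}).get("providerID")
    if not provider_id:
        return None
    # Assumed layout: aws:///<availability-zone>/<instance-id>
    return provider_id.rsplit("/", 1)[-1]
```

Feeding it the output of the `kubectl get node ... -ojson` command above returns `None` for an empty node list instead of the string `null`.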

bwagner5 commented 2 years ago

The error you posted generally indicates an issue with Internet connectivity of the Karpenter controller pod. This could be due to the node's connectivity, the pod's connectivity (the IP assigned via the CNI), or DNS (node or pod level).

Can you double check your node's connectivity by curl'ing something from the instance? If that works, you'll want to check the pod's network configuration and try to access the network from the pod itself.

If the node is unable to access Internet resources, you can check the subnet's route table for a proper NAT GW or Internet GW setup. Another thing to check is the outbound security group rules on the node.
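
As a rough way to tell DNS problems apart from routing or security group problems, the two steps can be checked separately from inside the controller pod (or from the node). This is a minimal sketch using only the Python standard library; the function name `check_endpoint` and the 5 second timeout are assumptions, and it only tests name resolution and the TCP handshake, not TLS or STS itself.

```python
import socket


def check_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return "ok", or a short description of the step that failed."""
    try:
        # Step 1: DNS resolution (the resolver's own timeout applies here).
        addrinfo = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return f"dns failure: {exc}"

    # Step 2: TCP connect to the first resolved address.
    family, socktype, proto, _, sockaddr = addrinfo[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:  # includes timeouts and refused connections
        return f"connect failure: {exc}"
    return "ok"
```

Calling `check_endpoint("sts.eu-west-1.amazonaws.com")` from the pod and from the node should show which layer is affected: a `dns failure` points at DNS, while a `connect failure` with a timeout points at the route table, NAT/Internet gateway, or outbound security group rules.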

RequestError: send request failed\ncaused by: Post \"https://sts.eu-west-1.amazonaws.com/\": dial tcp: i/o timeout"}

Izvi-digibank commented 2 years ago

@bwagner5 I guess you're right. I have now moved the Karpenter controller and webhook to a node with exactly the same configuration as the node in clusterA. The network issues seem to be gone, but I get another error:

2022-02-10T21:35:41.273Z ERROR controller.controller.provisioning Reconciler error {"commit": "2346ed5", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "name": "workflows-provisioner", "namespace": "", "error": "fetching instance types using ec2.DescribeInstanceTypes, WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 4e07a374-081d-4466-8b67-6421e6a3022e"}

This is weird because the Node IAM role has AssumeRoleWithWebIdentity, and it is also defined in the trust policy (pasted above). All other roles and the ConfigMap are configured as explained in the question.
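
For an `AccessDenied` on `sts:AssumeRoleWithWebIdentity`, what matters is whether the service account token matches the trust policy of the role being assumed. One way to check is to decode the token the pod presents and compare its `iss` and `sub` claims with the trust policy. The sketch below assumes the token is a standard JWT (with IRSA its path is normally given by the `AWS_WEB_IDENTITY_TOKEN_FILE` environment variable in the pod); the function names are hypothetical, and the signature is deliberately not verified.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT without verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def expected_condition(claims: dict) -> dict:
    """Build the trust-policy condition entry this token would satisfy."""
    # The condition key is the issuer URL without its scheme, plus ":sub".
    issuer = claims["iss"].split("://", 1)[-1]
    return {f"{issuer}:sub": claims["sub"]}
```

If the key printed by `expected_condition(jwt_claims(token))` does not appear in the role's trust policy exactly as shown, including the `:sub` suffix, that mismatch is a plausible cause of the 403.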

suket22 commented 2 years ago

@Izvi-digibank is this still an issue? Were you able to figure out what was going on here?

Izvi-digibank commented 2 years ago

@suket22 Yes, I needed to separate KarpenterController into two statements.

hedasaurabh commented 2 years ago

@Izvi-digibank Can you post the output here for KarpenterController?

Izvi-digibank commented 2 years ago

My controller is functioning fine now, hence I closed the issue. As I wrote above, all I had to do was separate the KarpenterController into two different statements (I did it through the AWS console, but you can do it with Terraform if you use it).

gaganyaan2 commented 1 year ago

I had the same issue. I also resolved it by separating the Trust relationships statement. Thank you @Izvi-digibank

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::123456789:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/CLUSTER1-RANDOM-NUMBER"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.eu-west-1.amazonaws.com/id/CLUSTER1-RANDOM-NUMBER:sub": "system:serviceaccount:karpenter:karpenter"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::123456789:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/CLUSTER2-RANDOM-NUMBER"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.eu-west-1.amazonaws.com/id/CLUSTER2-RANDOM-NUMBER:sub": "system:serviceaccount:karpenter:karpenter"
                }
            }
        }
    ]
}
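
A likely reason the split helps: keys listed under a single condition operator are ANDed, and a token issued by one cluster only carries claims for its own OIDC issuer, so one statement that requires both clusters' keys can never match. The original policy's condition keys also lacked the `:sub` suffix. For more than two clusters, a policy in this shape can be generated rather than hand-edited. This is a minimal sketch; `build_trust_policy` and its parameters are hypothetical names, and the account ID and issuer IDs are the placeholders from the policy above.

```python
import json


def build_trust_policy(account_id: str, issuers: list, subject: str) -> dict:
    """Build a trust policy with one statement per OIDC issuer."""
    statements = []
    for issuer in issuers:
        statements.append({
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            # One condition key per statement, scoped to this issuer only.
            "Condition": {"StringEquals": {f"{issuer}:sub": subject}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_trust_policy(
    "123456789",
    [
        "oidc.eks.eu-west-1.amazonaws.com/id/CLUSTER1-RANDOM-NUMBER",
        "oidc.eks.eu-west-1.amazonaws.com/id/CLUSTER2-RANDOM-NUMBER",
    ],
    "system:serviceaccount:karpenter:karpenter",
)
print(json.dumps(policy, indent=4))
```

The printed JSON has the same structure as the policy above and can be pasted into the role's trust relationship or fed to whatever tooling manages the role.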