aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

The security token included in the request is invalid in GovCloud #5432

Closed by vstthomas 7 months ago

vstthomas commented 10 months ago

Description

Observed Behavior:

I saw this bug, which looks close, and added a question at the end. No reply so I'm revisiting this issue:

I've Terraformed a VPC/EKS cluster using AWS modules. Everything works as expected.

After that, Karpenter was installed on the cluster via blueprints/addons. Then a NodePool was configured.

Using the same Terraform/process/automation:

  1. In a commercial account, Karpenter works as expected ✅
  2. In a government account, Karpenter has a few issues 🚫
     a. It can't delete the nodes that it creates (which seems like a blueprint/implementation issue).
     b. The security token included in the request is invalid (which seems like a provider issue).

{"level":"ERROR","time":"2023-12-29T01:32:02.192Z","logger":"controller.pricing","message":"retreiving on-demand pricing data, UnrecognizedClientException: The security token included in the request is invalid\n\tstatus code: 400, request id: 5d038231-2332-4aef-b888-dbe037b60dc2; UnrecognizedClientException: The security token included in the request is invalid\n\tstatus code: 400, request id: a0b548d6-4833-4883-84ac-84db45806e7b","commit":"1072d3b"}

Expected Behavior:

Karpenter works the same in GovCloud as it does in a commercial account.

Reproduction Steps (Please include YAML):

Government Partition

  1. Bootstrap VPC/EKS using AWS-provided modules; validate ✅
  2. Include blueprints/addons; validate: mixed bag
  3. Include NodeGroup config (straight from the docs)
  4. Scale up: fail

The token presented to some service is rejected, per the log message above.

I'm not quite sure which service is rejecting the token but I'm willing to work towards the solution.

Versions:

% tf version
Terraform v1.6.3
on darwin_arm64
+ provider registry.terraform.io/hashicorp/aws v5.25.0
+ provider registry.terraform.io/hashicorp/cloudinit v2.3.3
+ provider registry.terraform.io/hashicorp/helm v2.12.1
+ provider registry.terraform.io/hashicorp/kubernetes v2.25.0
+ provider registry.terraform.io/hashicorp/time v0.10.0
+ provider registry.terraform.io/hashicorp/tls v4.0.5

Chart Version:

% helm list -A
NAME                                    NAMESPACE   REVISION    UPDATED                                 STATUS      CHART                                       APP VERSION
aws-fsx-csi-driver                      kube-system 1           2024-01-04 08:46:23.368508 -0800 PST    deployed    aws-fsx-csi-driver-1.7.0                    1.0.0      
aws-load-balancer-controller            kube-system 1           2024-01-04 08:46:31.840446 -0800 PST    deployed    aws-load-balancer-controller-1.6.0          v2.6.0     
external-dns                            kube-system 1           2024-01-04 08:46:59.126065 -0800 PST    deployed    external-dns-1.13.0                         0.13.5     
karpenter                               karpenter   1           2024-01-04 08:46:42.508527 -0800 PST    deployed    karpenter-v0.32.1                           0.32.1     
metrics-server                          kube-system 1           2024-01-04 08:46:10.583084 -0800 PST    deployed    metrics-server-3.11.0                       0.6.4      
secrets-store-csi-driver                kube-system 1           2024-01-04 08:45:53.116647 -0800 PST    deployed    secrets-store-csi-driver-1.3.4              1.3.4      
secrets-store-csi-driver-provider-aws   kube-system 1           2024-01-04 08:46:17.884518 -0800 PST    deployed    secrets-store-csi-driver-provider-aws-0.3.4            
% kubectl version --client=false
Client Version: v1.28.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.4-eks-8cb36c9

jonathan-innis commented 9 months ago

UnrecognizedClientException: The security token included in the request is invalid

From what I remember, this error only comes up when the server doesn't recognize the principal at all. I'm assuming that you are using IRSA, so have you made sure that everything is wired up correctly and that the pod actually has access to a role? It's surprising to me that you would only be seeing this issue in the Gov account, but I also don't think that this issue would be unique to a Gov account versus the standard partition.

jonathan-innis commented 9 months ago

it can't delete the nodes that it creates

I responded to the blueprints issue that you referenced. Since that's more of an issue with blueprints, I'd rather continue the conversation over there so that we can see how we can drive their policy to be closer to our official policy. For now, you should be able to change the defaults that are used in that TerminateInstances ABAC policy with karpenter.irsa_tag_key and karpenter.irsa_tag_values.

vstthomas commented 9 months ago

For now, you should be able to change the defaults that are used in that TerminateInstances ABAC policy with karpenter.irsa_tag_key and karpenter.irsa_tag_values.

I'm Terraforming. I see references in the module to irsa_tag_key and irsa_tag_values, but I don't see anything that explains how to code that up within the context of the blueprints.

Is there an example/docs you could share?

vstthomas commented 9 months ago

The https://github.com/aws-ia/terraform-aws-eks-blueprints-addons/issues/339#issuecomment-1883964653 solution fixes this issue: the log messages about the security token just evaporated.

Added this to the configuration:

module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.12.0" #ensure to update this to the latest/desired version
...
  # --------------------------------------------------------------------------------------------------------------------
  # Auto-Scaling
  # karpenter:   https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/
  # AWS Samples: https://github.com/aws-samples/karpenter-blueprints/blob/main/cluster/terraform/karpenter.tf
  # --------------------------------------------------------------------------------------------------------------------
  enable_karpenter                           = true
  karpenter_enable_spot_termination          = true
  karpenter_enable_instance_profile_creation = true

  karpenter_node = {
    iam_role_use_name_prefix = false
  }

  # Solution from the above issue
  karpenter = {
    irsa_tag_key   = "aws:ResourceTag/kubernetes.io/cluster/${var.cluster_name}"
    irsa_tag_value = "*"
  }
}

The plan added this to the existing policy:

Terraform will perform the following actions:

  # module.eks_blueprints_addons.module.karpenter.aws_iam_policy.this[0] will be updated in-place
  ~ resource "aws_iam_policy" "this" {
        id          = "arn:aws-us-gov:iam::010101010101:policy/karpenter-20240108171158929300000026"
        name        = "karpenter-20240108171158929300000026"
      ~ policy      = jsonencode(
          ~ {
              ~ Statement = [
                    # (5 unchanged elements hidden)
                    {
                        Action   = "eks:DescribeCluster"
                        Effect   = "Allow"
                        Resource = "arn:aws-us-gov:eks:*:010101010101:cluster/gitops-demo-stage"
                    },
                  ~ {
                      ~ Condition = {
                          ~ StringLike = {
                              - "ec2:ResourceTag/Name"                                                    = [
                                  - "*karpenter*",
                                  - "*compute.internal",
                                  - "*ec2.internal",
                                ]
                              + "ec2:ResourceTag/aws:ResourceTag/kubernetes.io/cluster/gitops-demo-stage" = [
                                  + "*karpenter*",
                                  + "*compute.internal",
                                  + "*ec2.internal",
                                ]
                            }
                        }
                        # (3 unchanged attributes hidden)
                    },
                    {
                        Action   = [
                            "sqs:ReceiveMessage",
                            "sqs:GetQueueUrl",
                            "sqs:GetQueueAttributes",
                            "sqs:DeleteMessage",
                        ]
                        Effect   = "Allow"
                        Resource = "arn:aws-us-gov:sqs:us-gov-east-1:010101010101:karpenter-gitops-demo-stage"
                    },
                    # (1 unchanged element hidden)
                ]
                # (1 unchanged attribute hidden)
            }
        )
        tags        = {}
        # (6 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Leaving a resultant policy of:

# module.eks_blueprints_addons.module.karpenter.aws_iam_policy.this[0]:
resource "aws_iam_policy" "this" {
    arn         = "arn:aws-us-gov:iam::010101010101:policy/karpenter-20240108171158929300000026"
    description = "IAM Policy for karpenter"
    id          = "arn:aws-us-gov:iam::010101010101:policy/karpenter-20240108171158929300000026"
    name        = "karpenter-20240108171158929300000026"
    name_prefix = "karpenter-"
    path        = "/"
    policy      = jsonencode(
        {
            Statement = [
                {
                    Action   = [
                        "ec2:DescribeSubnets",
                        "ec2:DescribeSpotPriceHistory",
                        "ec2:DescribeSecurityGroups",
                        "ec2:DescribeLaunchTemplates",
                        "ec2:DescribeInstances",
                        "ec2:DescribeInstanceTypes",
                        "ec2:DescribeInstanceTypeOfferings",
                        "ec2:DescribeImages",
                        "ec2:DescribeAvailabilityZones",
                    ]
                    Effect   = "Allow"
                    Resource = "*"
                },
                {
                    Action   = [
                        "ec2:RunInstances",
                        "ec2:DeleteLaunchTemplate",
                        "ec2:CreateTags",
                        "ec2:CreateLaunchTemplate",
                        "ec2:CreateFleet",
                    ]
                    Effect   = "Allow"
                    Resource = [
                        "arn:aws-us-gov:ec2:us-gov-east-1::image/*",
                        "arn:aws-us-gov:ec2:us-gov-east-1:010101010101:*",
                    ]
                },
                {
                    Action   = "iam:PassRole"
                    Effect   = "Allow"
                    Resource = "arn:aws-us-gov:iam::010101010101:role/karpenter-gitops-demo-stage"
                },
                {
                    Action   = "pricing:GetProducts"
                    Effect   = "Allow"
                    Resource = "*"
                },
                {
                    Action   = "ssm:GetParameter"
                    Effect   = "Allow"
                    Resource = "arn:aws-us-gov:ssm:us-gov-east-1::parameter/*"
                },
                {
                    Action   = "eks:DescribeCluster"
                    Effect   = "Allow"
                    Resource = "arn:aws-us-gov:eks:*:010101010101:cluster/gitops-demo-stage"
                },
                {
                    Action    = "ec2:TerminateInstances"
                    Condition = {
                        StringLike = {
                            "ec2:ResourceTag/aws:ResourceTag/kubernetes.io/cluster/gitops-demo-stage" = [
                                "*karpenter*",
                                "*compute.internal",
                                "*ec2.internal",
                            ]
                        }
                    }
                    Effect    = "Allow"
                    Resource  = "arn:aws-us-gov:ec2:us-gov-east-1:010101010101:instance/*"
                },
                {
                    Action   = [
                        "sqs:ReceiveMessage",
                        "sqs:GetQueueUrl",
                        "sqs:GetQueueAttributes",
                        "sqs:DeleteMessage",
                    ]
                    Effect   = "Allow"
                    Resource = "arn:aws-us-gov:sqs:us-gov-east-1:010101010101:karpenter-gitops-demo-stage"
                },
                {
                    Action   = [
                        "iam:TagInstanceProfile",
                        "iam:RemoveRoleFromInstanceProfile",
                        "iam:GetInstanceProfile",
                        "iam:DeleteInstanceProfile",
                        "iam:CreateInstanceProfile",
                        "iam:AddRoleToInstanceProfile",
                    ]
                    Effect   = "Allow"
                    Resource = "*"
                },
            ]
            Version   = "2012-10-17"
        }
    )
    policy_id   = "ANPAVLGOHKROSQWJKMUQT"
    tags        = {}
    tags_all    = {}
}
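
The plan diff above suggests how the condition key is put together: the module appears to prepend `ec2:ResourceTag/` to whatever `irsa_tag_key` holds. Below is a minimal Python sketch of that composition, inferred only from the diff. The helper name `terminate_condition` is hypothetical and is not part of the blueprints module.

```python
# Hypothetical helper mirroring what the plan diff implies; not the module's code.
EC2_TAG_PREFIX = "ec2:ResourceTag/"

# Values exactly as they appear in the plan output above.
DEFAULT_VALUES = ["*karpenter*", "*compute.internal", "*ec2.internal"]


def terminate_condition(tag_key: str, tag_values: list[str]) -> dict:
    """Build the StringLike condition attached to ec2:TerminateInstances."""
    return {"StringLike": {EC2_TAG_PREFIX + tag_key: list(tag_values)}}


# Key before the change (the default seen on the "-" lines of the diff).
before = terminate_condition("Name", DEFAULT_VALUES)

# Key after the change (the irsa_tag_key value from the configuration above).
after = terminate_condition(
    "aws:ResourceTag/kubernetes.io/cluster/gitops-demo-stage", DEFAULT_VALUES
)

for label, cond in (("before", before), ("after", after)):
    print(label, list(cond["StringLike"]))
```

Under that assumption, a key that already starts with `aws:ResourceTag/` ends up carrying two prefixes, which matches the plan. The value list in the plan is also unchanged, which would be consistent with the module reading `irsa_tag_values` (the name used earlier in this thread) rather than `irsa_tag_value`.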

We can close this one out. Thank you!

jonathan-innis commented 9 months ago

Glad to hear fixing the policy resolved the issue. We're working on getting that fix merged in the EKS blueprints repo so that fewer users hit this in the future.

vstthomas commented 9 months ago

This one has regressed; just noticed another sighting today:

{"level":"ERROR","time":"2024-01-30T23:03:19.803Z","logger":"controller.pricing","message":"retreiving on-demand pricing data, UnrecognizedClientException: The security token included in the request is invalid\n\tstatus code: 400, request id: 6c87f8cb-5104-40de-9063-39184170652f; UnrecognizedClientException: The security token included in the request is invalid\n\tstatus code: 400, request id: 413339af-7dad-4baf-a48f-ae5799bb87a7","commit":"1072d3b"}

Reproduction: https://github.com/VivSoftOrg/reproduction/tree/karpenter-iam

github-actions[bot] commented 8 months ago

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

vstthomas commented 8 months ago

bump

ntman4real commented 8 months ago

same issue here in gov cloud

philgladman commented 5 months ago

Getting the same issue as well, in AWS GovCloud, using Karpenter v0.36.0. There are no GovCloud endpoints for the pricing API, so I am assuming that's why there is an error. To get around this error, we set settings.isolatedVPC: true in the Helm chart.

{"level":"ERROR","time":"","logger":"controller.pricing","message":"updating pricing, retreiving on-demand pricing data, UnrecognizedClientException: The security token included in the request is invalid\n\tstatus code: 400, request id: xxxxxxxxxxx; UnrecognizedClientException: The security token included in the request is invalid\n\tstatus code: 400, request id: xxxxxxxxxx","commit":"6b868db"}

[Screenshot, 2024-05-20]

https://docs.aws.amazon.com/general/latest/gr/billing.html

jonathan-innis commented 3 months ago

There are no GovCloud endpoints for the pricing API, so I am assuming that's why there is an error

When we try to hit the pricing API, we go to the us-east-1 endpoint, which contains information on the pricing in GovCloud. Assuming that the principal and role that you are using here have permission to make the call cross-region in us-east-1, I don't believe that you should be running into this issue. By default, Karpenter's policy does not scope down the region that the principal can make pricing API calls from.

aceat64 commented 2 months ago

Because GovCloud is in a different partition, it's not possible to create an IAM role with cross-region permissions to us-east-1.

https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/partitions.html

You cannot use IAM credentials from one partition to interact with resources in a different partition.
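
The partition boundary quoted above can be illustrated with a small check. This is a sketch only, not how AWS or Karpenter implements anything: the helper names `partition_of_arn`, `partition_of_region`, and `can_call` are hypothetical, the region-prefix mapping covers just the `aws`, `aws-us-gov`, and `aws-cn` partitions, and the role ARN is borrowed from the policy earlier in this thread purely as an example value.

```python
def partition_of_arn(arn: str) -> str:
    """Return the partition field of an ARN (arn:PARTITION:service:region:account:resource)."""
    parts = arn.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[1]


def partition_of_region(region: str) -> str:
    """Map a region name to its partition (simplified; assumption for illustration)."""
    if region.startswith("us-gov-"):
        return "aws-us-gov"
    if region.startswith("cn-"):
        return "aws-cn"
    return "aws"


def can_call(role_arn: str, target_region: str) -> bool:
    """Credentials are only usable inside the partition that issued them."""
    return partition_of_arn(role_arn) == partition_of_region(target_region)


# Example value taken from the policy earlier in the thread.
role = "arn:aws-us-gov:iam::010101010101:role/karpenter-gitops-demo-stage"

print(can_call(role, "us-gov-east-1"))  # True: same partition
print(can_call(role, "us-east-1"))      # False: us-east-1 is in the "aws" partition
```

Read this way, a pricing request sent to us-east-1 with GovCloud credentials crosses partitions, which would fit the UnrecognizedClientException in the logs above.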

michaelfedell commented 1 week ago

Chiming in to note the same issue for our GovCloud installation.

I think the "baked-in" static pricing data is probably good enough for our use case, and it avoids the complexity of setting up IAM user credentials for a commercial partition to make the getPricing call dynamically. However, even when settling for the static pricing, there does not seem to be a way to configure Karpenter to skip the getPricing calls, which results in logs cluttered with the ERROR: ... updating pricing, retreiving on-demand pricing data, UnrecognizedClientException messages.

Would it be possible to include a configuration value for skipping the dynamic pricing updates? It seems that settings.isolatedVPC accomplishes this indirectly, based on the comments above, but it would be nice to have more direct control over just the pricing update calls.
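
Independently of any such setting, the noise can be dropped downstream of Karpenter. Here is a minimal Python sketch, assuming the logs are consumed as JSON lines carrying the `logger` and `message` fields seen in the entries above. `is_pricing_token_noise` and `filter_logs` are hypothetical helpers, not Karpenter features, and the sample lines are abbreviated stand-ins for real log entries.

```python
import json


def is_pricing_token_noise(line: str) -> bool:
    """True for controller.pricing errors caused by the rejected security token."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return False  # keep anything that is not valid JSON
    if not isinstance(entry, dict):
        return False
    return (
        entry.get("logger") == "controller.pricing"
        and "UnrecognizedClientException" in str(entry.get("message", ""))
    )


def filter_logs(lines):
    """Return the lines with the pricing-token errors removed."""
    return [line for line in lines if not is_pricing_token_noise(line)]


# Abbreviated, made-up sample lines in the same shape as the logs in this thread.
sample = [
    '{"level":"ERROR","logger":"controller.pricing","message":"updating pricing, UnrecognizedClientException: The security token included in the request is invalid"}',
    '{"level":"INFO","logger":"controller.provisioner","message":"created nodeclaim"}',
    "plain text line that is not JSON",
]

for line in filter_logs(sample):
    print(line)
```

The filter matches on both the logger name and the exception name, so other `controller.pricing` errors still get through.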