aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Spot interruption events are not delivered to the SQS queue #7209

Open thelateperseus opened 1 month ago

thelateperseus commented 1 month ago

Description

How can the docs be improved?

SQS queue permissions

The CloudFormation sample in the docs includes an SQS queue with an Amazon-managed KMS key for server-side encryption. However, I believe that the SQS queue needs to use a customer-managed KMS key instead.

From the "My events are not delivered to the target Amazon SQS queue" topic in the EventBridge Troubleshooting documentation:

If your Amazon SQS queue is encrypted, you must create a customer-managed KMS key and include the following permission section in your KMS key policy. For more information, see Configuring AWS KMS permissions.

The referenced Configuring AWS KMS permissions page also describes additional permissions for the receiver (Karpenter IAM role).

When using an Amazon-managed key, the FailedInvocations graph for my EventBridge rule exactly matched the Invocations graph, and the Number Of Messages Received graph for the SQS queue was always zero. After switching to a customer-managed KMS key and updating the IAM permissions as documented, the FailedInvocations for the EventBridge rule disappeared, and the SQS Number Of Messages Received graph now shows the expected message counts.

Is this an error with Karpenter's CloudFormation example, or am I doing something wrong?

engedaam commented 1 month ago

What SQS configuration did you apply? Did you use the CloudFormation template provided in our getting started guide? https://github.com/aws/karpenter-provider-aws/blob/main/website/content/en/v1.0/getting-started/getting-started-with-karpenter/cloudformation.yaml

thelateperseus commented 1 month ago

We use Terraform as our IaC tool of choice, so I didn't use the CloudFormation template directly. However, I did translate the CloudFormation template into Terraform HCL. Below is the old configuration which did not work:

# SQS queue used to receive EC2 spot interruption notices
resource "aws_sqs_queue" "karpenter_interruption" {
  name                      = "karpenter-interruption-${aws_eks_cluster.test.name}"
  message_retention_seconds = 300
  sqs_managed_sse_enabled   = true
}

data "aws_iam_policy_document" "karpenter_interruption_queue" {
  statement {
    sid     = "SendEventsToQueue"
    actions = ["sqs:SendMessage"]
    principals {
      type = "Service"
      identifiers = [
        "events.amazonaws.com",
        "sqs.amazonaws.com",
      ]
    }
  }
  statement {
    sid       = "DenyHTTP"
    effect    = "Deny"
    actions   = ["sqs:*"]
    resources = [aws_sqs_queue.karpenter_interruption.arn]
    principals {
      type        = "AWS"
      identifiers = ["*"]
    }
    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = [false]
    }
  }
}

resource "aws_sqs_queue_policy" "karpenter_interruption" {
  queue_url = aws_sqs_queue.karpenter_interruption.id
  policy    = data.aws_iam_policy_document.karpenter_interruption_queue.json
}

Removing sqs_managed_sse_enabled on the aws_sqs_queue and adding kms_master_key_id with a customer-managed key resolved the issue (along with the appropriate IAM permissions as per the docs I linked above).
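For anyone following along, the change described above might look roughly like the sketch below. It is not the exact configuration from this thread: the key policy follows the EventBridge troubleshooting guidance quoted earlier (letting events.amazonaws.com call kms:GenerateDataKey and kms:Decrypt), while the resource names `karpenter_interruption_key` and `karpenter_interruption_consumer`, the account-root administration statement, and the hard-coded `aws` partition are assumptions added for illustration. The queue resource replaces the earlier `aws_sqs_queue.karpenter_interruption` definition.

```hcl
# Assumption: used only to build the account-root ARN for key administration.
data "aws_caller_identity" "current" {}

# Key policy for the customer-managed KMS key (hypothetical name).
data "aws_iam_policy_document" "karpenter_interruption_key" {
  # Assumption: keep the account able to administer the key, so the
  # policy does not lock out its own owner.
  statement {
    sid       = "EnableAccountAdministration"
    actions   = ["kms:*"]
    resources = ["*"]
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
    }
  }

  # Lets EventBridge encrypt messages it sends to the encrypted queue.
  statement {
    sid       = "AllowEventBridgeToUseKey"
    actions   = ["kms:GenerateDataKey", "kms:Decrypt"]
    resources = ["*"]
    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_kms_key" "karpenter_interruption" {
  description = "Encrypts the Karpenter interruption queue"
  policy      = data.aws_iam_policy_document.karpenter_interruption_key.json
}

# Same queue as before, with sqs_managed_sse_enabled removed and a
# customer-managed key set through kms_master_key_id.
resource "aws_sqs_queue" "karpenter_interruption" {
  name                      = "karpenter-interruption-${aws_eks_cluster.test.name}"
  message_retention_seconds = 300
  kms_master_key_id         = aws_kms_key.karpenter_interruption.arn
}

# Receiver side: a policy document granting kms:Decrypt on the key.
# Attaching it to the Karpenter controller role is left out, because the
# role name depends on your setup.
data "aws_iam_policy_document" "karpenter_interruption_consumer" {
  statement {
    sid       = "AllowDecryptInterruptionMessages"
    actions   = ["kms:Decrypt"]
    resources = [aws_kms_key.karpenter_interruption.arn]
  }
}
```

The existing aws_sqs_queue_policy can stay as it is. The consumer policy document still has to be attached to whichever IAM role Karpenter runs under, in addition to the SQS permissions that role already has.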