cloudbees-oss / terraform-aws-cloudbees-ci-eks-addon

CloudBees CI Add-on for AWS EKS
https://registry.terraform.io/modules/cloudbees/cloudbees-ci-eks-addon/aws
MIT License

[Blueprints, 02-at-scale] EKS blueprints Node Termination Handler add-on fails #23

carlosrodlop closed this issue 9 months ago

carlosrodlop commented 11 months ago

I would like to use the Node Termination Handler for the at-scale blueprint, but I am getting the following error:

│ Error: creating IAM Policy (aws-node-termination-handler-20231218170201623700000007): MalformedPolicyDocument: Policy statement must contain resources.
│       status code: 400, request id: cc751055-d447-4206-9046-0afc3546c91c
│ 
│   with module.eks_blueprints_addons.module.aws_node_termination_handler.aws_iam_policy.this[0],
│   on .terraform/modules/eks_blueprints_addons.aws_node_termination_handler/main.tf line 242, in resource "aws_iam_policy" "this":
│  242: resource "aws_iam_policy" "this" {
│ 
╵

The pod is in RUNNING status, and I can see the following in the logs:

2023/12/18 17:07:50 WRN There was a problem monitoring for events error="AccessDenied: User: arn:aws:sts::324005994172:assumed-role/aws-node-termination-handler-20231218170135041900000006/1702918930359930054 is not authorized to perform: sqs:receivemessage on resource: arn:aws:sqs:us-east-1:324005994172:aws-nth-cbci-bp02-i318-eks because no identity-based policy allows the sqs:receivemessage action\n\tstatus code: 403, request id: 8b956325-4f0e-5082-8761-3edf31a835b4" event_type=SQS_MONITOR
wellsiau-aws commented 10 months ago

Terraform plan indicates that the missing resources are for ASGs:

  # module.eks_blueprints_addons.module.aws_node_termination_handler.aws_iam_policy.this[0] will be created
  + resource "aws_iam_policy" "this" {
      + arn         = (known after apply)
      + description = "IAM Policy for AWS Node Termination Handler"
      + id          = (known after apply)
      + name        = (known after apply)
      + name_prefix = "aws-node-termination-handler-"
      + path        = "/"
      + policy      = jsonencode(
            {
              + Statement = [
                  + {
                      + Action   = [
                          + "ec2:DescribeInstances",
                          + "autoscaling:DescribeTags",
                          + "autoscaling:DescribeAutoScalingInstances",
                        ]
                      + Effect   = "Allow"
                      + Resource = "*"
                    },
                  + {
                      + Action = "autoscaling:CompleteLifecycleAction"
                      + Effect = "Allow"
                    },
wellsiau-aws commented 10 months ago

The resources are taken from the variable aws_node_termination_handler_asg_arns, which should be populated along with enable_aws_node_termination_handler.

For example you could do:

  enable_aws_node_termination_handler = true
  aws_node_termination_handler_asg_arns = data.aws_autoscaling_groups.eks_node_groups.arns

where we take the ASG ARNs via a data source:

data "aws_autoscaling_groups" "eks_node_groups" {
  depends_on = [ module.eks ]
  filter {
    name   = "tag-key"
    values = ["eks:cluster-name"]
  }
}
carlosrodlop commented 10 months ago

I reopened this issue. The proposal by @wellsiau-aws works well for apply, but not for destroy. The following error appears:

│ Error: Invalid for_each argument
│ 
│   on .terraform/modules/eks_blueprints_addons/main.tf line 1547, in resource "aws_autoscaling_lifecycle_hook" "aws_node_termination_handler":
│ 1547:   for_each = { for k, v in var.aws_node_termination_handler_asg_arns : k => v if var.enable_aws_node_termination_handler }
│     ├────────────────
│     │ var.aws_node_termination_handler_asg_arns is a list of string, known only after apply
│     │ var.enable_aws_node_termination_handler is true
│ 
│ The "for_each" map includes keys derived from resource attributes that cannot be determined until apply, and so Terraform cannot determine the full set of keys that will identify the instances of this
│ resource.
│ 
│ When working with unknown values in for_each, it's better to define the map keys statically in your configuration and place apply-time results only in the map values.
│ 
│ Alternatively, you could use the -target planning option to first apply only the resources that the for_each value depends on, and then apply a second time to fully converge.

It leaves the Terraform project in a stuck state; it is not possible to run apply now either.

Looking into the Terraform state file, the requested data is there:

{
      "mode": "data",
      "type": "aws_autoscaling_groups",
      "name": "eks_node_groups",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "arns": [
              "arn:aws:autoscaling:us-east-1:324005994172:autoScalingGroup:02142c6d-ed16-44cd-886f-002bd6e6b33d:autoScalingGroupName/eks-mg_cbApps-2024010915313845610000002a-42c6776c-e2ee-4cc3-77e9-8256bceb1b51",
               ...
              "arn:aws:autoscaling:us-east-1:324005994172:autoScalingGroup:fe9e0cec-1048-4619-993a-c177047ec420:autoScalingGroupName/eks-mg_k8sApps_1az-20240109153138455200000026-72c6776c-e2ed-f4d6-1f8b-77d268987c2a"
            ],
            "filter": [
              {
                "name": "tag-key",
                "values": [
                  "eks:cluster-name"
                ]
              }
            ],
            "id": "us-east-1",
            "names": [
              "eks-cbc-aaaaaaaaaa-eks-node-group-v1-xxxxxxxxxxxx",
               ...
             "eks-cbc-bbbbbbbbbbb-eks-node-group-v2-xxxxxxxxxxxx",
            ]
          },
          "sensitive_attributes": []
        }
      ]
    },

The issue was resolved by commenting out aws_node_termination_handler_asg_arns, after which a destroy could be performed:

  enable_aws_node_termination_handler   = false
  #aws_node_termination_handler_asg_arns = data.aws_autoscaling_groups.eks_node_groups.arns
carlosrodlop commented 10 months ago

Looking at the Complete test case (https://github.com/aws-ia/terraform-aws-eks-blueprints-addons/blob/main/tests/complete/main.tf), it seems the right configuration is only valid with self_managed_node_groups, which exposes autoscaling_group_arn as an output. The eks_managed_node_groups submodule does not.

enable_aws_node_termination_handler   = true
aws_node_termination_handler_asg_arns = [for asg in module.eks.self_managed_node_groups : asg.autoscaling_group_arn]
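
For context, a minimal end-to-end sketch of that wiring (module versions, names, and node group values below are illustrative, not taken from this blueprint):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "cbci-bp02" # illustrative name
  cluster_version = "1.27"
  vpc_id          = var.vpc_id
  subnet_ids      = var.private_subnet_ids

  # Self-managed node groups expose autoscaling_group_arn as an output.
  self_managed_node_groups = {
    cb_apps = {
      instance_type = "m5.xlarge"
      min_size      = 1
      max_size      = 3
      desired_size  = 1
    }
  }
}

module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.0"

  cluster_name      = module.eks.cluster_name
  cluster_endpoint  = module.eks.cluster_endpoint
  cluster_version   = module.eks.cluster_version
  oidc_provider_arn = module.eks.oidc_provider_arn

  enable_aws_node_termination_handler = true
  # The number of entries mirrors the static self_managed_node_groups map,
  # so the for_each keys inside the add-on module are known at plan time;
  # only the ARN values resolve after apply.
  aws_node_termination_handler_asg_arns = [
    for asg in module.eks.self_managed_node_groups : asg.autoscaling_group_arn
  ]
}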
wellsiau-aws commented 9 months ago

Good catch. I was taking a shortcut earlier, and after diving deeper I see a few limitations.

It boils down to the managed node group Auto Scaling group ARNs needing to be available deterministically instead of via a data source.

Looking at the EKS module, I can see that only the ASG name is available as an output.

Further down, the aws_eks_node_group resource itself does not expose ASG ARNs as an attribute; this comes down to the EKS API itself.
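
For illustration only (the node group name below is hypothetical), the aws_eks_node_group data source makes the same limitation visible: its resources attribute exposes Auto Scaling group names, but no ARNs:

data "aws_eks_node_group" "example" {
  cluster_name    = module.eks.cluster_name
  node_group_name = "mg_cbApps"
}

output "managed_node_group_asg_names" {
  # resources[*].autoscaling_groups[*] only carries `name`; there is no ARN field.
  value = flatten([
    for r in data.aws_eks_node_group.example.resources : [
      for asg in r.autoscaling_groups : asg.name
    ]
  ])
}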

carlosrodlop commented 9 months ago

It cannot be implemented for Managed Node Groups.