guru1602 commented 3 weeks ago

Describe the Bug

I am using the config below to create a Windows node group with the latest version of the module. The node gets created but fails to join the cluster.

module "worker_label_green" {
  source = "cloudposse/label/null"

  namespace  = var.namespace
  name       = var.name
  stage      = var.stage
  delimiter  = var.delimiter
  attributes = var.attributes
  tags       = merge(var.tags, { "kubernetes.io/cluster/${var.cluster_name}" = "owned" })
}

module "eks_web_node_group_green" {
  source  = "cloudposse/eks-node-group/aws"
  version = "3.1.0"

  enabled = var.green_enabled
  context = module.worker_label_green.context

  instance_types     = var.instance_types
  subnet_ids         = local.worker_subnet_ids
  min_size           = var.min_size
  max_size           = var.max_size
  desired_size       = var.desired_size
  cluster_name       = data.terraform_remote_state.eks_cluster.outputs.eks_cluster_id
  kubernetes_version = var.kubernetes_version == null || var.kubernetes_version == "" ? [data.terraform_remote_state.eks_cluster.outputs.eks_cluster_version] : [var.kubernetes_version]
  kubernetes_labels  = var.labels

  ami_type = var.ami_type

  before_cluster_joining_userdata = [
    data.template_file.pre_eks_worker_nt.rendered
  ]
  after_cluster_joining_userdata = [
    data.template_file.post_eks_worker_nt.rendered
  ]
  kubernetes_taints = [{
    key    = "OS"
    value  = "Windows"
    effect = "NO_SCHEDULE"
  }]

  update_config = [{
    max_unavailable = var.desired_size
  }]

  capacity_type = var.capacity_type

  detailed_monitoring_enabled = true

  node_role_arn                = [data.aws_iam_role.worker_role.arn]
  node_role_cni_policy_enabled = false # We use the Service Account as per best practice

  associated_security_group_ids = [
    data.terraform_remote_state.network.outputs.rancher_sg,
    data.terraform_remote_state.network.outputs.ops_ssh,
    data.terraform_remote_state.eks_cluster.outputs.security_group_id
  ]

  # Enable the Kubernetes cluster auto-scaler to find the auto-scaling group
  cluster_autoscaler_enabled = var.cluster_autoscaler_enabled

  create_before_destroy = true

  node_role_policy_arns = ["arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"]

  block_device_mappings = [
    {
      "delete_on_termination" : true,
      "device_name" : "/dev/xvda",
      "encrypted" : true,
      "volume_size" : 90,
      "volume_type" : "gp3"
    }
  ]

  node_group_terraform_timeouts = [{
    create = "40m"
    update = "40m"
    delete = "20m"
  }]

  # Valid types are "instance", "volume", "elastic-gpu", "spot-instances-request", "network-interface".
  resources_to_tag = var.capacity_type == "SPOT" ? ["instance", "spot-instances-request", "volume", "network-interface"] : ["instance", "volume", "network-interface"]
}

Expected Behavior

The node should join the cluster.

Steps to Reproduce

If you have an existing cluster, try creating a Windows node group in it.

Screenshots

No response

Environment

No response

Additional Context

No response

ChrisMcKee commented 6 days ago

It's failing because the user data script contains the bootstrapper in the middle, but the script stored in the launch template contains the bootstrapper again at the end.

ChrisMcKee commented 5 days ago

The change in how the Windows nodes are assigned has caused this. If the ami_type is defined and AWS is supplying the AMI, it will show in the console as an AMI release version.

The v2 module was fetching the Windows AMI itself, so it was being set as 'custom' and showing the AMI.

The first approach (AWS supplying the AMI) has the advantage that updates to the AMI show in the console, but AWS automatically augments your user data by adding the bootstrapper to the end of your user script in the launch template. This doesn't show in the state when you do your plan.

It's not a huge issue to work around, but it does break the current user script. I assume it does the same for Linux too.

If you have before_cluster_joining_userdata and after_cluster_joining_userdata set and the ami_type is not CUSTOM, AWS will inject the EKSBootstrapScript execution at the end of the user data.
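
Since the injected bootstrapper doesn't show in the plan, one way to confirm the duplication is to look at the user data actually stored in the launch template (for example via the EC2 console or `aws ec2 describe-launch-template-versions`, which returns it base64-encoded). Below is a minimal Python sketch that decodes such a blob and counts bootstrapper invocations. The marker string `& $EKSBootstrapScriptFile` is an assumption about how the call line looks, and the sample user data is made up for illustration, so adjust both to match what your launch template really contains.

```python
import base64

# Assumed marker: the line that invokes the EKS bootstrap script on Windows.
# Change this if your launch template calls the bootstrapper differently.
DEFAULT_MARKER = "& $EKSBootstrapScriptFile"


def count_bootstrap_calls(user_data_b64: str, marker: str = DEFAULT_MARKER) -> int:
    """Decode base64 launch-template user data and count bootstrapper calls."""
    script = base64.b64decode(user_data_b64).decode("utf-8", errors="replace")
    return sum(
        1 for line in script.splitlines() if line.strip().startswith(marker)
    )


# Hypothetical user data: the bootstrapper sits in the middle of the user
# script and then appears again at the end.
sample = "\n".join(
    [
        "<powershell>",
        "Write-Host 'before cluster joining'",
        "& $EKSBootstrapScriptFile -EKSClusterName demo",
        "Write-Host 'after cluster joining'",
        "& $EKSBootstrapScriptFile -EKSClusterName demo",
        "</powershell>",
    ]
)
encoded = base64.b64encode(sample.encode("utf-8")).decode("ascii")

calls = count_bootstrap_calls(encoded)
print(f"bootstrapper invocations found: {calls}")
```

A count above one would be consistent with the bootstrapper running twice, as described above.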

cloudposse / terraform-aws-eks-node-group

Windows node not joining the eks cluster #195
