Closed schwichti closed 2 years ago
Hello @schwichti, I have 2 clusters running with Windows workloads. One was created a week ago, and the other just a few minutes ago. I am not seeing any issues with Windows nodes joining the cluster. Please check the following:
PS C:\Windows\system32> cd c:\Windows\Temp
PS C:\Windows\Temp> type .\InvokeUserdataErrors.log # Should be empty
PS C:\Windows\Temp> type .\InvokeUserdataOutput.log
Initializing AWS default configurations...
Creating/Updating kubeconfig...
Getting cluster information...
Using cluster information to get APIServer Endpoint and Cluster CA.
Initializing default values...
Using EC2 MetaData service to get VPC CIDR Range.
Using cluster information to get Service CIDR.
Creating/Updating EKS CNI plugin config...
Creating/Updating kubelet configuration file...
Registering kublet and kube-proxy services...
Status Name DisplayName
Stopped kubelet kubelet
Stopped kube-proxy kube-proxy
Generating resolvconf file...
Creating resolv directory : c:\etc
Unique Dns servers : 10.1.0.2
Actions : {MSFT_TaskExecAction}
Author :
Date :
Description : EKS Windows Startup task
Documentation :
Principal : MSFT_TaskPrincipal2
SecurityDescriptor :
Settings : MSFT_TaskSettings3
Source :
State : Ready
TaskName : EKS Windows startup task
TaskPath : \
Triggers : {MSFT_TaskBootTrigger}
URI : \EKS Windows startup task
Version :
PSComputerName :
3. Check if kubelet and kube-proxy services are running on the instance.
PS C:\Windows\Temp> Get-Service | findstr kube
4. Check if kubelet / kube-proxy are reporting any errors
PS C:\Windows\Temp> Get-EventLog -LogName EKS -Newest 50
Steps 1-3 look good.
3517 Jan 17 14:04 Error kube-proxy 0 E0117 14:04:23.333960 2244 utils.go:282] Skipping invalid IP:
This is also my configuration
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  version = "v3.11.0"
  name = "${var.infrastructurename}-vpc"
  cidr = var.vpcCidr
  azs = data.aws_availability_zones.available.names
  private_subnets = var.vpcPrivateSubnets
  public_subnets = var.vpcPublicSubnets
  database_subnets = var.vpcDatabaseSubnets
  enable_nat_gateway = true
  single_nat_gateway = true
  create_igw = true
  enable_vpn_gateway = false
  create_egress_only_igw = false
  create_database_subnet_group = true
  create_elasticache_subnet_group = false
  create_redshift_subnet_group = false
  enable_dns_hostnames = true
  enable_dns_support = true
  tags = var.tags
  public_subnet_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_id}" = "shared"
    "kubernetes.io/role/elb" = "1"
  }
  private_subnet_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_id}" = "shared"
    "kubernetes.io/role/internal-elb" = "1"
  }
}
module "eks" {
  source = "git::https://github.com/aws-samples/aws-eks-accelerator-for-terraform.git"
  tenant = local.tenant
  environment = local.environment
  zone = local.zone
  # EKS CLUSTER
  kubernetes_version = var.kubernetesVersion
  vpc_id = module.vpc.vpc_id
  private_subnet_ids = module.vpc.private_subnets # Enter Private Subnet IDs
  create_eks = true
  enable_windows_support = true
  # EKS MANAGED NODE GROUPS
  managed_node_groups = {
    "default" = {
      node_group_name = "default"
      instance_types = [var.linuxNodeSize]
      subnet_ids = module.vpc.private_subnets
      desired_size = var.linuxNodeCountMin
      max_size = var.linuxNodeCountMax
      min_size = var.linuxNodeCountMin
    },
    "execnodes" = {
      node_group_name = "execnodes"
      instance_types = [var.linuxExecutionNodeSize]
      subnet_ids = module.vpc.private_subnets
      desired_size = var.linuxExecutionNodeCountMin
      max_size = var.linuxExecutionNodeCountMax
      min_size = var.linuxExecutionNodeCountMin
    }
  }
  self_managed_node_groups = {
    "windows" = {
      node_group_name = "windows"
      instance_types = [var.windowsNodeSize]
      #create_launch_template = true
      launch_template_os = "windows"
      subnet_ids = module.vpc.public_subnets
      desired_size = var.windowsNodeCountMin
      min_size = var.windowsNodeCountMin
      max_size = var.windowsNodeCountMax
      k8s_labels = {
        "node.kubernetes.io/os" = "windows"
      }
    }
  }
  tags = var.tags
}
module "eks-addons" {
  source = "github.com/aws-samples/aws-eks-accelerator-for-terraform/modules/kubernetes-addons"
  eks_cluster_id = module.eks.eks_cluster_id
  # EKS Addons
  enable_amazon_eks_vpc_cni = true
  enable_amazon_eks_coredns = true
  enable_amazon_eks_kube_proxy = true
  #K8s Add-ons
  enable_aws_load_balancer_controller = true
  enable_cluster_autoscaler = true
  enable_aws_for_fluentbit = true
  enable_ingress_nginx = true
  ingress_nginx_helm_config = {
    values = [templatefile("templates/nginx_values.yaml", {
      internal = var.enable_external_oidc ? "true" : "false",
      scheme = var.enable_external_oidc ? "internal" : "internet-facing"
    })]
  }
  depends_on = [module.eks.managed_node_groups]
}
I tried this example here: https://github.com/aws-samples/aws-eks-accelerator-for-terraform/tree/main/examples/5-eks-cluster-with-windows-support.
PS C:\Windows\Temp> type .\InvokeUserdataOutput.log
Initializing AWS default configurations...
Creating/Updating kubeconfig...
Getting cluster information...
PS C:\Windows\Temp> type .\InvokeUserdataErrors.log
Get-EKSCluster : Name resolution failure attempting to reach service in region eu-central-1 (as supplied to the
-Region parameter or from configured shell default).
Unable to connect to the remote server.
Possible causes:
- The region may be incorrectly specified (did you specify an availability zone?).
- The service may not be available in the region.
- No network connectivity.
See https://docs.aws.amazon.com/general/latest/gr/rande.html for the latest service availability across the AWS
regions.
At C:\Program Files\Amazon\EKS\Start-EKSBootstrap.ps1:133 char:26
+ $global:EKSCluster = Get-EKSCluster -Name $EKSClusterName
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (Amazon.PowerShe...KSClusterCmdlet:GetEKSClusterCmdlet) [Get-EKSCluster], InvalidOperationException
+ FullyQualifiedErrorId : System.Exception,Amazon.PowerShell.Cmdlets.EKS.GetEKSClusterCmdlet
Get-Service | findstr kube
returns nothing.
Working with the 5-eks-cluster-with-windows-support example, I scaled the Windows auto scaling group down and back up. Now I see a different output:
PS C:\Windows\Temp> type .\InvokeUserdataErrors.log
PS C:\Windows\Temp> type .\InvokeUserdataOutput.log
Initializing AWS default configurations...
Creating/Updating kubeconfig...
Getting cluster information...
Using cluster information to get APIServer Endpoint and Cluster CA.
Initializing default values...
Using EC2 MetaData service to get VPC CIDR Range.
Using cluster information to get Service CIDR.
Creating/Updating EKS CNI plugin config...
Creating/Updating kubelet configuration file...
Registering kublet and kube-proxy services...
Status Name DisplayName
------ ---- -----------
Stopped kubelet kubelet
Stopped kube-proxy kube-proxy
Generating resolvconf file...
Creating resolv directory : c:\etc
Unique Dns servers : 10.1.0.2
Actions : {MSFT_TaskExecAction}
Author :
Date :
Description : EKS Windows Startup task
Documentation :
Principal : MSFT_TaskPrincipal2
SecurityDescriptor :
Settings : MSFT_TaskSettings3
Source :
State : Ready
TaskName : EKS Windows startup task
TaskPath : \
Triggers : {MSFT_TaskBootTrigger}
URI : \EKS Windows startup task
Version :
PSComputerName :
PS C:\Windows\Temp> Get-Service | findstr kube
Running kubelet kubelet
Running kube-proxy kube-proxy
PS C:\Windows\Temp> Get-EventLog -LogName EKS -Newest 50
Index Time EntryType Source InstanceID Message
----- ---- --------- ------ ---------- -------
439 Jan 17 19:56 Information kubelet 0 I0117 19:56:02.061343 2748 reconciler.go:157] "Reconciler: start to sync state"
438 Jan 17 19:56 Information kubelet 0 I0117 19:56:01.749350 2748 kubelet_node_status.go:74] "Successfully registered node" node="ip-10-1-10-24.eu-central-1.compute.internal"
437 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.746608 2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
436 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.639559 2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
435 Jan 17 19:56 Information kubelet 0 I0117 19:56:01.554675 2748 apiserver.go:52] "Watching apiserver"
434 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.532131 2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
433 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.428083 2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
432 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.406853 2748 eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-10-1-10-24.eu-central...
431 Jan 17 19:56 Information kubelet 0 I0117 19:56:01.402726 2748 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
430 Jan 17 19:56 Information kubelet 0 I0117 19:56:01.378783 2748 manager.go:600] "Failed to retrieve checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
429 Jan 17 19:56 Information kubelet 0 I0117 19:56:01.378232 2748 kubelet_node_status.go:71] "Attempting to register node" node="ip-10-1-10-24.eu-central-1.compute.internal"
428 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.321929 2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
427 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.305664 2748 csi_plugin.go:291] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: no...
426 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.211198 2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
425 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.207072 2748 kubelet.go:1873] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
424 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.110928 2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
423 Jan 17 19:56 Information kubelet 0 I0117 19:56:01.004199 2748 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="eu-central-1"
422 Jan 17 19:56 Information kubelet 0 I0117 19:56:01.004199 2748 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="eu...
421 Jan 17 19:56 Information kubelet 0 I0117 19:56:01.004199 2748 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="eu-central-1a"
420 Jan 17 19:56 Information kubelet 0 I0117 19:56:01.004199 2748 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="eu-c...
419 Jan 17 19:56 Information kubelet 0 I0117 19:56:01.004199 2748 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="m5n.large"
418 Jan 17 19:56 Information kubelet 0 I0117 19:56:01.004199 2748 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="m5n.large"
417 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.004199 2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
416 Jan 17 19:56 Error kubelet 0 E0117 19:56:01.002779 2748 kubelet.go:1873] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
415 Jan 17 19:56 Error kubelet 0 E0117 19:56:00.952109 2748 nodelease.go:49] "Failed to get node when trying to set owner ref to the node lease" err="nodes \"ip-10-1-10-24.eu-central-1.compute.i...
414 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.942032 2748 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="eu-central-1"
413 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.942032 2748 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="eu...
412 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.942032 2748 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="eu-central-1a"
411 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.942032 2748 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="eu-c...
410 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.942032 2748 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="m5n.large"
409 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.942032 2748 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="m5n.large"
408 Jan 17 19:56 Information kube-proxy 0 I0117 19:56:00.931852 3748 proxier.go:148] "%s" Hns loadbalancer policy resource="hostComputeLoadBalancer" {"ID":"75799e31-695f-4c5b-9a2d-18420760efb0","HostComp...
407 Jan 17 19:56 Information kube-proxy 0 I0117 19:56:00.923068 3748 proxier.go:148] "%s" Hns Endpoint resource="endpointInfo" {}="(MISSING)"
406 Jan 17 19:56 Information kube-proxy 0 I0117 19:56:00.914060 3748 proxier.go:148] "%s" Hns Endpoint resource="endpointInfo" {}="(MISSING)"
405 Jan 17 19:56 Information kube-proxy 0 I0117 19:56:00.907091 3748 proxier.go:148] "%s" Hns loadbalancer policy resource="hostComputeLoadBalancer" {"ID":"b4efc3f3-11a0-4968-b256-32710fb363d8","HostComp...
404 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.905671 2748 desired_state_of_world_populator.go:141] "Desired state populator starts to run"
403 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.902839 2748 volume_manager.go:271] "Starting Kubelet Volume Manager"
402 Jan 17 19:56 Error kubelet 0 E0117 19:56:00.902274 2748 kubelet.go:1873] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: ...
401 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.902274 2748 kubelet.go:1849] "Starting kubelet main sync loop"
400 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.902274 2748 status_manager.go:157] "Starting to sync pod status with apiserver"
399 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.897041 2748 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
398 Jan 17 19:56 Information kube-proxy 0 I0117 19:56:00.897041 3748 proxier.go:148] "%s" Hns Endpoint resource="endpointInfo" {}="(MISSING)"
397 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.879151 2748 server.go:409] "Adding debug handlers to kubelet server"
396 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.879151 2748 server.go:149] "Starting to listen" address="0.0.0.0" port=10250
395 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.876031 2748 server.go:1190] "Started kubelet"
394 Jan 17 19:56 Error kubelet 0 E0117 19:56:00.876031 2748 server.go:1179] "Failed to set rlimit on max file handles" err="SetRLimit unsupported in this platform"
393 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.875470 2748 plugins.go:639] Loaded volume plugin "kubernetes.io/csi"
392 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.875470 2748 plugins.go:639] Loaded volume plugin "kubernetes.io/storageos"
391 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.875470 2748 plugins.go:639] Loaded volume plugin "kubernetes.io/local-volume"
390 Jan 17 19:56 Information kubelet 0 I0117 19:56:00.875470 2748 plugins.go:639] Loaded volume plugin "kubernetes.io/scaleio"
Indeed, the windows node is now added to the example cluster even with the errors. But I still do not know why it does not work for the real cluster.
I changed my VPC settings like this and now the Windows nodes join the cluster. However, I still do not know what the problem was.
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "v3.11.0"
name = "${var.infrastructurename}-vpc"
- cidr = var.vpcCidr // 10.1.0.0/18
+ cidr = var.vpcCidr // 10.1.0.0/16
azs = data.aws_availability_zones.available.names
- private_subnets = var.vpcPrivateSubnets
+ private_subnets = [for k, v in data.aws_availability_zones.available.names : cidrsubnet(var.vpcCidr, 8, local.zones + k)]
- public_subnets = var.vpcPublicSubnets
+ public_subnets = [for k, v in data.aws_availability_zones.available.names : cidrsubnet(var.vpcCidr, 8, k)]
- database_subnets = var.vpcDatabaseSubnets
+ database_subnets = [for k, v in data.aws_availability_zones.available.names : cidrsubnet(var.vpcCidr, 8, local.zones * 2 + k)]
enable_nat_gateway = true
single_nat_gateway = true
create_igw = true
- enable_vpn_gateway = false
- create_egress_only_igw = false
- create_database_subnet_group = true
- create_elasticache_subnet_group = false
- create_redshift_subnet_group = false
enable_dns_hostnames = true
- enable_dns_support = true
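For readers unfamiliar with cidrsubnet, the following Python sketch approximates what the new expressions in the diff evaluate to. It assumes var.vpcCidr is 10.1.0.0/16 (as the comment in the diff says) and that local.zones is 3; that value is not shown anywhere in this thread, so treat it as a guess. The cidrsubnet helper below is a hypothetical re-implementation for illustration only, not Terraform's own code.

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Approximate Terraform's cidrsubnet(prefix, newbits, netnum)."""
    network = ipaddress.ip_network(prefix)
    new_prefix = network.prefixlen + newbits
    if new_prefix > network.max_prefixlen:
        raise ValueError("not enough address bits left for newbits")
    if not 0 <= netnum < 2 ** newbits:
        raise ValueError("netnum does not fit into newbits")
    # Each child network spans this many addresses.
    size = 2 ** (network.max_prefixlen - new_prefix)
    base = int(network.network_address) + netnum * size
    return str(ipaddress.ip_network((base, new_prefix)))


vpc_cidr = "10.1.0.0/16"  # from the comment in the diff
zones = 3                 # ASSUMPTION: value of local.zones, not shown in the thread

public_subnets = [cidrsubnet(vpc_cidr, 8, k) for k in range(zones)]
private_subnets = [cidrsubnet(vpc_cidr, 8, zones + k) for k in range(zones)]
database_subnets = [cidrsubnet(vpc_cidr, 8, zones * 2 + k) for k in range(zones)]

print("public:  ", public_subnets)
print("private: ", private_subnets)
print("database:", database_subnets)
```

Under those assumptions each tier gets three consecutive /24 networks, so the new layout differs from the old one in both the VPC size and the subnet sizes (/24 instead of /22).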
It looks like you used a different version of the Windows support example as your VPC settings were different from what's in the example. Not sure if you used correct settings for CIDR and subnet IDs. From the error log, it's clear there was some networking issue in your earlier VPC. Anyway, glad to see that you've got the example working. We'll close the issue unless you need additional help.
These were the settings for the earlier VPC:
cidr = "10.1.0.0/18"
private_subnets = ["10.1.0.0/22", "10.1.4.0/22", "10.1.8.0/22"]
public_subnets = ["10.1.12.0/22", "10.1.16.0/22", "10.1.20.0/22"]
database_subnets = ["10.1.24.0/22","10.1.28.0/22","10.1.32.0/22"]
I cannot see anything wrong with it.
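One way to double-check these ranges is a small Python sketch like the one below, which only tests that every subnet lies inside the VPC CIDR and that no two subnets overlap. The check_layout helper is a hypothetical name introduced here; it says nothing about routing, NAT, or DNS settings, which could still be involved.

```python
import ipaddress

# Values copied from the earlier VPC settings quoted above.
vpc = ipaddress.ip_network("10.1.0.0/18")
subnets = {
    "private": ["10.1.0.0/22", "10.1.4.0/22", "10.1.8.0/22"],
    "public": ["10.1.12.0/22", "10.1.16.0/22", "10.1.20.0/22"],
    "database": ["10.1.24.0/22", "10.1.28.0/22", "10.1.32.0/22"],
}


def check_layout(vpc, subnets):
    """Return a list of problems: subnets outside the VPC or overlapping each other."""
    problems = []
    nets = [
        (name, ipaddress.ip_network(cidr))
        for name, cidrs in subnets.items()
        for cidr in cidrs
    ]
    for name, net in nets:
        if not net.subnet_of(vpc):
            problems.append(f"{name} subnet {net} is outside {vpc}")
    for i, (name_a, net_a) in enumerate(nets):
        for name_b, net_b in nets[i + 1:]:
            if net_a.overlaps(net_b):
                problems.append(f"{name_a} {net_a} overlaps {name_b} {net_b}")
    return problems


problems = check_layout(vpc, subnets)
print(problems if problems else "no containment or overlap problems found")
```

For the values above the list of problems is empty, which supports the point that the CIDR arithmetic itself was consistent.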
What would be the effect if CLUSTERNAME had a wrong value here:
private_subnet_tags = {"kubernetes.io/role/internal-elb" =1, "kubernetes.io/cluster/CLUSTERNAME"= "shared"}
The tag kubernetes.io/cluster/clustername was used by EKS versions 1.18 or earlier. It should not affect EKS node placement in v1.21. It may be used by the AWS Load Balancer controller if there are multiple AWS services sharing the subnets or multiple EKS clusters are deployed in the same subnets. However, AWS LB controller failures would not prevent EKS nodes from joining a cluster.
How did you set the region for the AWS provider? Was that data "aws_region" "current" {} or something else?
There is always a possibility of a temporary network anomaly affecting the startup of an instance. The logs indicate that either the Windows EC2 instance's instance metadata service was unable to return the correct region, or the PowerShell cmdlet was unable to connect to the EKS service due to a DNS issue. If all of your settings were correct, I would check if there were any outages or service disruptions reported by AWS in the region / AZ where your node was located, at the time of instance startup. You may want to open a support ticket with AWS to dig deeper.
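To narrow down the DNS hypothesis, a check along the following lines could be run from a host in the same subnet that has Python available. This is only a sketch: eks_endpoint and can_resolve are hypothetical helper names, and the hostname pattern eks.<region>.amazonaws.com is the public regional EKS endpoint, which may not be what is used if a private VPC endpoint is configured.

```python
import socket


def eks_endpoint(region: str) -> str:
    """Build the public regional EKS API hostname for a region."""
    return f"eks.{region}.amazonaws.com"


def can_resolve(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves, False on a name resolution failure."""
    try:
        resolver(hostname, 443)
        return True
    except socket.gaierror:
        return False


# Example call (performs a real DNS lookup, so it is left commented out here):
# print(can_resolve(eks_endpoint("eu-central-1")))
```

A False result would point at DNS inside the VPC (for example the resolver at 10.1.0.2 shown in the logs) rather than at the EKS service itself.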
Is it possible that the Windows nodes need to be rebooted before they join the cluster? A colleague of mine confirms the problems with Windows nodes joining.
No, rebooting is not necessary in my experience. Also, if the user data script failed as it did for you, rebooting will not help, as the script will not be executed upon reboot. In that case, you can run the script manually with appropriate parameter substitutions to diagnose the issue, or terminate the nodes so that new nodes will be spun up.
Rebooting just worked for me: after the reboot, the node joined the cluster.
Running the following command on the Windows node also helps:
Start-Service kubelet
Here are some errors and warnings from the log of my colleague, who observes the same problem:
306 Error kube-proxy 0 E0120 14:07:26.400355 2620 utils.go:282] Skipping invalid IP:
298 Warning kube-proxy 0 W0120 14:07:26.381624 2620 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
296 Error kubelet 0 E0120 14:07:22.959495 3984 server.go:292] "Failed to run kubelet" err="failed to run Kubelet: could not init cloud provider \"aws\": unable to determine AWS zone from cloud provider config or EC2 instance metadata: RequestError: send request failed\ncaused by: Get \"http://*********/latest/meta-data/placement/availability-zone\": dial tcp **********: connectex: An established connection was aborted by the software in your host machine."
293 Warning kubelet 0 W0120 14:07:19.018725 3984 plugins.go:105] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a future release
288 Error kubelet 0 E0120 14:07:19.009800 3984 server.go:277] "kubelet running with insufficient permissions" err="Error while checking admin group membership: Error retrieving group ids: The user name could not be found."
75 Warning kube-proxy 0 W0120 14:07:18.769678 2620 server.go:220] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
@schwichti @awsitcloudpro It would be nice to add the troubleshooting tips or findings to the Windows example as a comment and close this issue.
This situation is unsatisfying to me, because I do not know what is going on. The workaround is to reboot the instance or restart the kubelet. If there "is always a possibility of a temporary network anomaly affecting the startup of an instance", can't we implement automatic retries?
I do not experience this problem anymore. So I guess this issue can be closed. However, implementing retries could still be beneficial...
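To make the retry suggestion concrete, here is a minimal sketch of retrying with exponential backoff, written in Python for brevity. It is not how Start-EKSBootstrap.ps1 or the kubelet service currently behaves; retry is a hypothetical helper, and the operation passed to it stands in for whatever step fails transiently (for example the cluster lookup or the instance metadata request seen in the logs).

```python
import time


def retry(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or attempts run out.

    The wait doubles after every failure: base_delay, 2*base_delay, ...
    The last failure is re-raised so the caller still sees the real error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with a stand-in operation that fails twice before succeeding.
calls = {"count": 0}


def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient network failure")
    return "cluster-info"


waits = []
result = retry(flaky_lookup, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

Passing the sleep function as a parameter keeps the sketch testable without real delays; a production version would likely also cap the maximum delay and only retry on errors known to be transient.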
Hi, I followed this example https://github.com/aws-samples/aws-eks-accelerator-for-terraform/tree/main/examples/5-eks-cluster-with-windows-support to add Windows nodes to my EKS cluster. The corresponding auto scaling group has one instance, but that instance does not show up as a node in my EKS cluster. What could be the problem?