aws-ia / terraform-aws-eks-blueprints

Configure and deploy complete EKS clusters.
https://aws-ia.github.io/terraform-aws-eks-blueprints/
Apache License 2.0

self-managed windows nodes don't join EKS cluster #193

Closed schwichti closed 2 years ago

schwichti commented 2 years ago

Hi, I followed this example https://github.com/aws-samples/aws-eks-accelerator-for-terraform/tree/main/examples/5-eks-cluster-with-windows-support to add Windows nodes to my EKS cluster. The corresponding Auto Scaling group has one instance, but that instance does not show up as a node in my EKS cluster. What could be the problem?

awsitcloudpro commented 2 years ago

Hello @schwichti I have 2 clusters running with Windows workloads. One was created a week ago, and the other just a few minutes ago. I am not seeing any issues with Windows nodes joining the cluster. Please check the following:

  1. Login to the Windows instance. (Note: The latest version of this repo attaches AmazonSSMManagedInstanceCore policy to the nodes' IAM role, which was missing earlier. If you're not using the latest version, attach the policy manually to the IAM role of the Windows node so that you can use SSM to login to the instance.)
  2. View the contents of C:\Windows\Temp\InvokeUserdataErrors.log and C:\Windows\Temp\InvokeUserdataOutput.log. Are there any errors? The following are expected contents of these files:
    
    PS C:\Windows\system32> cd c:\Windows\Temp
    PS C:\Windows\Temp> type .\InvokeUserdataErrors.log # Should be empty
    PS C:\Windows\Temp> type .\InvokeUserdataOutput.log
    Initializing AWS default configurations...
    Creating/Updating kubeconfig...
    Getting cluster information...
    Using cluster information to get APIServer Endpoint and Cluster CA.
    Initializing default values...
    Using EC2 MetaData service to get VPC CIDR Range.
    Using cluster information to get Service CIDR.
    Creating/Updating EKS CNI plugin config...
    Creating/Updating kubelet configuration file...
    Registering kublet and kube-proxy services...

    Status   Name               DisplayName
    ------   ----               -----------
    Stopped  kubelet            kubelet
    Stopped  kube-proxy         kube-proxy
    Generating resolvconf file...
    Creating resolv directory : c:\etc
    Unique Dns servers : 10.1.0.2

    Actions            : {MSFT_TaskExecAction}
    Author             :
    Date               :
    Description        : EKS Windows Startup task
    Documentation      :
    Principal          : MSFT_TaskPrincipal2
    SecurityDescriptor :
    Settings           : MSFT_TaskSettings3
    Source             :
    State              : Ready
    TaskName           : EKS Windows startup task
    TaskPath           : \
    Triggers           : {MSFT_TaskBootTrigger}
    URI                : \EKS Windows startup task
    Version            :
    PSComputerName     :


3. Check if kubelet and kube-proxy services are running on the instance.

PS C:\Windows\Temp> Get-Service | findstr kube

4. Check if kubelet / kube-proxy are reporting any errors

PS C:\Windows\Temp> Get-EventLog -LogName EKS -Newest 50

schwichti commented 2 years ago

Steps 1-3 look good.

Step 4 shows me the following error:
3517 Jan 17 14:04  Error       kube-proxy                      0 E0117 14:04:23.333960    2244 utils.go:282] Skipping invalid IP:
schwichti commented 2 years ago

This is also my configuration


module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  version    = "v3.11.0"

  name                            = "${var.infrastructurename}-vpc"
  cidr                            = var.vpcCidr
  azs                             = data.aws_availability_zones.available.names
  private_subnets                 = var.vpcPrivateSubnets
  public_subnets                  = var.vpcPublicSubnets
  database_subnets                = var.vpcDatabaseSubnets
  enable_nat_gateway              = true
  single_nat_gateway              = true
  create_igw                      = true
  enable_vpn_gateway              = false
  create_egress_only_igw          = false
  create_database_subnet_group    = true
  create_elasticache_subnet_group = false
  create_redshift_subnet_group    = false
  enable_dns_hostnames            = true
  enable_dns_support              = true
  tags = var.tags

  public_subnet_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_id}" = "shared"
    "kubernetes.io/role/elb"                        = "1"
  }
  private_subnet_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_id}" = "shared"
    "kubernetes.io/role/internal-elb"               = "1"
  }

}

module "eks" {
    source = "git::https://github.com/aws-samples/aws-eks-accelerator-for-terraform.git"

    tenant            = local.tenant
    environment       = local.environment
    zone              = local.zone

    # EKS CLUSTER
    kubernetes_version       = var.kubernetesVersion
    vpc_id             = module.vpc.vpc_id
    private_subnet_ids = module.vpc.private_subnets   # Enter Private Subnet IDs

    create_eks = true

    enable_windows_support = true

    # EKS MANAGED NODE GROUPS

    managed_node_groups = {
        "default" = {
            node_group_name = "default"
            instance_types  = [var.linuxNodeSize]
            subnet_ids      = module.vpc.private_subnets
            desired_size    = var.linuxNodeCountMin
            max_size        = var.linuxNodeCountMax
            min_size        = var.linuxNodeCountMin
        },
        "execnodes" = {
            node_group_name = "execnodes"
            instance_types  = [var.linuxExecutionNodeSize]
            subnet_ids      = module.vpc.private_subnets
            desired_size    = var.linuxExecutionNodeCountMin
            max_size        = var.linuxExecutionNodeCountMax
            min_size        = var.linuxExecutionNodeCountMin

        }
    }

    self_managed_node_groups = {
    "windows" = {
            node_group_name = "windows"
            instance_types  = [var.windowsNodeSize]
            #create_launch_template      = true
            launch_template_os          = "windows"
            subnet_ids      = module.vpc.public_subnets
            desired_size    = var.windowsNodeCountMin
            min_size        = var.windowsNodeCountMin
            max_size        = var.windowsNodeCountMax

            k8s_labels = {
                "node.kubernetes.io/os" = "windows"
            }

        }}

    tags = var.tags
}

module "eks-addons" {
    source = "github.com/aws-samples/aws-eks-accelerator-for-terraform/modules/kubernetes-addons"

    eks_cluster_id                        = module.eks.eks_cluster_id

    # EKS Addons
    enable_amazon_eks_vpc_cni             = true
    enable_amazon_eks_coredns             = true
    enable_amazon_eks_kube_proxy          = true

    #K8s Add-ons
    enable_aws_load_balancer_controller   = true
    enable_cluster_autoscaler             = true
    enable_aws_for_fluentbit              = true
    enable_ingress_nginx                  = true
    ingress_nginx_helm_config             =  {values = [templatefile("templates/nginx_values.yaml", {
        internal = var.enable_external_oidc? "true": "false", 
        scheme = var.enable_external_oidc? "internal": "internet-facing"})]
    }

    depends_on = [module.eks.managed_node_groups]
}
schwichti commented 2 years ago

I tried this example here: https://github.com/aws-samples/aws-eks-accelerator-for-terraform/tree/main/examples/5-eks-cluster-with-windows-support.

PS C:\Windows\Temp> type .\InvokeUserdataOutput.log
Initializing AWS default configurations...
Creating/Updating kubeconfig...
Getting cluster information...

PS C:\Windows\Temp> type .\InvokeUserdataErrors.log
Get-EKSCluster : Name resolution failure attempting to reach service in region eu-central-1 (as supplied to the
-Region parameter or from configured shell default).
Unable to connect to the remote server.
Possible causes:
        - The region may be incorrectly specified (did you specify an availability zone?).
        - The service may not be available in the region.
        - No network connectivity.
See https://docs.aws.amazon.com/general/latest/gr/rande.html for the latest service availability across the AWS
regions.
At C:\Program Files\Amazon\EKS\Start-EKSBootstrap.ps1:133 char:26
+     $global:EKSCluster = Get-EKSCluster -Name $EKSClusterName
+                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (Amazon.PowerShe...KSClusterCmdlet:GetEKSClusterCmdlet) [Get-EKSCluster], InvalidOperationException
    + FullyQualifiedErrorId : System.Exception,Amazon.PowerShell.Cmdlets.EKS.GetEKSClusterCmdlet
Get-Service | findstr kube

returns nothing.

schwichti commented 2 years ago

Working with the 5-eks-cluster-with-windows-support example, I scaled the Windows Auto Scaling group down and back up. Now I see different output:


PS C:\Windows\Temp> type .\InvokeUserdataErrors.log
PS C:\Windows\Temp> type .\InvokeUserdataOutput.log
Initializing AWS default configurations...
Creating/Updating kubeconfig...
Getting cluster information...
Using cluster information to get APIServer Endpoint and Cluster CA.
Initializing default values...
Using EC2 MetaData service to get VPC CIDR Range.
Using cluster information to get Service CIDR.
Creating/Updating EKS CNI plugin config...
Creating/Updating kubelet configuration file...
Registering kublet and kube-proxy services...

Status   Name               DisplayName
------   ----               -----------
Stopped  kubelet            kubelet
Stopped  kube-proxy         kube-proxy
Generating resolvconf file...
Creating resolv directory : c:\etc
Unique Dns servers : 10.1.0.2

Actions            : {MSFT_TaskExecAction}
Author             :
Date               :
Description        : EKS Windows Startup task
Documentation      :
Principal          : MSFT_TaskPrincipal2
SecurityDescriptor :
Settings           : MSFT_TaskSettings3
Source             :
State              : Ready
TaskName           : EKS Windows startup task
TaskPath           : \
Triggers           : {MSFT_TaskBootTrigger}
URI                : \EKS Windows startup task
Version            :
PSComputerName     :

PS C:\Windows\Temp> Get-Service | findstr kube
Running  kubelet            kubelet
Running  kube-proxy         kube-proxy
PS C:\Windows\Temp> Get-EventLog -LogName EKS -Newest 50

   Index Time          EntryType   Source                 InstanceID Message
   ----- ----          ---------   ------                 ---------- -------
     439 Jan 17 19:56  Information kubelet                         0 I0117 19:56:02.061343    2748 reconciler.go:157] "Reconciler: start to sync state"
     438 Jan 17 19:56  Information kubelet                         0 I0117 19:56:01.749350    2748 kubelet_node_status.go:74] "Successfully registered node" node="ip-10-1-10-24.eu-central-1.compute.internal"
     437 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.746608    2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
     436 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.639559    2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
     435 Jan 17 19:56  Information kubelet                         0 I0117 19:56:01.554675    2748 apiserver.go:52] "Watching apiserver"
     434 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.532131    2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
     433 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.428083    2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
     432 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.406853    2748 eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-10-1-10-24.eu-central...
     431 Jan 17 19:56  Information kubelet                         0 I0117 19:56:01.402726    2748 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
     430 Jan 17 19:56  Information kubelet                         0 I0117 19:56:01.378783    2748 manager.go:600] "Failed to retrieve checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
     429 Jan 17 19:56  Information kubelet                         0 I0117 19:56:01.378232    2748 kubelet_node_status.go:71] "Attempting to register node" node="ip-10-1-10-24.eu-central-1.compute.internal"
     428 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.321929    2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
     427 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.305664    2748 csi_plugin.go:291] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: no...
     426 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.211198    2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
     425 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.207072    2748 kubelet.go:1873] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
     424 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.110928    2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
     423 Jan 17 19:56  Information kubelet                         0 I0117 19:56:01.004199    2748 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="eu-central-1"
     422 Jan 17 19:56  Information kubelet                         0 I0117 19:56:01.004199    2748 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="eu...
     421 Jan 17 19:56  Information kubelet                         0 I0117 19:56:01.004199    2748 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="eu-central-1a"
     420 Jan 17 19:56  Information kubelet                         0 I0117 19:56:01.004199    2748 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="eu-c...
     419 Jan 17 19:56  Information kubelet                         0 I0117 19:56:01.004199    2748 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="m5n.large"
     418 Jan 17 19:56  Information kubelet                         0 I0117 19:56:01.004199    2748 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="m5n.large"
     417 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.004199    2748 kubelet.go:2294] "Error getting node" err="node \"ip-10-1-10-24.eu-central-1.compute.internal\" not found"
     416 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:01.002779    2748 kubelet.go:1873] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
     415 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:00.952109    2748 nodelease.go:49] "Failed to get node when trying to set owner ref to the node lease" err="nodes \"ip-10-1-10-24.eu-central-1.compute.i...
     414 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.942032    2748 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="eu-central-1"
     413 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.942032    2748 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="eu...
     412 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.942032    2748 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="eu-central-1a"
     411 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.942032    2748 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="eu-c...
     410 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.942032    2748 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="m5n.large"
     409 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.942032    2748 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="m5n.large"
     408 Jan 17 19:56  Information kube-proxy                      0 I0117 19:56:00.931852    3748 proxier.go:148] "%s" Hns loadbalancer policy resource="hostComputeLoadBalancer" {"ID":"75799e31-695f-4c5b-9a2d-18420760efb0","HostComp...
     407 Jan 17 19:56  Information kube-proxy                      0 I0117 19:56:00.923068    3748 proxier.go:148] "%s" Hns Endpoint resource="endpointInfo" {}="(MISSING)"
     406 Jan 17 19:56  Information kube-proxy                      0 I0117 19:56:00.914060    3748 proxier.go:148] "%s" Hns Endpoint resource="endpointInfo" {}="(MISSING)"
     405 Jan 17 19:56  Information kube-proxy                      0 I0117 19:56:00.907091    3748 proxier.go:148] "%s" Hns loadbalancer policy resource="hostComputeLoadBalancer" {"ID":"b4efc3f3-11a0-4968-b256-32710fb363d8","HostComp...
     404 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.905671    2748 desired_state_of_world_populator.go:141] "Desired state populator starts to run"
     403 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.902839    2748 volume_manager.go:271] "Starting Kubelet Volume Manager"
     402 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:00.902274    2748 kubelet.go:1873] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: ...
     401 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.902274    2748 kubelet.go:1849] "Starting kubelet main sync loop"
     400 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.902274    2748 status_manager.go:157] "Starting to sync pod status with apiserver"
     399 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.897041    2748 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
     398 Jan 17 19:56  Information kube-proxy                      0 I0117 19:56:00.897041    3748 proxier.go:148] "%s" Hns Endpoint resource="endpointInfo" {}="(MISSING)"
     397 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.879151    2748 server.go:409] "Adding debug handlers to kubelet server"
     396 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.879151    2748 server.go:149] "Starting to listen" address="0.0.0.0" port=10250
     395 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.876031    2748 server.go:1190] "Started kubelet"
     394 Jan 17 19:56  Error       kubelet                         0 E0117 19:56:00.876031    2748 server.go:1179] "Failed to set rlimit on max file handles" err="SetRLimit unsupported in this platform"
     393 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.875470    2748 plugins.go:639] Loaded volume plugin "kubernetes.io/csi"
     392 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.875470    2748 plugins.go:639] Loaded volume plugin "kubernetes.io/storageos"
     391 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.875470    2748 plugins.go:639] Loaded volume plugin "kubernetes.io/local-volume"
     390 Jan 17 19:56  Information kubelet                         0 I0117 19:56:00.875470    2748 plugins.go:639] Loaded volume plugin "kubernetes.io/scaleio"
schwichti commented 2 years ago

Indeed, the Windows node has now joined the example cluster despite the errors. But I still do not know why it does not work for my real cluster.

schwichti commented 2 years ago

I changed my VPC settings like this, and now the Windows nodes join the cluster. However, I still do not know what the problem was.

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  version    = "v3.11.0"

  name                            = "${var.infrastructurename}-vpc"
- cidr                            = var.vpcCidr //10.1.0.0/18
+ cidr                            = var.vpcCidr //10.1.0.0/16
  azs                             = data.aws_availability_zones.available.names
- private_subnets                 = var.vpcPrivateSubnets
+ private_subnets                 = [for k, v in data.aws_availability_zones.available.names : cidrsubnet(var.vpcCidr, 8, local.zones + k)]
- public_subnets                  = var.vpcPublicSubnets
+ public_subnets                  = [for k, v in data.aws_availability_zones.available.names : cidrsubnet(var.vpcCidr, 8, k)]
- database_subnets                = var.vpcDatabaseSubnets
+ database_subnets                = [for k, v in data.aws_availability_zones.available.names : cidrsubnet(var.vpcCidr, 8, local.zones * 2 + k)]
  enable_nat_gateway              = true
  single_nat_gateway              = true
  create_igw                      = true
- enable_vpn_gateway              = false
- create_egress_only_igw          = false
- create_database_subnet_group    = true
- create_elasticache_subnet_group = false
- create_redshift_subnet_group    = false
  enable_dns_hostnames            = true
- enable_dns_support              = true
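
To make the new subnet layout concrete, here is a minimal Python sketch that mimics Terraform's `cidrsubnet()` for IPv4 and evaluates the three `for` expressions from the diff. It assumes `vpcCidr` is `10.1.0.0/16` (per the comment in the diff) and that `local.zones` is 3. The value of `local.zones` is not shown in this thread, so that number is only an assumption, and the helper is a stand-in written for illustration, not Terraform's implementation.

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Approximate Terraform's cidrsubnet(prefix, newbits, netnum) for IPv4."""
    base = ipaddress.ip_network(prefix)
    new_prefix = base.prefixlen + newbits
    if new_prefix > base.max_prefixlen:
        raise ValueError("newbits extends past the end of the address")
    if not 0 <= netnum < 2 ** newbits:
        raise ValueError("netnum does not fit in newbits")
    # Each new subnet spans this many addresses; netnum selects which one.
    size = 2 ** (base.max_prefixlen - new_prefix)
    return f"{base.network_address + netnum * size}/{new_prefix}"


VPC_CIDR = "10.1.0.0/16"  # from the comment in the diff
ZONES = 3                 # assumption: local.zones is not shown in the thread

layout = {
    "public": [cidrsubnet(VPC_CIDR, 8, k) for k in range(ZONES)],
    "private": [cidrsubnet(VPC_CIDR, 8, ZONES + k) for k in range(ZONES)],
    "database": [cidrsubnet(VPC_CIDR, 8, ZONES * 2 + k) for k in range(ZONES)],
}

for group, cidrs in layout.items():
    print(group, cidrs)
```

Under those assumptions the public subnets are `10.1.0.0/24` to `10.1.2.0/24`, the private subnets `10.1.3.0/24` to `10.1.5.0/24`, and the database subnets `10.1.6.0/24` to `10.1.8.0/24`, so the new layout uses /24 subnets where the old one used /22.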
awsitcloudpro commented 2 years ago

It looks like you used a different version of the Windows support example as your VPC settings were different from what's in the example. Not sure if you used correct settings for CIDR and subnet IDs. From the error log, it's clear there was some networking issue in your earlier VPC. Anyway, glad to see that you've got the example working. We'll close the issue unless you need additional help.

schwichti commented 2 years ago

These were the settings for the earlier VPC:

  cidr                 = "10.1.0.0/18" 
  private_subnets      = ["10.1.0.0/22", "10.1.4.0/22", "10.1.8.0/22"]
  public_subnets       = ["10.1.12.0/22", "10.1.16.0/22", "10.1.20.0/22"] 
  database_subnets    = ["10.1.24.0/22","10.1.28.0/22","10.1.32.0/22"] 

I cannot see anything wrong with it.
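
As a quick sanity check of those ranges, the following minimal Python sketch uses the standard `ipaddress` module to confirm that every subnet lies inside the VPC CIDR and that no two subnets overlap. The CIDR values are copied from the settings above; the `check_layout` helper is hypothetical and only covers these two checks, so a clean result does not rule out other networking causes such as routing or DNS.

```python
import ipaddress
from itertools import combinations

# Values copied from the earlier VPC settings in this comment.
vpc = ipaddress.ip_network("10.1.0.0/18")
subnets = {
    "private": ["10.1.0.0/22", "10.1.4.0/22", "10.1.8.0/22"],
    "public": ["10.1.12.0/22", "10.1.16.0/22", "10.1.20.0/22"],
    "database": ["10.1.24.0/22", "10.1.28.0/22", "10.1.32.0/22"],
}


def check_layout(vpc, subnets):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    nets = [
        (f"{group}[{i}]", ipaddress.ip_network(cidr))
        for group, cidrs in subnets.items()
        for i, cidr in enumerate(cidrs)
    ]
    # Every subnet must be fully contained in the VPC range.
    for name, net in nets:
        if not net.subnet_of(vpc):
            problems.append(f"{name} {net} is outside {vpc}")
    # No two subnets may share addresses.
    for (a_name, a), (b_name, b) in combinations(nets, 2):
        if a.overlaps(b):
            problems.append(f"{a_name} {a} overlaps {b_name} {b}")
    return problems


print(check_layout(vpc, subnets) or "no containment or overlap problems found")
```

For the values above the check reports no problems, which is consistent with the address plan itself not being the cause.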

schwichti commented 2 years ago

What would be the effect if CLUSTERNAME had a wrong value here:

private_subnet_tags = {"kubernetes.io/role/internal-elb" =1, "kubernetes.io/cluster/CLUSTERNAME"= "shared"}
awsitcloudpro commented 2 years ago

The tag kubernetes.io/cluster/clustername was used by EKS versions 1.18 or earlier. It should not affect EKS node placement in v1.21. It may be used by the AWS Load Balancer controller if there are multiple AWS services sharing the subnets or multiple EKS clusters are deployed in the same subnets. However, AWS LB controller failures would not prevent EKS nodes from joining a cluster. How did you set region for the AWS provider? Was that data "aws_region" "current" {} or something else?

There is always a possibility of a temporary network anomaly affecting the startup of an instance. The logs indicate that either the Windows EC2 instance's instance metadata service was unable to return the correct region, or the PowerShell cmdlet was unable to connect to the EKS service due to a DNS issue. If all of your settings were correct, I would check if there were any outages or service disruptions reported by AWS in the region / AZ where your node was located, at the time of instance startup. You may want to open a support ticket with AWS to dig deeper.

schwichti commented 2 years ago

Could it be that the Windows nodes need to be rebooted before they join the cluster? A colleague of mine confirms the problems with joining Windows nodes.

awsitcloudpro commented 2 years ago

No, rebooting is not necessary from my experience. Also, if the user data script failed as it did for you, rebooting will not help, as the script will not be executed upon reboot. In that case, you can run the script manually with appropriate parameter substitutions to diagnose the issue or terminate the nodes so that new nodes will be spun up.

schwichti commented 2 years ago

Rebooting just worked for me to let the node join the cluster.

schwichti commented 2 years ago

Running the following command on the Windows node also helps:

Start-Service kubelet
schwichti commented 2 years ago

Here are some errors and warnings from the log of my colleague, who observes the same problem:


  306            Error kube-proxy          0 E0120 14:07:26.400355    2620 utils.go:282] Skipping invalid IP:
  298          Warning kube-proxy          0 W0120 14:07:26.381624    2620 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1
                                             EndpointSlice
  296            Error kubelet             0 E0120 14:07:22.959495    3984 server.go:292] "Failed to run kubelet" err="failed to run Kubelet: could not init cloud provider \"aws\": unable to determine AWS
                                             zone from cloud provider config or EC2 instance metadata: RequestError: send request failed\ncaused by: Get
                                             \"http://*********/latest/meta-data/placement/availability-zone\": dial tcp **********: connectex: An established connection was aborted by the
                                             software in your host machine."
  293          Warning kubelet             0 W0120 14:07:19.018725    3984 plugins.go:105] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a
                                             future release
  288            Error kubelet             0 E0120 14:07:19.009800    3984 server.go:277] "kubelet running with insufficient permissions" err="Error while checking admin group membership: Error retrieving
                                             group ids: The user name could not be found."
   75          Warning kube-proxy          0 W0120 14:07:18.769678    2620 server.go:220] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config
                                             file ASAP.
vara-bonthu commented 2 years ago

@schwichti @awsitcloudpro It would be nice to add the troubleshooting tips or findings to the Windows example as a comment and close this issue.

schwichti commented 2 years ago

This situation is unsatisfying to me, because I do not know what is going on. The workaround is to reboot the instance or restart the kubelet. If there "is always a possibility of a temporary network anomaly affecting the startup of an instance", can't we implement automatic retries?

schwichti commented 2 years ago

I do not experience this problem anymore. So I guess this issue can be closed. However, implementing retries could still be beneficial...
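
For illustration only, the retry idea could look roughly like the following Python sketch of retry with exponential backoff. Nothing in this thread shows the EKS Windows bootstrap script containing such logic, and the names `retry` and `flaky_lookup` are hypothetical; `flaky_lookup` merely stands in for a call that can fail transiently, such as the cluster lookup that failed with a name resolution error earlier in the thread.

```python
import time


def retry(operation, attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or attempts run out.

    Waits base_delay, then 2x, 4x, ... (capped at max_delay) between tries.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as err:  # a real script would catch narrower errors
            last_error = err
            if attempt == attempts - 1:
                break
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Hypothetical stand-in for a transiently failing call: fails twice, then works.
_calls = {"count": 0}


def flaky_lookup():
    _calls["count"] += 1
    if _calls["count"] < 3:
        raise ConnectionError("name resolution failure")
    return "cluster-info"


waits = []  # record the delays instead of actually sleeping in this demo
print(retry(flaky_lookup, sleep=waits.append), waits)
```

The `sleep` parameter is injectable so the backoff schedule can be checked without waiting; a real bootstrap script would keep the default and would probably retry only on network-related errors.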