aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Improve Health checking if Managed Nodegroup failed to join EKS cluster at creation stage #764

Open 0xlen opened 4 years ago

0xlen commented 4 years ago

Community Note

Tell us about your request Currently, a Managed Nodegroup does not re-check its health status if instances failed to join the EKS cluster at the creation stage. I hope the Health Issue checking can be improved (for example, checking again after 5 minutes), or at least that the error message can be improved to mention that a nodegroup that failed at creation has to be deleted and created again.
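As an illustration of the kind of re-check I have in mind, an external watcher could do something like the following. This is only a rough sketch (it reuses the variables from the reproduction steps below, and relies on the fact that a joined node's spec.providerID contains its EC2 instance ID):

# Every 5 minutes, re-evaluate whether the instances listed in the health issue
# have actually registered with the cluster (roughly what I'd like EKS to do itself)
while true; do
  FAILED_IDS=$(aws eks describe-nodegroup --cluster-name $CLUSTER --nodegroup-name $NODE_GROUP_NAME --region us-west-2 \
    | jq -r ".nodegroup.health.issues[].resourceIds[]")
  for ID in $FAILED_IDS; do
    # providerID has the form aws:///<az>/<instance-id>
    if kubectl get nodes -o jsonpath="{.items[*].spec.providerID}" | grep -q "$ID"; then
      echo "$ID has joined the cluster; the health issue could be cleared"
    fi
  done
  sleep 300
done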

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? The first time I created the managed nodegroup, for some reason the instances were unable to join the EKS cluster, and the Health Issue showed the error Instances failed to join the kubernetes cluster.

After fixing the problem, I noticed that the status never changes, even though the instances correctly join the k8s cluster and are able to schedule Pods. This is confusing.

Are you currently working around this issue? The status cannot be changed. You have to delete the managed nodegroup and create a new one.
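Roughly, the workaround looks like this (only a sketch using the AWS CLI waiters; the variables are the same ones defined in the reproduction steps below):

# Delete the failed nodegroup and wait until it is gone
$ aws eks delete-nodegroup --cluster-name $CLUSTER --nodegroup-name $NODE_GROUP_NAME --region us-west-2
$ aws eks wait nodegroup-deleted --cluster-name $CLUSTER --nodegroup-name $NODE_GROUP_NAME --region us-west-2

# Recreate it with the same configuration and wait for it to become ACTIVE
$ aws eks create-nodegroup --cluster-name $CLUSTER --nodegroup-name $NODE_GROUP_NAME --subnets $SUBNETS --node-role $NODE_ROLE --region us-west-2
$ aws eks wait nodegroup-active --cluster-name $CLUSTER --nodegroup-name $NODE_GROUP_NAME --region us-west-2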

How to reproduce

You can follow these steps to replicate the issue:

[Configurations]
------------------------------
# Cluster configurations
CLUSTER=test
CLUSTER_SG=$(aws eks describe-cluster --name $CLUSTER --region us-west-2 | jq -r ".cluster.resourcesVpcConfig.clusterSecurityGroupId")

# Nodegroup configurations
SUBNETS="subnet-AAAAAAAAAA subnet-BBBBBBBB subnet-CCCCCCCC"
NODE_ROLE=arn:aws:iam::XXXXXXXX:role/EKSManagedNodeWorkerRole
NODE_GROUP_NAME=box
------------------------------

1) Temporarily remove the connectivity of the security group used by the control plane and nodegroup

$ aws ec2 revoke-security-group-egress --group-id $CLUSTER_SG --protocol all --port all --cidr 0.0.0.0/0 --region us-west-2

2) Create managed nodegroup

$ aws eks create-nodegroup --cluster-name $CLUSTER --nodegroup-name $NODE_GROUP_NAME --subnets $SUBNETS --node-role $NODE_ROLE --region us-west-2

3) Monitor the nodegroup; after about ~10 minutes, you will see the error message

$ while true; do aws eks describe-nodegroup --cluster-name $CLUSTER --nodegroup-name $NODE_GROUP_NAME --region us-west-2 | jq -r ".nodegroup.health.issues"; sleep 3; done

    [
      {
        "resourceIds": [
          "i-AAAAAAAAAAAAAA",
          "i-BBBBBBBBBBBBBB"
        ],
        "message": "Instances failed to join the kubernetes cluster",
        "code": "NodeCreationFailure"
      }
    ]

4) Then, restore the network setting

$ aws ec2 authorize-security-group-egress --group-id $CLUSTER_SG --protocol all --port all --cidr 0.0.0.0/0 --region us-west-2

5) The Nodes appear and become Ready

$ kubectl get nodes
NAME                                           STATUS   ROLES    AGE     VERSION
ip-192-168-57-130.us-west-2.compute.internal   Ready    <none>   2m14s   v1.14.8-eks-b8860f
ip-192-168-68-87.us-west-2.compute.internal    Ready    <none>   2m11s   v1.14.8-eks-b8860f

6) The Health Issue still shows the error and the status does not flip to ACTIVE or any other status (it stays CREATE_FAILED)

$ aws eks describe-nodegroup --cluster-name $CLUSTER --nodegroup-name $NODE_GROUP_NAME --region us-west-2
{
    "nodegroup": {
        "status": "CREATE_FAILED",
        ...
        "nodegroupName": "box",
        "nodegroupArn": "arn:aws:eks:us-west-2:XXXXXXXX:nodegroup/test/box/fcb83117-9467-b7bf-e6e6-XXXXXXXX",
        "health": {
            "issues": [
                {
                    "resourceIds": [
                        "i-AAAAAAAAAAAAAA",
                        "i-BBBBBBBBBBBBBB"
                    ],
                    "message": "Instances failed to join the kubernetes cluster",
                    "code": "NodeCreationFailure"
                }
            ]
        },
        ...
    }
}
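To make the mismatch obvious, you can compare the number of Ready nodes reported by kubectl with the nodegroup status reported by EKS. This is just a quick check (same variables as above); given the state shown above it reports 2 Ready nodes while the nodegroup status is still CREATE_FAILED:

$ READY=$(kubectl get nodes --no-headers | grep -c " Ready ")
$ STATUS=$(aws eks describe-nodegroup --cluster-name $CLUSTER --nodegroup-name $NODE_GROUP_NAME --region us-west-2 | jq -r ".nodegroup.status")
$ echo "Ready nodes: $READY, nodegroup status: $STATUS"
Ready nodes: 2, nodegroup status: CREATE_FAILED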
nhannguyensy commented 2 years ago

I have had the same issue. Voted.

spkane commented 2 years ago

I have also seen cases where a managed node group creation starts and the EC2 nodes show up in the EKS cluster, but the node group still spins for 20 more minutes and then reports a "Create failed" error message that never gets cleared or updated.

I have a support case open, investigating the underlying cause of the failure.

diogodafs commented 2 years ago

Voted as well

Do you know of any way to manually force a state refresh for the node group without recreating it? I've tried through Terraform, the aws eks CLI, kubectl... Would AWS support be able to do it?

rongsheng-fang commented 2 years ago

Can we get some traction on this issue?

We are also experiencing this issue. We originally had some trouble creating the node groups due to vCPU limits, but the node groups were created and have been functioning since we resolved those issues, except that they are still in CREATE_FAILED status. We need a way to change their status to ACTIVE without re-creating the node groups.