Open 0xlen opened 4 years ago
I have had the same issue. Voted.
I have also seen cases where a managed node group creation starts and the EC2 nodes show up in the EKS cluster, but the node group keeps spinning for another 20 minutes and then reports a "Create failed" error that never gets cleared or updated.
I have a support case open to investigate the underlying cause of the failure.
Voted as well
Do you know of any way to force a manual refresh of the node group's state without recreating it? I've tried Terraform, the AWS EKS CLI, and kubectl... Would AWS support be able to do it?
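I haven't found a supported way to force a status refresh either, but you can at least inspect the recorded health issues programmatically. A minimal sketch of parsing the EKS DescribeNodegroup response shape (`status`, `health.issues[].code/message`); the function itself is just an illustration, not an AWS API:

```python
def summarize_health(nodegroup):
    """Flatten the status and health issues of a DescribeNodegroup
    'nodegroup' object into human-readable lines."""
    lines = [f"status={nodegroup['status']}"]
    for issue in nodegroup.get("health", {}).get("issues", []):
        lines.append(f"{issue['code']}: {issue['message']}")
    return lines


if __name__ == "__main__":
    # Sample response fragment shaped like the DescribeNodegroup API output;
    # the values are made up for illustration.
    sample = {
        "status": "CREATE_FAILED",
        "health": {
            "issues": [
                {
                    "code": "NodeCreationFailure",
                    "message": "Instances failed to join the kubernetes cluster",
                }
            ]
        },
    }
    print("\n".join(summarize_health(sample)))
```

With boto3 you would feed it the real thing, e.g. `summarize_health(boto3.client("eks").describe_nodegroup(clusterName=..., nodegroupName=...)["nodegroup"])` — which, as this thread shows, keeps reporting CREATE_FAILED even after the nodes join.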
Can we get some traction on this issue?
We are also experiencing this issue. We originally had some problems creating the node groups due to vCPU limits, but the node groups were created and have been functioning since we resolved those problems; they are just still in CREATE_FAILED status. We need a way to change their status to ACTIVE without re-creating the node groups.
Tell us about your request
Currently, a managed node group does not re-check its health status if instances fail to join the EKS cluster at creation time. Please either improve the health-issue checking (for example, re-check every 5 minutes), or improve the error message to state that a node group that failed at creation must be deleted and created again.
Which service(s) is this request for? EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? The first time I created a managed node group, it was unable to join the EKS cluster, and the health issue showed the error "Instances failed to join the kubernetes cluster". After fixing the problem, I noticed that the status never changes, even though the instances correctly join the Kubernetes cluster and are able to schedule Pods. This is confusing.
Are you currently working around this issue? The status cannot be changed; you have to delete the managed node group and create a new one.
How to reproduce
You can follow these steps to replicate the issue:
1) Temporarily remove the connectivity of the security group used by the control plane and the node group
2) Create a managed node group
3) Monitor the node group; after about 10 minutes you will see the error message
4) Then restore the network settings
5) The nodes appear and become Ready
6) The health issue still shows the error, and the status never flips to ACTIVE or any other status (it stays CREATE_FAILED)
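The periodic re-check requested above could look something like the following; a minimal sketch, not actual EKS code — `check` stands in for whatever re-evaluates whether the instances have joined the cluster, and the default interval matches the 5-minute suggestion from the issue:

```python
import time
from typing import Callable


def poll_until_healthy(check: Callable[[], bool],
                       interval_s: float = 300.0,  # re-check every 5 minutes
                       max_attempts: int = 12) -> bool:
    """Re-run a health check until it passes or attempts run out.

    Returns True as soon as check() reports healthy, so a node group whose
    instances eventually join the cluster would flip to healthy instead of
    staying stuck in CREATE_FAILED forever.
    """
    for attempt in range(1, max_attempts + 1):
        if check():
            return True
        if attempt < max_attempts:
            time.sleep(interval_s)
    return False
```

For example, with `interval_s=0` and a check that only succeeds on its third call, `poll_until_healthy` returns True after three attempts; with a check that always fails, it gives up after `max_attempts` and returns False.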