Closed matti closed 2 years ago
Maybe related: not cluster size, but 64-core and 96-core instances seem to fail to register even when they are Nitro (#1935) and non-metal (#1929).
Confirmed! When I used only 16-, 32- and 48-vCPU instances this didn't happen: I was able to scale to 675 instances and run 15788 pods with request.cpu=1 (so ~15788 cores) until my quota ran out.
@matti Thanks for checking. Can you please share the logs bundle from the failed node? You can email it to k8s-awscni-triage@amazon.com. Also, just to confirm: `aws-node` never recovers?
@matti Based on the logs you shared below, the VPC CNI isn't able to talk to the API Server, and this is preventing it from initializing correctly. Until the CNI can validate API Server connectivity, it will not copy the CNI config file to `/etc/cni/net.d` and the CNI binaries to `/opt/cni/bin`, resulting in the `kubelet` error messages you saw about networking not being initialized.
```
{"level":"info","ts":"2022-03-20T21:41:58.260Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD ..."}
{"level":"info","ts":"2022-03-20T21:41:58.260Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
```
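A quick way to confirm on an affected node whether `aws-node` ever got far enough to drop its files (a sketch; the `aws-cni` binary name and `*.conflist` pattern are my assumptions about what the plugin installs, and the directory arguments are only there so the check can be pointed at other paths):

```shell
# check_cni: report whether aws-node has copied its config file and plugin
# binary to the host paths mentioned above.
check_cni() {
  cni_dir="${1:-/etc/cni/net.d}"
  bin_dir="${2:-/opt/cni/bin}"
  # Presence of a conflist plus an executable plugin binary indicates
  # the CNI finished initializing; their absence matches this failure mode.
  if ls "$cni_dir"/*.conflist >/dev/null 2>&1 && [ -x "$bin_dir/aws-cni" ]; then
    echo "CNI initialized"
  else
    echo "CNI not initialized"
  fi
}

check_cni
```

On a node stuck in this state, this should print "CNI not initialized" even long after boot.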
We need to check why API Server connectivity is broken for these instances. Also, `kubelet` isn't able to obtain the node info: `kubelet` derives the node info and validates that the node IP provided to it via its config file indeed belongs to the current instance. Looks like it is not able to do that... permission issues?
```
21:34:43.175678 4868 kubelet.go:2294] "Error getting node" err="node \"ip-192-168-91-143.eu-north-1.compute.internal\" not found"
```
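To separate "API Server connectivity is broken" from everything else, a basic reachability probe from the affected node can help (a sketch; the endpoint would come from `aws eks describe-cluster --query cluster.endpoint`, and any HTTP response, even 401/403, still proves the network path works, while a timeout means the node cannot reach the control plane at all):

```shell
# probe_apiserver: hit the cluster endpoint's /healthz and print the HTTP
# status code, or a hint if no endpoint was supplied.
probe_apiserver() {
  if [ -z "$1" ]; then
    echo "no endpoint set"
  else
    # -k: skip TLS verification (probe only); -m 5: fail fast on timeouts.
    curl -sk -m 5 -o /dev/null -w "%{http_code}\n" "$1/healthz"
  fi
}

# On the node, set APISERVER to the EKS cluster endpoint first.
probe_apiserver "${APISERVER:-}"
```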
Also did kube-proxy start up fine on this node?
> Looks like it is not able to do that..permission issues?
@achevuru No idea how it could be a permission issue, since other nodepools work fine. The only thing that is different is that the problematic nodepools use instances of 64cpu/128gb and 96cpu/192gb, which ec2-instance-selector returns here: https://github.com/matti/eksler/blob/e4eba9f1b2bbb5d46930ceaa7d0cdeb520777966/bin/eksler#L348-L357
> Also did kube-proxy start up fine on this node?
@jayanthvn in my opinion it did, how can I be sure?
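One way to check (a sketch; assumes `kubectl` access to the cluster, and reuses the node name from the kubelet error above):

```shell
# Node name taken from the "Error getting node" log line in this issue.
NODE="ip-192-168-91-143.eu-north-1.compute.internal"

if command -v kubectl >/dev/null 2>&1; then
  # Is the kube-proxy pod on this node Running, and is it restarting?
  kubectl -n kube-system get pods -o wide --field-selector "spec.nodeName=$NODE"
  # Do the kube-proxy logs show errors (e.g. failures reaching the API server)?
  kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50
else
  echo "run this where kubectl is configured for the cluster"
fi
```

A kube-proxy pod in `Running` with zero restarts and no errors in its logs would suggest it started fine.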
> Can you please share the logs bundle on the failed node.
I'll post logs maybe tomorrow when I have time again to try this out. Meanwhile I'm pretty sure that you can also replicate this if you start similar clusters 1-3 times and it should happen. eu-north-1 is the region I'm using.
also, I'm constantly creating/deleting clusters and when I use only 8,16,32,48 core instances I have 0 problems.
@matti If you just create a small Nodegroup with either 64cpu/128gb and 96cpu/192gb instances, does it fail? Or do you see these failures only when you scale beyond a certain limit?
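The suggested minimal repro could look something like this (a sketch; the cluster and nodegroup names are placeholders, and `c5a.16xlarge` is one of the 64-vCPU sizes discussed in this thread):

```shell
# 64 vCPU instance size from the problematic nodepools.
NODE_TYPE="c5a.16xlarge"

if command -v eksctl >/dev/null 2>&1; then
  # A tiny nodegroup with only the suspect instance size, to separate
  # "big instance type" failures from "big cluster" failures.
  eksctl create nodegroup \
    --cluster big-instance-test \
    --region eu-north-1 \
    --name repro-64vcpu \
    --node-type "$NODE_TYPE" \
    --nodes 2
else
  echo "eksctl not installed here"
fi
```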
@matti Tried out IPv6 node groups with both 64 vCPU (c5a.16xlarge) and 96 vCPU (c5a.24xlarge) instances and they came up fine for me. Pod scheduling worked as intended as well. We might have to check both `kubelet` and `aws-node` (CNI) logs in your case to see what might be contributing to the problem.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
Issue closed due to inactivity.
What happened:
Launched 12,000 pods, which brought up ~600 instances. The first nodes started okay, but after the Kubernetes master started upgrading (connectivity issues), some of the nodes registered on EKS but failed to start networking. I tried deleting all not-ready nodes, but the replacement nodes also failed to start networking.
Attach logs
eksctl-created IPv6 1.21 cluster (see YAMLs at the end)
these instance types: c5.4xlarge, c5a.8xlarge, ... (so not #1935)
In the kubelet logs we can see that the node did register but fails to init networking. The kubelet logs also show a lot of timeouts.
During this time kubernetes master was temporarily unavailable (scaling up due to 500+ nodes and 7000+ pods ?)
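Pulling those symptoms out of the kubelet logs on a node can be done with something like this (a sketch; the timestamp is from the logs in this issue, and the filter patterns are my guesses at the relevant error strings):

```shell
# Patterns matching the errors seen in this issue: API timeouts and the
# "Error getting node ... not found" / network-not-ready messages.
PATTERN='timeout|not found|network plugin'

if command -v journalctl >/dev/null 2>&1; then
  # Pull kubelet messages from the failure window and keep the last matches.
  journalctl -u kubelet --since "2022-03-20 21:30" --no-pager \
    | grep -iE "$PATTERN" | tail -n 20
else
  echo "journalctl not available here"
fi
```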
Also, `docker ps` shows no containers. Full kubelet logs:
After this I rebooted the instance, and `docker ps` did show containers.
aws-node logs:
and ipamd.log:
What you expected to happen:
Node to become healthy
How to reproduce it (as minimally and precisely as possible):
Environment:
- Kubernetes version (use `kubectl version`): 1.21
- OS (e.g.: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`): Linux ip-192-168-0-194.eu-north-1.compute.internal 5.4.181-99.354.amzn2.x86_64 #1 SMP Wed Mar 2 18:50:46 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux