Azure / acs-engine

WE HAVE MOVED: Please join us at Azure/aks-engine!
https://github.com/Azure/aks-engine
MIT License

Creating a 100 node k8s cluster results in all nodes being not ready #998

Status: Closed (closed by yuvipanda 7 years ago)

yuvipanda commented 7 years ago

Is this a request for help?: Maybe?


Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE


What version of acs-engine?: v0.2.0


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes

What happened: I tried to create a cluster with one agent pool of 100 nodes.

All the nodes came up, but are NotReady. `kubectl describe node` gives me:

```
datahub@k8s-master-28445475-0:~$ kubectl describe node k8s-agentpool1-28445475-86
Name:               k8s-agentpool1-28445475-86
Role:
Labels:             agentpool=agentpool1
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_DS11_v2_Promo
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=westus
                    failure-domain.beta.kubernetes.io/zone=0
                    kubernetes.io/hostname=k8s-agentpool1-28445475-86
                    role=agent
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:
CreationTimestamp:  Fri, 14 Jul 2017 18:04:47 +0000
Phase:
Conditions:
  Type            Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
  ----            ------  -----------------                ------------------               ------                      -------
  OutOfDisk       False   Fri, 14 Jul 2017 18:08:19 +0000  Fri, 14 Jul 2017 18:04:47 +0000  KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure  False   Fri, 14 Jul 2017 18:08:19 +0000  Fri, 14 Jul 2017 18:04:47 +0000  KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Fri, 14 Jul 2017 18:08:19 +0000  Fri, 14 Jul 2017 18:04:47 +0000  KubeletHasNoDiskPressure    kubelet has no disk pressure
  Ready           False   Fri, 14 Jul 2017 18:08:19 +0000  Fri, 14 Jul 2017 18:04:47 +0000  KubeletNotReady             Kubelet failed to get node info: failed to get external ID from cloud provider: compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="OperationNotAllowed" Message="The server rejected the request because too many requests have been received for this subscription.",runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Addresses:          10.240.0.80,k8s-agentpool1-28445475-86
Capacity:
 alpha.kubernetes.io/nvidia-gpu:  0
 cpu:                             2
 memory:                          14359708Ki
 pods:                            110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:  0
 cpu:                             2
 memory:                          14257308Ki
 pods:                            110
System Info:
 Machine ID:                 d51d78572977e1688e52c5fcf9b253e7
 System UUID:                84FA91EB-2376-AB4E-B3A8-B232C5F1D086
 Boot ID:                    93dbd81c-b195-47f1-9234-0386c9b3e101
 Kernel Version:             4.4.0-83-generic
 OS Image:                   Debian GNU/Linux 8 (jessie)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://1.12.6
 Kubelet Version:            v1.6.6
 Kube-Proxy Version:         v1.6.6
ExternalID:                  /subscriptions/316f6b65-662a-4687-82ac-cbf564f7594e/resourceGroups/jh-perf-03/providers/Microsoft.Compute/virtualMachines/k8s-agentpool1-28445475-86
Non-terminated Pods:         (0 in total)
  Namespace  Name  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------  ----  ------------  ----------  ---------------  -------------
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  0 (0%)        0 (0%)      0 (0%)           0 (0%)
Events:
  FirstSeen  LastSeen  Count  From                                 SubObjectPath  Type     Reason                   Message
  ---------  --------  -----  ----                                 -------------  ----     ------                   -------
  3m         3m        1      kubelet, k8s-agentpool1-28445475-86                 Normal   Starting                 Starting kubelet.
  3m         3m        1      kubelet, k8s-agentpool1-28445475-86                 Warning  ImageGCFailed            unable to find data for container /
  3m         3m        1      kubelet, k8s-agentpool1-28445475-86                 Warning  KubeletSetupFailed       Kubelet failed to get node info: failed to get external ID from cloud provider: compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="OperationNotAllowed" Message="The server rejected the request because too many requests have been received for this subscription."
  3m         3m        1      kubelet, k8s-agentpool1-28445475-86                 Normal   NodeHasSufficientDisk    Node k8s-agentpool1-28445475-86 status is now: NodeHasSufficientDisk
  3m         3m        1      kubelet, k8s-agentpool1-28445475-86                 Normal   NodeHasSufficientMemory  Node k8s-agentpool1-28445475-86 status is now: NodeHasSufficientMemory
  3m         3m        1      kubelet, k8s-agentpool1-28445475-86                 Normal   NodeHasNoDiskPressure    Node k8s-agentpool1-28445475-86 status is now: NodeHasNoDiskPressure
```

The important part of that seems to be:

```
Kubelet failed to get node info: failed to get external ID from cloud provider: compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="OperationNotAllowed" Message="The server rejected the request because too many requests have been received for this subscription.",runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
```

It looks like the Azure API has rate limits that are being hit in this case?

What you expected to happen: Cluster comes up and nodes become ready.

How to reproduce it (as minimally and precisely as possible): Create a k8s cluster with 100 nodes in one agent pool.

seanknox commented 7 years ago

Hi @yuvipanda,

Please try using v0.3.0 or later. It adds support for exponential cloud backoff, which was recently merged into Kubernetes upstream.
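For context, exponential backoff simply spaces retries out so a throttled subscription isn't hammered with further requests after a 429. A minimal illustrative sketch of the idea (this is not the actual kubelet or Azure SDK code; `fetch`, the retry count, and the delays are all invented for illustration):

```python
import random
import time

def get_with_backoff(fetch, max_retries=6, base_delay=1.0):
    """Retry a cloud API call with exponential backoff on HTTP 429.

    `fetch` is any callable returning (status_code, body).
    """
    status, body = fetch()
    for attempt in range(max_retries):
        status, body = fetch()
        if status != 429:
            return status, body
        # Back off exponentially (1s, 2s, 4s, ...) with a little jitter
        # so 100 kubelets don't all retry at the same instant.
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
        time.sleep(delay)
    return status, body
```

Without this, every kubelet in a 100-node pool retries immediately and the subscription stays throttled.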

Declare your Kubernetes cluster API model config as you normally would, with the following requirements:

Since Kubernetes excels at binpacking pods onto available instances, vertically scaling VM sizes (more CPU/RAM) is a better approach for expanding cluster capacity than adding more nodes.

As a followup, I'm going to add some documentation in the repo about this.

seanknox commented 7 years ago

Forgot to include a Kubernetes cluster config for 100 nodes: https://github.com/Azure/acs-engine/blob/master/examples/largeclusters/kubernetes.json
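The relevant part of that example is the cloud-provider backoff and rate-limit settings under `kubernetesConfig`. A sketch of what that section looks like (field names and values here are recalled from acs-engine's API model and may differ between versions; check the linked file for your release):

```json
{
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "kubernetesConfig": {
        "cloudProviderBackoff": true,
        "cloudProviderBackoffRetries": 6,
        "cloudProviderBackoffExponent": 1.5,
        "cloudProviderBackoffDuration": 5,
        "cloudProviderBackoffJitter": 1,
        "cloudProviderRatelimit": true,
        "cloudProviderRatelimitQPS": 3,
        "cloudProviderRatelimitBucket": 10
      }
    }
  }
}
```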

yuvipanda commented 7 years ago

Thanks! This was useful. We wanted a large cluster to perf-test our code (https://github.com/jupyterhub/helm-chart/issues/46), so we had to spin up a big one with big machines. We were also maxing out the number of pods per node (the current kubelet default is 110), so if you want more pods than that you have to add more boxes rather than get bigger boxes.
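If pods-per-node rather than CPU/RAM is the limit, the kubelet default of 110 can be raised via the kubelet's `--max-pods` flag. In acs-engine versions that support `kubeletConfig` overrides in the API model, that looks roughly like the following (the override mechanism and value are assumptions to verify against your version; note kubenet's per-node PodCIDR also bounds how many pod IPs a node can hand out):

```json
{
  "kubernetesConfig": {
    "kubeletConfig": {
      "--max-pods": "200"
    }
  }
}
```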