Azure / acs-engine


Kubernetes v1.7.9 master and worker nodes flapping #1821

Closed seanknox closed 5 years ago

seanknox commented 6 years ago

Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE


What version of acs-engine?: master:

commit 9ca92cbe31324dbd2c02093ae97536613b0c1d00 (HEAD -> master)
Author: Alexander Gabert <alex@xoreaxeax.de>
Date:   Tue Nov 21 17:51:31 2017 +0100

    typo intersted (#1808)

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes v1.7.9 (flapping also observed on a v1.8.2 cluster)

What happened:

I'm seeing both master and worker nodes regularly flap between Ready/NotReady on new v1.7.9 clusters with no workload running. The following is output from new v1.7.9 and v1.8.2 clusters over ~25 minutes:

(Screenshot: kubectl get nodes -w output filtered for NotReady, showing master and agent nodes repeatedly transitioning between Ready and NotReady.)
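
Judging by the screenshot filename, the watch was produced with something like the following (a reconstruction, not necessarily the reporter's exact invocation):

# Watch node status changes and highlight NotReady transitions
kubectl get nodes -w | grep --color=auto NotReady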

controller-manager logs don't reveal much, other than confirming that nodes are transitioning to/from a NotReady state.

kube-controller-manager-k8s-master-26603479-2 kube-controller-manager I1122 22:46:17.734930       1 controller_utils.go:285] Recording status change NodeNotReady event message for node k8s-master-26603479-0
kube-controller-manager-k8s-master-26603479-2 kube-controller-manager I1122 22:46:17.735064       1 controller_utils.go:203] Update ready status of pods on node [k8s-master-26603479-0]
kube-controller-manager-k8s-master-26603479-2 kube-controller-manager I1122 22:46:17.735402       1 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"k8s-master-26603479-0", UID:"da7ecb87-cfd3-11e7-b69f-000d3a9476c0", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node k8s-master-26603479-0 status is now: NodeNotReady
kube-controller-manager-k8s-master-26603479-2 kube-controller-manager I1122 22:46:17.874493       1 controller_utils.go:220] Updating ready status of pod kube-addon-manager-k8s-master-26603479-0 to false
kube-controller-manager-k8s-master-26603479-2 kube-controller-manager I1122 22:46:18.086843       1 controller_utils.go:220] Updating ready status of pod kube-apiserver-k8s-master-26603479-0 to false
kube-controller-manager-k8s-master-26603479-2 kube-controller-manager I1122 22:46:18.238899       1 controller_utils.go:220] Updating ready status of pod kube-controller-manager-k8s-master-26603479-0 to false
kube-controller-manager-k8s-master-26603479-2 kube-controller-manager I1122 22:46:18.453973       1 controller_utils.go:220] Updating ready status of pod kube-proxy-j24f7 to false
kube-controller-manager-k8s-master-26603479-2 kube-controller-manager I1122 22:46:18.634162       1 controller_utils.go:220] Updating ready status of pod kube-scheduler-k8s-master-26603479-0 to false
kube-controller-manager-k8s-master-26603479-2 kube-controller-manager I1122 22:49:09.490344       1 controller_utils.go:285] Recording status change NodeNotReady event message for node k8s-agent-26603479-2
kube-controller-manager-k8s-master-26603479-2 kube-controller-manager I1122 22:49:09.490371       1 controller_utils.go:203] Update ready status of pods on node [k8s-agent-26603479-2]
kube-controller-manager-k8s-master-26603479-2 kube-controller-manager I1122 22:49:09.490496       1 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"k8s-agent-26603479-2", UID:"c67e02a4-cfd3-11e7-b87f-000d3a947f4f", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node k8s-agent-26603479-2 status is now: NodeNotReady
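
For reference, logs like the above can be pulled with plain kubectl (the pod name below is from this cluster; substitute your own):

# Find the controller-manager pods (one static pod per master) in kube-system
kubectl -n kube-system get pods | grep kube-controller-manager

# Tail one of them and filter for node readiness transitions
kubectl -n kube-system logs kube-controller-manager-k8s-master-26603479-2 --since=30m | grep -E 'NodeNotReady|NodeReady'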

When a node enters NotReady, its status becomes Unknown, indicating that the kubelet temporarily stopped posting status or couldn't connect to the apiserver:

Name:               k8s-agent-26603479-2
Roles:              agent
Labels:             agentpool=agent
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_D2_v2
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=centralus
                    failure-domain.beta.kubernetes.io/zone=1
                    kubernetes.azure.com/cluster=trestles-centralus-v179-5a15f7c1
                    kubernetes.io/hostname=k8s-agent-26603479-2
                    kubernetes.io/role=agent
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Wed, 22 Nov 2017 14:23:33 -0800
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----                 ------    -----------------                 ------------------                ------              -------
  NetworkUnavailable   False     Wed, 22 Nov 2017 14:24:05 -0800   Wed, 22 Nov 2017 14:24:05 -0800   RouteCreated        RouteController created a route
  OutOfDisk            Unknown   Wed, 22 Nov 2017 14:51:17 -0800   Wed, 22 Nov 2017 14:50:40 -0800   NodeStatusUnknown   Kubelet stopped posting node status.
  MemoryPressure       Unknown   Wed, 22 Nov 2017 14:51:17 -0800   Wed, 22 Nov 2017 14:50:40 -0800   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Wed, 22 Nov 2017 14:51:17 -0800   Wed, 22 Nov 2017 14:50:40 -0800   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Wed, 22 Nov 2017 14:51:17 -0800   Wed, 22 Nov 2017 14:50:40 -0800   NodeStatusUnknown   Kubelet stopped posting node status.
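
Not part of the original report, but a sketch of how one might inspect the kubelet on an affected node, assuming SSH access and the systemd-managed kubelet that acs-engine provisions (unit and container names may differ on other setups):

# On the affected node, e.g. ssh azureuser@k8s-agent-26603479-2
systemctl status kubelet                     # is the service running, or restarting repeatedly?
journalctl -u kubelet --since "30 min ago"   # look for apiserver connection errors or restarts
docker ps -a | grep hyperkube                # acs-engine nodes of this era run the kubelet via a hyperkube container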

How to reproduce it (as minimally and precisely as possible):
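The reporter left the repro steps blank. Purely as an illustrative sketch (placeholder names, keys, and service principal values, not taken from this issue), a comparable Kubernetes 1.7 cluster could be deployed with acs-engine roughly like this:

# Minimal apimodel; dnsPrefix, SSH key, and service principal values are placeholders
cat > kubernetes.json <<'EOF'
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.7"
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "flap-repro",
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      { "name": "agent", "count": 3, "vmSize": "Standard_D2_v2", "availabilityProfile": "AvailabilitySet" }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": { "publicKeys": [ { "keyData": "ssh-rsa AAAA..." } ] }
    },
    "servicePrincipalProfile": {
      "clientId": "<appId>",
      "secret": "<password>"
    }
  }
}
EOF

# Generate ARM templates and deploy them
acs-engine generate kubernetes.json
az group create -n flap-repro -l centralus
az group deployment create -g flap-repro \
  --template-file _output/flap-repro/azuredeploy.json \
  --parameters @_output/flap-repro/azuredeploy.parameters.json

# Then watch for flapping nodes
kubectl get nodes -w | grep --color=auto NotReady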

seanknox commented 6 years ago

@jackfrancis @jchauncey FYI

DonMartin76 commented 6 years ago

Sounds similar to the issues we have had in North Europe for the past couple of weeks. Does this always occur, or only sometimes?

mboersma commented 5 years ago

I'm closing this old issue since acs-engine is deprecated in favor of aks-engine.