choerodon / kubeadm-ansible

kubeadm-ansible is a toolkit for simple and quick installation of a k8s cluster.
http://choerodon.io
Apache License 2.0

Node status stays NotReady #14

Closed: eliu closed this issue 6 years ago

eliu commented 6 years ago

I have deployed a 3-node k8s cluster several times using the latest kubeadm-ansible scripts, and the 2 worker nodes always end up in NotReady status.

Node Status

[root@k8s-master ~] ○ kubectl get node
NAME         STATUS     ROLES     AGE       VERSION
k8s-master   Ready      master    1d        v1.8.5
k8s-node01   NotReady   <none>    1d        v1.8.5
k8s-node02   NotReady   <none>    1d        v1.8.5

k8s-node01 details:

[root@k8s-master ~] ○ kubectl describe node k8s-node01
Name:               k8s-node01
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=k8s-node01
Annotations:        flannel.alpha.coreos.com/backend-data={"VtepMAC":"0a:12:15:65:69:3e"}
                    flannel.alpha.coreos.com/backend-type=vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager=true
                    flannel.alpha.coreos.com/public-ip=192.168.123.156
                    node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Mon, 24 Sep 2018 00:46:05 +0800
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                     Message
  ----             ------    -----------------                 ------------------                ------                     -------
  OutOfDisk        False     Mon, 24 Sep 2018 00:47:12 +0800   Mon, 24 Sep 2018 00:46:03 +0800   KubeletHasSufficientDisk   kubelet has sufficient disk space available
  MemoryPressure   Unknown   Mon, 24 Sep 2018 00:47:12 +0800   Mon, 24 Sep 2018 00:47:58 +0800   NodeStatusUnknown          Kubelet stopped posting node status.
  DiskPressure     Unknown   Mon, 24 Sep 2018 00:47:12 +0800   Mon, 24 Sep 2018 00:47:58 +0800   NodeStatusUnknown          Kubelet stopped posting node status.
  Ready            Unknown   Mon, 24 Sep 2018 00:47:12 +0800   Mon, 24 Sep 2018 00:47:58 +0800   NodeStatusUnknown          Kubelet stopped posting node status.
Addresses:
  InternalIP:  192.168.123.156
  Hostname:    k8s-node01
Capacity:
 cpu:     4
 memory:  7912008Ki
 pods:    110
Allocatable:
 cpu:     3900m
 memory:  6339144Ki
 pods:    110
System Info:
 Machine ID:                 61a1ab8e69e54e74859435e52c8fa778
 System UUID:                499EB1E3-BB93-498F-BE2C-AFAF3B44EF72
 Boot ID:                    def4deec-e4c3-40b4-aad1-cb8f6bd81f87
 Kernel Version:             3.10.0-693.5.2.el7.x86_64
 OS Image:                   CentOS Linux 7 (Core)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://Unknown
 Kubelet Version:            v1.8.5
 Kube-Proxy Version:         v1.8.5
PodCIDR:                     10.233.65.0/24
ExternalID:                  k8s-node01
Non-terminated Pods:         (3 in total)
  Namespace                  Name                      CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                      ------------  ----------  ---------------  -------------
  kube-system                kube-flannel-zqvkw        150m (3%)     300m (7%)   64M (0%)         500M (7%)
  kube-system                kube-proxy-899nh          0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                nginx-proxy-k8s-node01    25m (0%)      300m (7%)   32M (0%)         512M (7%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  175m (4%)     600m (15%)  96M (1%)         1012M (15%)
Events:         <none>

kubelet status on k8s-node01

[root@k8s-node01 ~] ○ systemctl status kubelet -l
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf, 20-kubelet-override.conf
   Active: active (running) since Mon 2018-09-24 00:46:05 CST; 1 day 8h ago
     Docs: http://kubernetes.io/docs/
 Main PID: 61743 (kubelet)
    Tasks: 19
   Memory: 42.1M
   CGroup: /system.slice/kubelet.service
           └─61743 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin --cluster-dns=10.233.0.10 --cluster-domain=cluster.local --authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt --cadvisor-port=4194 --cgroup-driver=systemd --rotate-certificates=true --cert-dir=/var/lib/kubelet/pki --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/choerodon-tools/pause-amd64:3.0 --fail-swap-on=false --hostname-override=k8s-node01 --eviction-hard=memory.available<512Mi,nodefs.available<10Gi,imagefs.available<10Gi --eviction-minimum-reclaim=memory.available=500Mi,nodefs.available=5Gi,imagefs.available=5Gi --eviction-pressure-transition-period=5m0s --system-reserved=cpu=100m,memory=1Gi

Sep 25 09:38:17 k8s-node01 kubelet[61743]: E0925 09:38:17.328812   61743 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:422: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3Dk8s-node01&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Sep 25 09:38:17 k8s-node01 kubelet[61743]: W0925 09:38:17.912621   61743 eviction_manager.go:332] eviction manager: attempting to reclaim nodefs
Sep 25 09:38:17 k8s-node01 kubelet[61743]: I0925 09:38:17.912688   61743 eviction_manager.go:346] eviction manager: must evict pod(s) to reclaim nodefs
Sep 25 09:38:17 k8s-node01 kubelet[61743]: E0925 09:38:17.912704   61743 eviction_manager.go:357] eviction manager: eviction thresholds have been met, but no pods are active to evict
Sep 25 09:38:18 k8s-node01 kubelet[61743]: E0925 09:38:18.328171   61743 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:413: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Sep 25 09:38:18 k8s-node01 kubelet[61743]: E0925 09:38:18.329065   61743 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dk8s-node01&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Sep 25 09:38:18 k8s-node01 kubelet[61743]: E0925 09:38:18.330004   61743 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:422: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3Dk8s-node01&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Sep 25 09:38:19 k8s-node01 kubelet[61743]: E0925 09:38:19.329183   61743 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:413: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Sep 25 09:38:19 k8s-node01 kubelet[61743]: E0925 09:38:19.329978   61743 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dk8s-node01&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Sep 25 09:38:19 k8s-node01 kubelet[61743]: E0925 09:38:19.331145   61743 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:422: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3Dk8s-node01&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
vinkdong commented 6 years ago

@eliu please run curl https://k8s-master:6443 -k on k8s-node01 to check that the firewall is not dropping the request.
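For example, something like this on k8s-node01 (a minimal sketch, not the only way) should show whether any firewall is still active and whether port 6443 is reachable at the TCP level:

[root@k8s-node01 ~] ○ systemctl is-active firewalld          # expect "inactive"/"unknown" if it is really disabled
[root@k8s-node01 ~] ○ iptables -S | grep -iE 'drop|reject'   # any DROP/REJECT rules that could affect 6443
[root@k8s-node01 ~] ○ timeout 3 bash -c '</dev/tcp/k8s-master/6443' && echo "6443 reachable" || echo "6443 blocked"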

eliu commented 6 years ago

@vinkdong firewalld and iptables are disabled on all nodes. Here is the response from curl https://k8s-master:6443 -k (a follow-up check on the node-local proxy is shown after the output):

curl https://k8s-master:6443 -k
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
eliu commented 6 years ago

I found that the kube-proxy and kube-flannel pods for the worker nodes remain in Pending status (see the describe check below the list):

kubectl get pod -n kube-system
NAME                                        READY     STATUS    RESTARTS   AGE
default-http-backend-6dd4d5b7c9-v7f9m       1/1       Running   0          1d
heapster-746d67c7b9-dwnk9                   1/1       Running   0          1d
kube-apiserver-k8s-master                   1/1       Running   0          1d
kube-controller-manager-k8s-master          1/1       Running   0          1d
kube-dns-79d99555df-jhwmh                   3/3       Running   0          1d
kube-flannel-d7vzv                          1/1       Running   0          1d
kube-flannel-ll9sj                          0/1       Pending   0          1d
kube-flannel-zqvkw                          0/1       Pending   0          1d
kube-lego-6f45757db7-65cjb                  1/1       Running   0          1d
kube-proxy-5g7hv                            0/1       Pending   0          1d
kube-proxy-899nh                            0/1       Pending   0          1d
kube-proxy-hhlrc                            1/1       Running   0          1d
kube-scheduler-k8s-master                   1/1       Running   0          1d
kubernetes-dashboard-dc8fcdbc5-mxnx2        1/1       Running   0          1d
nginx-ingress-controller-5d77d4945d-z9hc9   1/1       Running   0          1d
nginx-proxy-k8s-node01                      1/1       Running   0          1d
nginx-proxy-k8s-node02                      1/1       Running   0          1d
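Describing one of the Pending pods should show the scheduler/kubelet events explaining why it is stuck (output not captured here):

[root@k8s-master ~] ○ kubectl describe pod kube-flannel-zqvkw -n kube-system   # check the Events section at the bottom
[root@k8s-master ~] ○ kubectl get events -n kube-system | grep -E 'k8s-node01|k8s-node02'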
vinkdong commented 6 years ago

This usually happens when the kubelet's garbage collection fails to reclaim disk or memory. Check your disk and memory against the eviction thresholds (disk > 10Gi free, memory > 512Mi available by default).
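For example, on each worker node (mount points are the usual defaults; adjust if yours differ):

[root@k8s-node01 ~] ○ df -h / /var/lib/docker /var/lib/kubelet    # nodefs/imagefs need > 10Gi free per the kubelet eviction flags
[root@k8s-node01 ~] ○ free -m                                     # available memory needs to stay above 512Mi
[root@k8s-node01 ~] ○ journalctl -u kubelet | grep -i evict | tail   # eviction manager messages point at the starved resource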