chris-short / rak8s

Stand up a Raspberry Pi based Kubernetes cluster with Ansible
MIT License

Nodes not ready on raspberrypi #28

Closed: hkoessler closed this issue 5 years ago

hkoessler commented 6 years ago

OS running on Ansible host:

Linux Mint 18

Ansible Version (ansible --version):

2.5.1

Uploaded logs showing errors (rak8s/.log/ansible.log):

n/a

Raspberry Pi Hardware Version:

Raspi 3 B

Raspberry Pi OS & Version (cat /etc/os-release):

PRETTY_NAME="Raspbian GNU/Linux 9 (stretch)"
NAME="Raspbian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=raspbian
ID_LIKE=debian

Detailed description of the issue:

I've set up 3 Raspberry Pis with the 2018-03-13-raspbian-stretch-lite.img image and the Ansible scripts from tag 0.1.5 of this repo. After a few reboots kubectl works, but "sudo kubectl get nodes" reports the master node and a worker node as NotReady. On the master node, "kubectl describe node ..." reports the following:

runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized.
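(For reference, a few checks that can help narrow this down on the master; this is just a sketch, not something the playbook runs: whether a CNI config was ever written, and whether the network add-on pods came up.)

$ ls /etc/cni/net.d                      # empty means no CNI config has been written yet
$ sudo kubectl get pods -n kube-system   # the network add-on pods should be Running
$ sudo journalctl -u kubelet -e          # kubelet log, usually repeats the cni error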

After that I ran "sudo kubeadm init" directly on the master node (just to check what happens) and got:

[init] Using Kubernetes version: v1.10.1
[init] Using Authorization modes: [Node RBAC]
[preflight] Running pre-flight checks.
    [WARNING SystemVerification]: docker version is greater than the most recently validated version. Docker version: 18.04.0-ce. Max validated version: 17.03
    [WARNING FileExisting-crictl]: crictl not found in system path Suggestion: go get github.com/kubernetes-incubator/cri-tools/cmd/crictl
[preflight] Some fatal errors occurred:
    [ERROR Port-6443]: Port 6443 is in use
    [ERROR Port-10250]: Port 10250 is in use
    [ERROR Port-10251]: Port 10251 is in use
    [ERROR Port-10252]: Port 10252 is in use
    [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
    [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
    [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
    [ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
    [ERROR Port-2379]: Port 2379 is in use
    [ERROR DirAvailable--var-lib-etcd]: /var/lib/etcd is not empty
WARNING: CPU hardcapping unsupported
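(Most of those preflight errors just mean a control plane is already running on the node: the ports are bound and the static pod manifests already exist. If a manual re-init were really wanted, it would have to be preceded by a reset, roughly as in the sketch below; note that this wipes the node's cluster state, and the playbook passes its own kubeadm options, so re-running the playbook is normally the better route.)

$ sudo kubeadm reset        # tears down the existing control plane on this node
$ sudo kubeadm init         # bare init shown only as a sketch; rak8s runs init with its own options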

By the way: during the setup with the Ansible scripts I also ran into bug #26. I executed the corresponding command directly on my master node, which succeeded, and after that I was able to rerun the playbook. I am running the playbook from a laptop outside the Raspberry Pis.

chris-short commented 6 years ago

Try the latest release and let me know how that goes, please: https://github.com/rak8s/rak8s/releases/tag/v0.2.0

hkoessler commented 6 years ago

Tried to install with two freshly flashed 2018-03-13-raspbian-stretch-lite.img images on my Raspberry Pis, named raspic0 and raspic1. I now use Ansible 2.5.2 on Linux Mint 18.2.

"ansible-playbook cluster.yml" results in


TASK [common : Pass bridged IPv4 traffic to iptables' chains]

fatal: [raspic1]: FAILED! => {"changed": false, "msg": "Failed to reload sysctl: sysctl: cannot stat /proc/sys/net/bridge/bridge-nf-call-iptables: No such file or directory\n"}
fatal: [raspic0]: FAILED! => {"changed": false, "msg": "Failed to reload sysctl: sysctl: cannot stat /proc/sys/net/bridge/bridge-nf-call-iptables: No such file or directory\n"}
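(That sysctl key normally only appears once the br_netfilter kernel module is loaded, so a likely workaround, untested here, is to load it on each Pi before re-running the playbook:)

$ sudo modprobe br_netfilter                                           # makes /proc/sys/net/bridge/* appear
$ echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf   # persist the module across reboots
$ sudo sysctl net.bridge.bridge-nf-call-iptables=1                     # the value the failing task tries to set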

After that I stopped ansible-playbook with CTRL+C and reran it.

The task [common : Pass bridged IPv4 traffic to iptables' chains] succeeded on the second run, but the playbook then failed at [kubeadm : Run Docker Install Script] with:


TASK [kubeadm : Run Docker Install Script]

fatal: [raspic1]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 100, "stderr": "Shared connection to raspic1 closed."}
fatal: [raspic0]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 100, "stderr": "Shared connection to raspic0 closed."}

stdout (identical on both hosts):

# Executing docker install script, commit: 1d31602
+ sh -c apt-get update -qq >/dev/null
+ sh -c apt-get install -y -qq apt-transport-https ca-certificates curl >/dev/null
+ sh -c curl -fsSL "https://download.docker.com/linux/raspbian/gpg" | apt-key add -qq - >/dev/null
Warning: apt-key output should not be parsed (stdout is not a terminal)
+ sh -c echo "deb [arch=armhf] https://download.docker.com/linux/raspbian stretch edge" > /etc/apt/sources.list.d/docker.list
+ [ raspbian = debian ]
+ sh -c apt-get update -qq >/dev/null
+ sh -c apt-get install -y -qq --no-install-recommends docker-ce >/dev/null
E: Sub-process /usr/bin/dpkg returned an error code (1)
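(The install script hides most apt/dpkg output behind -qq, so a sketch of how to surface the real error on one of the Pis: finish any half-configured packages, then retry the docker-ce install without the quiet flags.)

$ sudo dpkg --configure -a                                      # complete interrupted package configuration
$ sudo apt-get install -f                                       # fix broken or missing dependencies
$ sudo apt-get install -y --no-install-recommends docker-ce     # same install, now with visible errors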

hkoessler commented 6 years ago

After that I rebooted raspic0 and raspic1 manually, and a new "ansible-playbook cluster.yml" run succeeded.

But the nodes are still NotReady. If I log into the master (which is my raspic0), "sudo kubectl get nodes" responds with:

NAME      STATUS     ROLES    AGE   VERSION
raspic0   NotReady   master   6m    v1.10.2
raspic1   NotReady   <none>   5m    v1.10.2

Additionally, the master still shows the following message when queried with "sudo kubectl describe node raspic0":

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  OutOfDisk        False   Tue, 01 May 2018 21:59:06 +0000   Tue, 01 May 2018 21:52:12 +0000   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Tue, 01 May 2018 21:59:06 +0000   Tue, 01 May 2018 21:52:12 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 01 May 2018 21:59:06 +0000   Tue, 01 May 2018 21:52:12 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 01 May 2018 21:59:06 +0000   Tue, 01 May 2018 21:52:12 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Tue, 01 May 2018 21:59:06 +0000   Tue, 01 May 2018 21:52:12 +0000   KubeletNotReady              runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized.

WARNING: CPU hardcapping unsupported
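(rak8s uses Weave Net as the pod network, as the weave interfaces shown later in this thread confirm, so when a node is stuck in this state it is worth checking whether the weave-net pods actually came up on it. A sketch, assuming the labels from the stock Weave Net manifest:)

$ sudo kubectl -n kube-system get pods -o wide              # look for weave-net and kube-proxy pods on each node
$ sudo kubectl -n kube-system get pods -l name=weave-net    # 'name=weave-net' is the label used by the stock manifest
$ ls /etc/cni/net.d                                         # a Weave config file should appear here once the pod starts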

hkoessler commented 6 years ago

That means that the latest master version didn't fix that problem.

tedsluis commented 6 years ago

I have had this occasionally. Just reboot the cluster and it will work.
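(A quick way to bounce every node from the Ansible host could look like the sketch below; it assumes the same inventory used for the playbook and passwordless sudo on the Pis.)

$ ansible all -i inventory -b -m shell -a "shutdown -r now"   # adjust the inventory path to your setup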


chris-short commented 6 years ago

One of the things we definitely need to work on is reliability and consistency. Sadly, I don't have a cluster to test with and the cluster I created with rak8s is actively in use doing things here for me. Pretty sure I'd be accosted for buying a three node cluster for testing.


tedsluis commented 6 years ago

To investigate this issue I deployed a fresh single-node cluster five times this afternoon. One attempt failed on kubeadm init (timeout, probably caused by a slow internet connection). On all other attempts the playbook finished successfully, but on one occasion the node was NotReady. In that case 'systemctl status kubelet' reported 'network plugin is not ready: cni config uninitialized.' The issue was resolved after a reboot, just as I had seen before.

One thing I noticed: when I delete an existing cluster using 'kubeadm reset', I need to reboot the node (for this test I used only one node); otherwise a clean install (using 'ansible-playbook cluster.yml') ends up with the node NotReady due to 'network plugin is not ready: cni config uninitialized.' In the case where my test failed, I had forgotten to reboot the node after manually removing the cluster via 'kubeadm reset'.

If you remove a cluster via "kubeadm reset", "ip address" still displays all the Kubernetes Weave networks:

$ kubeadm reset
[preflight] Running pre-flight checks.
[reset] Stopping the kubelet service.
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Removing kubernetes-managed containers.
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes /var/lib/etcd]
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether b8:27:eb:cf:d0:f3 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether b8:27:eb:9a:85:a6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.51.67/24 brd 192.168.51.255 scope global wlan0
       valid_lft forever preferred_lft forever
    inet6 fe80::8fe7:a366:5c7b:1cd0/64 scope link 
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:3c:da:1b:3a brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
5: datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 96:6c:d8:62:af:8f brd ff:ff:ff:ff:ff:ff
    inet 169.254.148.180/16 brd 169.254.255.255 scope global datapath
       valid_lft forever preferred_lft forever
    inet6 fe80::c89b:530d:49c3:5f8c/64 scope link 
       valid_lft forever preferred_lft forever
7: weave: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state DORMANT group default qlen 1000
    link/ether 12:dd:ea:1e:56:87 brd ff:ff:ff:ff:ff:ff
    inet 10.32.0.1/12 brd 10.47.255.255 scope global weave
       valid_lft forever preferred_lft forever
8: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 66:1c:af:dc:30:32 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::323d:1767:5891:a52b/64 scope link 
       valid_lft forever preferred_lft forever
10: vethwe-datapath@vethwe-bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master datapath state UP group default 
    link/ether ca:10:1f:89:5d:d5 brd ff:ff:ff:ff:ff:ff
    inet 169.254.205.166/16 brd 169.254.255.255 scope global vethwe-datapath
       valid_lft forever preferred_lft forever
    inet6 fe80::c810:1fff:fe89:5dd5/64 scope link 
       valid_lft forever preferred_lft forever
11: vethwe-bridge@vethwe-datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether 3e:ad:99:80:83:c6 brd ff:ff:ff:ff:ff:ff
    inet 169.254.62.104/16 brd 169.254.255.255 scope global vethwe-bridge
       valid_lft forever preferred_lft forever
    inet6 fe80::3cad:99ff:fe80:83c6/64 scope link 
       valid_lft forever preferred_lft forever
12: vxlan-6784: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65535 qdisc noqueue master datapath state UNKNOWN group default qlen 1000
    link/ether b2:7b:14:3c:1d:d6 brd ff:ff:ff:ff:ff:ff
    inet 169.254.5.32/16 brd 169.254.255.255 scope global vxlan-6784
       valid_lft forever preferred_lft forever
    inet6 fe80::b07b:14ff:fe3c:1dd6/64 scope link 
       valid_lft forever preferred_lft forever

After a reboot those networks will be gone:

$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether b8:27:eb:cf:d0:f3 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether b8:27:eb:9a:85:a6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.51.67/24 brd 192.168.51.255 scope global wlan0
       valid_lft forever preferred_lft forever
    inet6 fe80::8fe7:a366:5c7b:1cd0/64 scope link 
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:3e:39:3a:54 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever

So my advice is to reboot the nodes before you do a second install, or whenever you run into the "node not ready" / "network plugin is not ready: cni config uninitialized" issue.
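(If a full reboot is not convenient, an untested alternative might be to delete the leftover Weave interfaces by hand after 'kubeadm reset'; a reboot remains the safer option.)

$ sudo ip link delete weave           # the Weave bridge
$ sudo ip link delete vethwe-bridge   # deleting one end of a veth pair removes both ends
$ sudo ip link delete vxlan-6784      # the vxlan tunnel device
$ sudo ip link delete datapath        # may refuse; if so, a reboot is the only clean way to clear it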

chris-short commented 6 years ago

If you could put that on the discourse site that'd be amazing! https://discourse.rak8s.io/


tedsluis commented 6 years ago

Quote chris-short: "One of the things we definitely need to work on is reliability and consistency. Sadly, I don't have a cluster to test with and the cluster I created with rak8s is actively in use doing things here for me. Pretty sure I'd be accosted for buying a three node cluster for testing."

Yes, testing is definitely important. I am looking into the Kubernetes end-to-end tests.

I will try to set them up. I can run tests on a 3-node cluster.
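(In lieu of the full e2e suite, a minimal smoke test on the Pis could look like the sketch below, reusing an ARM-friendly image such as hypriot/rpi-busybox-httpd, which also appears elsewhere in this thread.)

$ sudo kubectl run smoke --image=hypriot/rpi-busybox-httpd --replicas=3 --port=80   # schedules pods on the workers
$ sudo kubectl get pods -o wide                                                     # all pods should reach Running with pod IPs
$ sudo kubectl delete deployment smoke                                              # clean up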

hkoessler commented 6 years ago

Hi chris-short, hi tedsluis: I used the latest version of the development branch and now all the nodes are running. But if I try to deploy a pod/service, it doesn't respond.

E.g. the commands:

$ sudo kubectl run hypriot --image=hypriot/rpi-busybox-httpd --replicas=3 --port=80
$ sudo kubectl get endpoints hypriot

result in the following:

$ sudo kubectl get endpoints hypriot
NAME      ENDPOINTS                                    AGE
hypriot   172.30.1.6:80,172.30.2.4:80,172.30.3.5:80    11m

but "curl 172.30.1.6:80" doesn't give a result.

On the other hand, checking the services results in:

$ sudo kubectl get svc
NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
hypriot   ClusterIP   10.99.49.200   <none>        80/TCP    13m

That means the cluster IP is in a completely different network than the endpoints. Is that normal behavior? What should I check in order to find out why the container doesn't respond to the curl?
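(For reference: endpoint addresses come from the pod network while the ClusterIP comes from the service CIDR, so the two being in different ranges is normal. A couple of checks that might narrow down why curl gets no answer, as a sketch:)

$ sudo kubectl get pods -o wide          # are the hypriot pods actually Running, and on which nodes?
$ sudo kubectl describe svc hypriot      # the endpoints listed here should match the running pods
$ curl 10.99.49.200                      # pod and cluster IPs are only reachable from cluster nodes, not from outside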

chris-short commented 5 years ago

I'm going to close this. Please grab the latest version and try again. If there are bugs, please submit them.

rushins commented 5 years ago

I had the same issue where the nodes are always "NotReady", and I am on the latest 1.12.2. Any help?

chris-short commented 5 years ago

Try an older version. https://github.com/rak8s/rak8s/tree/v0.2.1