Capgemini / kubeform

Form your :boat: Kubernetes :anchor: cluster anywhere using CoreOS, Terraform and Ansible
https://capgemini.github.io/kubeform
MIT License

Timeout issue waiting for kube-apiserver #167

Open owenmorgan opened 7 years ago

owenmorgan commented 7 years ago

When I get to running

ansible-playbook -u core --ssh-common-args="-i /tmp/kubeform/terraform/aws/public-cloud/id_rsa -q" --inventory-file=inventory site.yml -e kube_apiserver_vip=$(cd /tmp/kubeform/terraform/aws/public-cloud && terraform output master_elb_hostname)

It runs fine until it reaches the "wait for kube-apiserver up" task, which times out.

TASK [kube-master : wait for kube-apiserver up] ****
fatal: [kube-master-2]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for 127.0.0.1:8080"}
fatal: [kube-master-0]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for 127.0.0.1:8080"}
fatal: [kube-master-1]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for 127.0.0.1:8080"}

I checked on one of the masters and see this:

ip-10-0-2-98 core # curl http://127.0.0.1:8080
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
ip-10-0-2-98 core # sudo systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2016-09-17 18:06:58 BST; 4min 55s ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 5072 (kubelet)
    Tasks: 7
   Memory: 37.8M
      CPU: 4.551s
   CGroup: /system.slice/kubelet.service
           └─5072 /kubelet --api-servers=http://127.0.0.1:8080 --network-plugin-dir=/etc/kubernetes/cni/net.d --network-plugin=cni --register-schedulable=false --allow-privileged=true --config=/etc/kubernetes/manifests --hostn

Sep 17 18:11:52 ip-10-0-2-98.eu-west-1.compute.internal kubelet-wrapper[5072]: I0917 17:11:52.114644 5072 kubelet.go:1197] Attempting to register node ip-10-0-2-98.eu-west-1.compute.internal
Sep 17 18:11:52 ip-10-0-2-98.eu-west-1.compute.internal kubelet-wrapper[5072]: I0917 17:11:52.114881 5072 kubelet.go:1200] Unable to register ip-10-0-2-98.eu-west-1.compute.internal with the apiserver: Post http://127.0.0.1
Sep 17 18:11:52 ip-10-0-2-98.eu-west-1.compute.internal kubelet-wrapper[5072]: E0917 17:11:52.405733 5072 reflector.go:205] pkg/kubelet/kubelet.go:286: Failed to list api.Node: Get http://127.0.0.1:8080/api/v1/nodes?fieldS
Sep 17 18:11:52 ip-10-0-2-98.eu-west-1.compute.internal kubelet-wrapper[5072]: E0917 17:11:52.420330 5072 generic.go:197] GenericPLEG: Unable to retrieve pods: Cannot connect to the Docker daemon. Is the docker daemon runni
Sep 17 18:11:52 ip-10-0-2-98.eu-west-1.compute.internal kubelet-wrapper[5072]: E0917 17:11:52.420417 5072 reflector.go:205] pkg/kubelet/config/apiserver.go:43: Failed to list api.Pod: Get http://127.0.0.1:8080/api/v1/pods?
Sep 17 18:11:52 ip-10-0-2-98.eu-west-1.compute.internal kubelet-wrapper[5072]: E0917 17:11:52.421192 5072 reflector.go:205] pkg/kubelet/kubelet.go:267: Failed to list api.Service: Get http://127.0.0.1:8080/api/v1/services?
Sep 17 18:11:53 ip-10-0-2-98.eu-west-1.compute.internal kubelet-wrapper[5072]: E0917 17:11:53.406291 5072 reflector.go:205] pkg/kubelet/kubelet.go:286: Failed to list api.Node: Get http://127.0.0.1:8080/api/v1/nodes?fieldS
Sep 17 18:11:53 ip-10-0-2-98.eu-west-1.compute.internal kubelet-wrapper[5072]: E0917 17:11:53.420731 5072 reflector.go:205] pkg/kubelet/config/apiserver.go:43: Failed to list api.Pod: Get http://127.0.0.1:8080/api/v1/pods?
Sep 17 18:11:53 ip-10-0-2-98.eu-west-1.compute.internal kubelet-wrapper[5072]: E0917 17:11:53.420890 5072 generic.go:197] GenericPLEG: Unable to retrieve pods: Cannot connect to the Docker daemon. Is the docker daemon runni
Sep 17 18:11:53 ip-10-0-2-98.eu-west-1.compute.internal kubelet-wrapper[5072]: E0917 17:11:53.421620 5072 reflector.go:205] pkg/kubelet/kubelet.go:267: Failed to list api.Service: Get http://127.0.0.1:8080/api/v1/services?

Any ideas?

Thanks

owenmorgan commented 7 years ago

It might be worth noting that when I SSH into one of the masters, I receive this message:

Failed Units: 3
  docker.service
  setup-network-environment.service
  docker.socket

enxebre commented 7 years ago

Hey @owenmorgan, what do the systemd logs for docker say?

tamsky commented 7 years ago

I see the same thing. Workers look like:

Failed Units: 2
  docker.service
  setup-network-environment.service

@enxebre here are the systemd docker.service logs for a master

core@ip-10-0-1-61 ~ $ journalctl --unit=docker.service  | cat
[snip until reboot]
-- Reboot --
Nov 17 20:45:22 ip-10-0-1-61.us-west-2.compute.internal systemd[1]: [/etc/systemd/system/docker.service.d/60-wait-for-flannel-config.conf:4] Unknown lvalue 'Restart' in section 'Unit'
Nov 17 20:45:22 ip-10-0-1-61.us-west-2.compute.internal systemd[1]: [/etc/systemd/system/docker.service.d/60-wait-for-flannel-config.conf:5] Unknown lvalue 'Restart' in section 'Unit'
Nov 17 20:45:29 ip-10-0-1-61.us-west-2.compute.internal systemd[1]: [/etc/systemd/system/docker.service.d/60-wait-for-flannel-config.conf:4] Unknown lvalue 'Restart' in section 'Unit'
Nov 17 20:45:29 ip-10-0-1-61.us-west-2.compute.internal systemd[1]: [/etc/systemd/system/docker.service.d/60-wait-for-flannel-config.conf:5] Unknown lvalue 'Restart' in section 'Unit'
Nov 17 20:45:30 ip-10-0-1-61.us-west-2.compute.internal systemd[1]: [/etc/systemd/system/docker.service.d/60-wait-for-flannel-config.conf:4] Unknown lvalue 'Restart' in section 'Unit'
Nov 17 20:45:30 ip-10-0-1-61.us-west-2.compute.internal systemd[1]: [/etc/systemd/system/docker.service.d/60-wait-for-flannel-config.conf:5] Unknown lvalue 'Restart' in section 'Unit'
Nov 17 20:45:35 ip-10-0-1-61.us-west-2.compute.internal systemd[1]: Started Docker Application Container Engine.
Nov 17 20:45:37 ip-10-0-1-61.us-west-2.compute.internal dockerd[1028]: dockerd: "dockerd" requires 0 arguments.
Nov 17 20:45:37 ip-10-0-1-61.us-west-2.compute.internal dockerd[1028]: Usage:        dockerd [OPTIONS]
Nov 17 20:45:37 ip-10-0-1-61.us-west-2.compute.internal systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Nov 17 20:45:37 ip-10-0-1-61.us-west-2.compute.internal systemd[1]: docker.service: Unit entered failed state.
Nov 17 20:45:37 ip-10-0-1-61.us-west-2.compute.internal systemd[1]: docker.service: Failed with result 'exit-code'.
[snip]
enxebre commented 7 years ago

https://github.com/Capgemini/kubeform/blob/master/terraform/aws/public-cloud/master-cloud-config.yml.tpl#L46 seems wrong and duplicated. Restart=always should be inside [Service], not [Unit]. The same probably applies to the other cloud-config files.
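
A minimal sketch of how the drop-in might look with the setting moved, assuming the file in question is the 60-wait-for-flannel-config.conf drop-in named in the log above. Only `Restart=always` and the section names come from this thread; the drop-in's other directives are not shown here and are deliberately left out rather than guessed.

```ini
# /etc/systemd/system/docker.service.d/60-wait-for-flannel-config.conf
# Sketch only: the real drop-in has other directives that do not appear in this thread.

[Unit]
# Ordering and dependency directives stay in this section.
# Restart= is not a [Unit] setting, which is what the
# "Unknown lvalue 'Restart' in section 'Unit'" warnings report.

[Service]
# Restart= belongs to [Service] and should appear only once.
Restart=always
```

systemd logs an unknown lvalue as a warning and ignores that line, so this may not be the whole story: the same log shows dockerd exiting with `"dockerd" requires 0 arguments`, which suggests the arguments passed to dockerd are also worth checking.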

methril commented 7 years ago

I had similar issues on DO. I get fewer SSH timeouts after adding the following to my .ssh/config file:

Host kube-*
  Port 22
  User core
  StrictHostKeyChecking=no
  UserKnownHostsFile=/dev/null

Later on I had more ansible-playbook errors caused by out-of-memory conditions in some steps, and the candidates to be killed included the kube-apiserver and others. I solved it by using 4 GB machines on DO.