geerlingguy / ansible-role-kubernetes

Ansible Role - Kubernetes
https://galaxy.ansible.com/geerlingguy/kubernetes/
MIT License

Install fails ubuntu / centos 3 node (1 master, 2 node) #11

Closed: tonyppe closed this issue 4 years ago

tonyppe commented 6 years ago

Hi there, I tried to deploy this onto three Ubuntu 16 instances, but the result is that the Kubernetes API and other services do not start. The error actually surfaces at the point where the role tries to download and import the YAML, because it cannot connect to port 6443 (the service is not running).

So I destroyed those three instances and deployed CentOS 7 instead. Again, the install fails, and I am trying to debug it. Are you aware of these issues? Any suggestions? Both the Ubuntu and CentOS images are vanilla and running on OpenStack.

tonyppe commented 6 years ago

I was able to get the service started on the master, but later the install fails again with this error:

{ "msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stdout'\n\nThe error appears to have been in '/projects/_17__infrastructure_spinnaker/roles/geerlingguy.kubernetes/tasks/node-setup.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: Join node to Kubernetes master\n ^ here\n" }

davidcodesido commented 6 years ago

I had this issue today too, but I was running the installation separately on the master and the nodes. When I ran everything together from my local machine, the node installation waited for the master to finish, so the output of the join command was available to the nodes. Basically, this task:

# Set up nodes.
- name: Get the kubeadm join command from the Kubernetes master.
  shell: kubeadm token create --print-join-command
  changed_when: False
  when: kubernetes_role == 'master'
  run_once: True
  register: kubernetes_join_command

Check if that's your case too.
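
For context, the node side presumably consumes that registered variable with something along these lines (a hypothetical sketch; the role's actual node-setup.yml may differ):

# Hypothetical sketch of the consuming task. If the master task above never
# ran in the same play (e.g. the play only targeted nodes), the registered
# variable is a skipped-result dict with no 'stdout' key, which produces
# exactly the "'dict object' has no attribute 'stdout'" error quoted above.
- name: Join node to Kubernetes master
  shell: "{{ kubernetes_join_command.stdout }}"
  when: kubernetes_role == 'node'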

geerlingguy commented 6 years ago

Yeah, that is one assumption I've made that can trip people up; I'm assuming this role is always being run from a host outside the infra, e.g. my laptop runs it, sets up the master node, then sets up the other nodes, all in one go. If you just run it against one node or from the nodes themselves, it will definitely fail.
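
Roughly, the expected layout looks like this (a sketch; the hostnames and group name are illustrative, but kubernetes_role is the variable the role actually checks):

# inventory.yml
all:
  children:
    k8s:
      hosts:
        master1:
          kubernetes_role: master
        node1:
          kubernetes_role: node
        node2:
          kubernetes_role: node

# playbook.yml: one play covering master and nodes together, run from a
# control machine outside the cluster:
#   ansible-playbook -i inventory.yml playbook.yml
- hosts: k8s
  become: true
  roles:
    - geerlingguy.kubernetes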

geerlingguy commented 6 years ago

For a canonical example (there's also a Vagrantfile that can be used for local testing), see: https://github.com/geerlingguy/raspberry-pi-dramble/tree/kubernetes (work in progress)

tonyppe commented 6 years ago

I'm going to try this again using a single instance for both the master and worker. When I raised this, I was trying to deploy the master and two workers in one deployment from Ansible (actually Ansible Tower). The way I have configured this, Ansible Tower runs the ansible-playbook command, which SSHes into the master node, runs the install there, and then does the workers.

I'm about ready to run this again on a single instance. I was delayed because I got caught up reading https://www.jeffgeerling.com/blog/2018/kubernetes-complexity , specifically the part about compromised Docker Hub images. I tried to find some known infected Docker Hub images so I could paste their direct URLs into https://virustotal.com/ ; I want to see whether that tool can flag malicious code inside an image before someone downloads and runs it.

tonyppe commented 6 years ago

The install goes without issue using an all-in-one instance deployment. Cheers.

geerlingguy commented 6 years ago

Possibly related: #10

h0x91b commented 5 years ago

In my case, it was because kubelet failed to start due to:

F1030 11:01:13.231850    8210 server.go:273] failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "system" is different from docker cgroup driver: "systemd"

I fixed it by adding this variable:

kubernetes_kubelet_extra_args: '--cgroup-driver=systemd'
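
If you hit the same mismatch, the override can go wherever you normally set the role's variables, e.g. in the play (a sketch; host group name is illustrative):

# Align kubelet's cgroup driver with Docker's (systemd in this case).
- hosts: k8s
  become: true
  vars:
    kubernetes_kubelet_extra_args: '--cgroup-driver=systemd'
  roles:
    - geerlingguy.kubernetes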

stale[bot] commented 4 years ago

This issue has been marked 'stale' due to lack of recent activity. If there is no further activity, the issue will be closed in another 30 days. Thank you for your contribution!

Please read this blog post to see the reasons why I mark issues as stale.

geerlingguy commented 4 years ago

Closing, as the typical fix is to check the kubelet logs on the machine where there are issues and fix whatever they report. I think there are enough hints in this issue to help anyone else with similar problems.
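
For anyone landing here later, a quick way to pull those logs across hosts with Ansible (a sketch; assumes a systemd-managed kubelet and an illustrative k8s host group):

- hosts: k8s
  become: true
  tasks:
    # Grab the last 50 kubelet log lines from each host; errors like the
    # cgroup-driver mismatch above show up here.
    - name: Show recent kubelet logs
      command: journalctl -u kubelet -n 50 --no-pager
      register: kubelet_logs
      changed_when: false

    - name: Print the logs
      debug:
        var: kubelet_logs.stdout_lines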