Kubeinit / kubeinit

Ansible automation to have a KUBErnetes cluster INITialized as soon as possible...
https://www.kubeinit.org
Apache License 2.0

timed out waiting for ping module test: Failed to connect to the host via ssh: ssh: connect to host 10.0.0.100 port 22: Operation timed out #179

Closed by acoard-aot 3 years ago

acoard-aot commented 3 years ago

Hey there. Great project. I've been making progress, but recently ran into this error:

TASK [../../roles/kubeinit_libvirt : wait for okd-service-01 to boot] *********************************************************
fatal: [hypervisor-01 -> 10.0.0.100]: FAILED! => {"changed": false, "elapsed": 611, "msg": "timed out waiting for ping module test: Failed to connect to the host via ssh: ssh: connect to host 10.0.0.100 port 22: Operation timed out"}

But, from what I can tell, the device is reachable. Here is the output from my hypervisor (CentOS 8):


[root@prealpha tmp]# virsh list --all
 Id   Name             State
--------------------------------
 2    okd-service-01   running

[root@prealpha tmp]# ping 10.0.0.100
PING 10.0.0.100 (10.0.0.100) 56(84) bytes of data.
64 bytes from 10.0.0.100: icmp_seq=1 ttl=64 time=0.737 ms
64 bytes from 10.0.0.100: icmp_seq=2 ttl=64 time=0.456 ms
64 bytes from 10.0.0.100: icmp_seq=3 ttl=64 time=0.409 ms
^C
--- 10.0.0.100 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 35ms
rtt min/avg/max/mdev = 0.409/0.534/0.737/0.144 ms

For what it's worth, I had to fix another network-related issue earlier: I had to modify my /etc/resolv.conf and change my nameserver IP, due to a misconfigured/stale pfSense setup (pfSense is my DHCP server). I've since tried fixing that config, but I honestly don't think a DNS issue is related here, since the SSH and ping tests above use the IP directly.

Any ideas? I suspect I may have misconfigured my inventory as well. Specifically, here are some changes I made:

# CentOS can access the internet via ens192. There is no eth1, but ens192 comes out of the box with my CentOS 8 install.
kubeinit_inventory_network_bridge_external_dev=ens192
# This is the public IP of the pfSense firewall in front of it
kubeinit_inventory_network_bridge_external_ip=xxx.yyy.zz.xx
# This is the LAN IP of the pfSense firewall, on ens192
kubeinit_inventory_network_bridge_external_gateway=192.168.255.1

I also made a few name / domain changes, but those are the relevant ones I believe.

Logs

Lastly, here's a bit longer snippet of my logs when running the playbook in case it helps

TASK [../../roles/kubeinit_libvirt : Create VM definition for the service nodes] **********************************************
changed: [hypervisor-01 -> 207.216.46.92] => {"changed": true, "cmd": "virt-install    --connect qemu:///system    --name=okd-service-01    --memory memory=12288    --cpuset=auto    --vcpus=8,maxvcpus=16    --os-type=linux    --os-variant=rhel8.0    --autostart                            --network network=kimgtnet0,mac=52:54:00:47:94:58,model=virtio                          --graphics none    --noautoconsole    --import    --disk /var/lib/libvirt/images/okd-service-01.qcow2,format=qcow2,bus=virtio\n", "delta": "0:00:04.939186", "end": "2021-02-08 19:15:03.984983", "rc": 0, "start": "2021-02-08 19:14:59.045797", "stderr": "", "stderr_lines": [], "stdout": "\nStarting install...\nDomain creation completed.", "stdout_lines": ["", "Starting install...", "Domain creation completed."]}

TASK [../../roles/kubeinit_libvirt : Create VM definition for the service nodes] **********************************************
skipping: [hypervisor-01] => {"changed": false, "skip_reason": "Conditional result was False"}

TASK [Check that the service node is up and running] **************************************************************************
[WARNING]: The loop variable 'cluster_role_item' is already in use. You should set the `loop_var` value in the `loop_control`
option for the task to something else to avoid variable collisions and unexpected behavior.

TASK [../../roles/kubeinit_libvirt : wait for okd-service-01 to boot] *********************************************************
fatal: [hypervisor-01 -> 10.0.0.100]: FAILED! => {"changed": false, "elapsed": 611, "msg": "timed out waiting for ping module test: Failed to connect to the host via ssh: ssh: connect to host 10.0.0.100 port 22: Operation timed out"}

PLAY RECAP ********************************************************************************************************************
hypervisor-01              : ok=84   changed=18   unreachable=0    failed=1    skipped=23   rescued=0    ignored=3   

Also, here's journalctl -f -u libvirtd. I see some references to SELinux; might that be the problem?

Feb 08 19:12:35 prealpha.openshift.aot-technologies.com systemd[1]: libvirtd.service: Succeeded.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: libvirtd.service: Found left-over process 2013 (dnsmasq) in control group while starting unit. Ignoring.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: libvirtd.service: Found left-over process 2014 (dnsmasq) in control group while starting unit. Ignoring.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: libvirtd.service: Found left-over process 157585 (dnsmasq) in control group while starting unit. Ignoring.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: libvirtd.service: Found left-over process 157586 (dnsmasq) in control group while starting unit. Ignoring.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: Starting Virtualization daemon...
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: Started Virtualization daemon.
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq[157585]: read /etc/hosts - 3 addresses
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq[2013]: read /etc/hosts - 3 addresses
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq[2013]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq[157585]: read /var/lib/libvirt/dnsmasq/kimgtnet0.addnhosts - 0 addresses
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq-dhcp[2013]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq-dhcp[157585]: read /var/lib/libvirt/dnsmasq/kimgtnet0.hostsfile
Feb 08 19:13:39 prealpha.openshift.aot-technologies.com dnsmasq[157585]: exiting on receipt of SIGTERM
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167683]: listening on kimgtbr0(#11): 10.0.0.254
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167690]: started, version 2.79 cachesize 150
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167690]: compile time options: IPv6 GNU-getopt DBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth DNSSEC loop-detect inotify
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq-dhcp[167690]: DHCP, IP range 10.0.0.1 -- 10.0.0.253, lease time 1h
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq-dhcp[167690]: DHCP, sockets bound exclusively to interface kimgtbr0
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167690]: using nameserver 10.0.0.100#53
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167690]: read /etc/hosts - 3 addresses
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167690]: read /var/lib/libvirt/dnsmasq/kimgtnet0.addnhosts - 0 addresses
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq-dhcp[167690]: read /var/lib/libvirt/dnsmasq/kimgtnet0.hostsfile
Feb 08 19:14:23 prealpha.openshift.aot-technologies.com libvirtd[163819]: libvirt version: 6.0.0, package: 28.module_el8.3.0+555+a55c8938 (CentOS Buildsys <bugs@centos.org>, 2020-11-04-01:04:00, )
Feb 08 19:14:23 prealpha.openshift.aot-technologies.com libvirtd[163819]: hostname: prealpha.openshift.aot-technologies.com
Feb 08 19:14:23 prealpha.openshift.aot-technologies.com libvirtd[163819]: Domain id=1 name='guestfs-r2s6b7ck88qymrqe' uuid=6fba960a-255d-4baa-9c24-c506801ae5b2 is tainted: custom-argv
Feb 08 19:14:23 prealpha.openshift.aot-technologies.com libvirtd[163819]: Domain id=1 name='guestfs-r2s6b7ck88qymrqe' uuid=6fba960a-255d-4baa-9c24-c506801ae5b2 is tainted: host-cpu
Feb 08 19:14:27 prealpha.openshift.aot-technologies.com libvirtd[163819]: missing device in NIC_RX_FILTER_CHANGED event
Feb 08 19:14:56 prealpha.openshift.aot-technologies.com libvirtd[163819]: 2021-02-09 00:14:56.588+0000: 168939: info : libvirt version: 6.0.0, package: 28.module_el8.3.0+555+a55c8938 (CentOS Buildsys <bugs@centos.org>, 2020-11-04-01:04:00, )
Feb 08 19:14:56 prealpha.openshift.aot-technologies.com libvirtd[163819]: 2021-02-09 00:14:56.588+0000: 168939: info : hostname: prealpha.openshift.aot-technologies.com
Feb 08 19:14:56 prealpha.openshift.aot-technologies.com libvirtd[163819]: 2021-02-09 00:14:56.588+0000: 168939: warning : virSecuritySELinuxRestoreFileLabel:1503 : cannot lookup default selinux label for /tmp/libguestfsO8wacC/console.sock
Feb 08 19:14:56 prealpha.openshift.aot-technologies.com libvirtd[163819]: 2021-02-09 00:14:56.588+0000: 168939: warning : virSecuritySELinuxRestoreFileLabel:1503 : cannot lookup default selinux label for /tmp/libguestfsO8wacC/guestfsd.sock

Thanks!

acoard-aot commented 3 years ago

Oh, I just discovered I can even SSH as root into okd-service-01 from the CentOS hypervisor without a password prompt.


[root@prealpha tmp]# ssh core@10.0.0.100
Warning: Permanently added '10.0.0.100' (ECDSA) to the list of known hosts.
core@10.0.0.100: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
[root@prealpha tmp]# ssh root@10.0.0.100
Activate the web console with: systemctl enable --now cockpit.socket

[root@okd-service-01 ~]# echo "I'm in!"

I did this during a fresh run of the playbook, while it was sitting in the wait for okd-service-01 to boot task.

So I am able to SSH in with whatever credentials Ansible has set up, but Ansible is somehow still timing out.

Thanks again.

edit: Is the CentOS hypervisor supposed to connect to 10.0.0.100, or is my laptop (which runs the ansible commands) supposed to? My laptop is on a different network, but I could run ansible from a VM on the same network if required.

ccamacho commented 3 years ago

Hi, I recently hit something similar, and it turned out to be stale references in the hypervisor's known_hosts file: the machine was waiting for someone to accept the "unknown host" warning from ssh. To make this easier to debug, I centralized the wait_for_connection logic in a single task. Can you make sure you don't have any reference to the machines in the known_hosts file on the host? Also, try the latest master to pick up the centralized wait_for_connection.
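A stale entry can be removed without hand-editing the file; a minimal sketch, assuming the guest IP from this thread:

```shell
# Remove any cached host key for the guest so the next ssh connection
# re-records it cleanly. ssh-keygen -R edits the given known_hosts file
# in place and keeps a .old backup.
ssh-keygen -R 10.0.0.100 -f ~/.ssh/known_hosts || true  # || true: fine if no entry exists
```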

From your laptop you should not be able to reach the guests directly (like ssh root@10.0.0.100); only the host can. When you run the playbook from your laptop, you always connect to the hypervisor and jump to the guests through an SSH proxy, so as long as you can reach the host it should work.
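For context, this hypervisor-as-jump-host setup is typically expressed in the inventory with an SSH ProxyCommand. A minimal sketch (the group name is hypothetical; the `nyctea` alias and guest IP come from this thread):

```ini
# Hypothetical inventory fragment illustrating the jump-host setup
[service_nodes]
okd-service-01 ansible_host=10.0.0.100

[service_nodes:vars]
# All traffic to the guest is tunneled through the hypervisor (aliased "nyctea")
ansible_ssh_extra_args='-o StrictHostKeyChecking=no -o ProxyCommand="ssh -W %h:%p -q root@nyctea"'
```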

Please feel free to ask any questions, and if you find any issues, raise them in the project's repo (https://www.github.com/kubeinit/kubeinit). You can also jump in and ask in the Slack channel https://kubernetes.slack.com/archives/C01FKK19T0B. And it would be awesome if you starred the project to keep up with updates and new features.

acoard-aot commented 3 years ago

Thanks for this! Starred the repo. :)

I did have a known_hosts with the offending IP in my CentOS hypervisor node. I removed the entry (leaving an empty known_hosts file) and ran the ansible-playbook again.

It proceeded as normal, timing out at the same wait for okd-service-01 to boot step. The known_hosts file is still blank. I haven't tried SSHing in during the process this time.

# Done during "wait for okd-service-01 to boot"
[root@prealpha ~]# cat ~/.ssh/known_hosts
[root@prealpha ~]#

After waiting for it to timeout, I now pulled the latest master and tried again.

cd kubeinit
git fetch
git pull origin master

My inventory file is unmodified, so I will simply re-run the original ansible-playbook command with the latest master code.

(Note: My hypervisor's known_hosts still exists and is blank at this point.)

The issue persisted. Then I saw you modified 20_check_nodes_up.yml, so I switched the commented-out task to the new one you left for me. (Thanks, by the way! Super nice of you.)

With the manual replacement for wait_for_connection, the results were slightly different: it failed twice, then connected, then a later task failed.

TASK [../../roles/kubeinit_libvirt : wait for okd-service-01 to boot] ******************************************************
FAILED - RETRYING: wait for okd-service-01 to boot (30 retries left).
FAILED - RETRYING: wait for okd-service-01 to boot (29 retries left).
changed: [hypervisor-01] => {"attempts": 3, "changed": true, "cmd": "set -o pipefail\nssh    -o ConnectTimeout=5    -o BatchMode=yes    -o StrictHostKeyChecking=no    root@10.0.0.100 'echo \"connected\"' || true\n", "delta": "0:00:02.189656", "end": "2021-02-09 11:27:29.637112", "rc": 0, "start": "2021-02-09 11:27:27.447456", "stderr": "Warning: Permanently added '10.0.0.100' (ECDSA) to the list of known hosts.", "stderr_lines": ["Warning: Permanently added '10.0.0.100' (ECDSA) to the list of known hosts."], "stdout": "connected", "stdout_lines": ["connected"]}

TASK [Upgrade packages and restart] ****************************************************************************************
[WARNING]: The loop variable 'cluster_role_item' is already in use. You should set the `loop_var` value in the
`loop_control` option for the task to something else to avoid variable collisions and unexpected behavior.

TASK [../../roles/kubeinit_okd : update packages] **************************************************************************
fatal: [hypervisor-01]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 10.0.0.100 port 22: Operation timed out", "unreachable": true}

PLAY RECAP *****************************************************************************************************************
hypervisor-01              : ok=86   changed=20   unreachable=1    failed=0    skipped=23   rescued=0    ignored=3   

At this point, if I cat ~/.ssh/known_hosts I do see an entry for 10.0.0.100 (as per the wait for okd-service-01 to boot task). And if on the hypervisor I do ssh root@10.0.0.100, it works fine. So, the entry in known_host seems correct.

I'm stumped; any ideas? I might jump on Slack if you're available today, although I need to get an OpenShift deployment up ASAP, so I may fall back to a manual bare-metal single-node-cluster installation if I can't make more progress soon.

Thanks again, any advice is appreciated.

ccamacho commented 3 years ago

After checking and re-checking the code, @acoard-aot, the problem is that Ansible is not honoring the SSH proxy (that's why it cannot connect). Each guest sits inside a libvirt network on the host, so it is reachable from the host itself but not from the machine that triggers Ansible; that is the whole reason for the SSH proxy. The weird thing is that this setup has worked fine for about a year. After checking the Ansible docs, I found another way of configuring the SSH proxy.
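For illustration, the two usual ways to express such a proxy in an Ansible inventory differ only in which variable carries the option; a sketch with values taken from this thread (not necessarily the exact contents of the fix):

```ini
# Option 1: extra args, applied only to the ssh binary
okd-service-01 ansible_host=10.0.0.100 ansible_ssh_extra_args='-o ProxyCommand="ssh -W %h:%p -q root@nyctea"'

# Option 2: common args, also applied to scp/sftp file transfers
okd-service-01 ansible_host=10.0.0.100 ansible_ssh_common_args='-o ProxyCommand="ssh -W %h:%p -q root@nyctea"'
```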

Fixed by: https://github.com/Kubeinit/kubeinit/pull/182

nirolfa commented 3 years ago

Same issue here, but fix #182 didn't fix it for me. I had to comment out these two lines in kubeinit/hosts/okd/inventory:

ansible_ssh_pipelining=True
ansible_ssh_extra_args='-o StrictHostKeyChecking=no -o ProxyCommand="ssh -W %h:%p -q root@nyctea"' # ControlMaster=auto -o ControlPersist=54s -o ControlPath=~/.ssh/ansible-%r@%h:%p

ccamacho commented 3 years ago

@nirolfa thanks for reporting this; it should work if you deploy the ansible playbook from the first hypervisor (I assume that's the way it works for you).

With the latest merged PR, @nirolfa @acoard-aot, this should be fixed now. There are some minor improvements still to be addressed, but it should work.

Please feel free to reopen, or jump in the Slack channel if you have any other questions.