Oh, I just discovered I can even SSH as root into okd-service-01 from the CentOS hypervisor without a password prompt.
[root@prealpha tmp]# ssh core@10.0.0.100
Warning: Permanently added '10.0.0.100' (ECDSA) to the list of known hosts.
core@10.0.0.100: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
[root@prealpha tmp]# ssh root@10.0.0.100
Activate the web console with: systemctl enable --now cockpit.socket
[root@okd-service-01 ~]# echo "I'm in!"
I did this during a fresh attempt at running the playbook, during the wait for okd-service-01 to boot step.
So, I am able to SSH in via whatever creds are set up with ansible, but ansible is somehow still timing out.
Thanks again.
edit: Is the CentOS hypervisor supposed to connect to 10.0.0.100, or is my laptop running the ansible commands? My laptop is on a different network, but I could run ansible on a VM in the same network if required.
Hi, I recently hit something similar, and it was caused by some old references in the known_hosts file on the hypervisor: the machine was waiting for someone to accept the "unknown host" warning from ssh. To be able to debug this better, I centralized the wait_for_connection in a single task. Can you make sure that you don't have any reference to the machines in the known_hosts file on the host? Also, try with the latest master to use this centralized wait_for_connection.
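A minimal sketch of those two suggestions as Ansible tasks might look like the following (the IP is the guest from this thread; the delay/timeout values are assumptions, not kubeinit's actual code):

- name: remove any stale known_hosts entry for the guest
  ansible.builtin.known_hosts:
    name: 10.0.0.100   # guest IP from this thread
    state: absent

- name: wait for okd-service-01 to boot
  ansible.builtin.wait_for_connection:
    delay: 10          # assumed values
    timeout: 600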
From your 'laptop' you should not be able to reach the guests directly (like ssh root@10.0.0.100), only from the host. When you run the playbook on your laptop you always connect to the hypervisor and jump to the guests using an ssh proxy, so as long as you can reach the host it should work.
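To illustrate, that jump is wired up with per-host ssh options roughly like this sketch (expressed here as YAML host vars; the real inventory lines, quoted later in this thread, use root@nyctea as the hypervisor):

ansible_host: 10.0.0.100
ansible_ssh_extra_args: '-o StrictHostKeyChecking=no -o ProxyCommand="ssh -W %h:%p -q root@nyctea"'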
Please feel free to ask any questions you have, and if you find any issues, feel free to raise them in the project's repo (https://www.github.com/kubeinit/kubeinit). You can also jump in and ask in the slack channel https://kubernetes.slack.com/archives/C01FKK19T0B. And it would be awesome if you could star the project to keep up with updates and new features.
Thanks for this! Starred the repo. :)
I did have a known_hosts file with the offending IP on my CentOS hypervisor node. I removed the entry (leaving an empty known_hosts file) and ran the ansible-playbook again.
It proceeded as normal, timing out at the same wait for okd-service-01 to boot step. The known_hosts file is still blank. I haven't tried SSHing in during the process this time.
# Done during "wait for okd-service-01 to boot"
[root@prealpha ~]# cat ~/.ssh/known_hosts
[root@prealpha ~]#
After waiting for it to time out, I pulled the latest master and tried again.
cd kubeinit
git fetch
git pull origin master
My inventory file is unmodified, so I simply re-ran the original ansible-playbook command against the latest master code.
(Note: my hypervisor's known_hosts file still exists and is blank at this point.)
The issue persisted. Then I saw you modified 20_check_nodes_up.yml, and I switched which task was commented out to the new one you left for me. (Thanks by the way! Super nice of you.)
With the manual replacement for wait_for_connection, the results were slightly different: it failed, then passed, then failed.
TASK [../../roles/kubeinit_libvirt : wait for okd-service-01 to boot] ******************************************************
FAILED - RETRYING: wait for okd-service-01 to boot (30 retries left).
FAILED - RETRYING: wait for okd-service-01 to boot (29 retries left).
changed: [hypervisor-01] => {"attempts": 3, "changed": true, "cmd": "set -o pipefail\nssh -o ConnectTimeout=5 -o BatchMode=yes -o StrictHostKeyChecking=no root@10.0.0.100 'echo \"connected\"' || true\n", "delta": "0:00:02.189656", "end": "2021-02-09 11:27:29.637112", "rc": 0, "start": "2021-02-09 11:27:27.447456", "stderr": "Warning: Permanently added '10.0.0.100' (ECDSA) to the list of known hosts.", "stderr_lines": ["Warning: Permanently added '10.0.0.100' (ECDSA) to the list of known hosts."], "stdout": "connected", "stdout_lines": ["connected"]}
TASK [Upgrade packages and restart] ****************************************************************************************
[WARNING]: The loop variable 'cluster_role_item' is already in use. You should set the `loop_var` value in the
`loop_control` option for the task to something else to avoid variable collisions and unexpected behavior.
TASK [../../roles/kubeinit_okd : update packages] **************************************************************************
fatal: [hypervisor-01]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 10.0.0.100 port 22: Operation timed out", "unreachable": true}
PLAY RECAP *****************************************************************************************************************
hypervisor-01 : ok=86 changed=20 unreachable=1 failed=0 skipped=23 rescued=0 ignored=3
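For reference, judging from the cmd and attempts fields in the log above, that replacement boot check is a shell-with-retries task roughly like this sketch (the retries count matches the log; the delay and until condition are assumptions):

- name: wait for okd-service-01 to boot
  ansible.builtin.shell: |
    set -o pipefail
    ssh -o ConnectTimeout=5 -o BatchMode=yes -o StrictHostKeyChecking=no root@10.0.0.100 'echo "connected"' || true
  register: boot_check
  retries: 30
  delay: 10
  until: "'connected' in boot_check.stdout"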
At this point, if I cat ~/.ssh/known_hosts, I do see an entry for 10.0.0.100 (as per the wait for okd-service-01 to boot task). And if, on the hypervisor, I do ssh root@10.0.0.100, it works fine. So, the entry in known_hosts seems correct.
I'm stumped, any ideas? I might jump on Slack if you're available today, although I have to get an OpenShift deployment up ASAP, so I may just fall back to a manual bare-metal single-node-cluster installation if I can't make more progress on this soon.
Thanks again, any advice is appreciated.
After checking and checking the code, @acoard-aot, the problem is that Ansible is not honoring the ssh proxy (that's why it cannot connect). In this case, each guest is inside a libvirt network inside the host, so from the host itself you will be able to reach it, but not from the place you trigger ansible (that's the reason for the ssh proxy). The weird thing is that this setup has been working fine for me for about a year. After checking the Ansible docs, I found another way of configuring the ssh proxy.
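For anyone hitting the same thing: the pattern the Ansible docs describe for jumping through a bastion host uses ansible_ssh_common_args, roughly like this sketch (the hostname is from this thread's inventory; this is not necessarily the exact change that landed in kubeinit):

ansible_ssh_common_args: '-o ProxyCommand="ssh -W %h:%p -q root@nyctea"'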
Same issue, but fix #182 didn't fix it for me; I had to comment out these two lines in kubeinit/hosts/okd/inventory:
ansible_ssh_pipelining=True
ansible_ssh_extra_args='-o StrictHostKeyChecking=no -o ProxyCommand="ssh -W %h:%p -q root@nyctea"' # ControlMaster=auto -o ControlPersist=54s -o ControlPath=~/.ssh/ansible-%r@%h:%p
@nirolfa thanks for reporting this; it should work if you deploy the ansible playbook from the first hypervisor (I assume that's the way it works for you).
In the latest merged PR, @nirolfa @acoard-aot, this should be fixed now. There are some minor improvements that still need to be addressed, but it should work.
Please feel free to reopen or jump in the slack channel if you have any other questions.
Hey there. Great project. I've been making progress, but recently ran into this error.
But from what I can tell, the device is reachable? Here is what I see from my hypervisor (CentOS 8).
For what it's worth, I had another issue I had to fix, network related: I had to modify my /etc/resolv.conf and change my nameserver IP. This was due to a misconfig/old config of pfSense (my DHCP server). I've tried fixing this config, but I honestly don't think it's related to SSH/ping over IP. Any ideas? I suspect maybe I misconfigured my inventory too? Specifically, here are some changes I made:
I also made a few name/domain changes, but those are the relevant ones, I believe.
Logs
Lastly, here's a longer snippet of my logs from running the playbook, in case it helps:
Also, here's journalctl -f -u libvirtd. I see some references to SELinux; might that be something? Thanks!