I was able to reproduce this in the CI; I was only able to provision the hosts with F35, so the issue is reproduced out of the box.
[root@nyctea ~]# podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
12e9c53c1d0b k8s.gcr.io/pause:3.5 6 minutes ago Up 6 minutes ago 0.0.0.0:26973->8000/tcp f00f36a649b7-infra
964c5f7879e7 docker.io/recordsansible/ara-api:latest bash -c /usr/loca... 6 minutes ago Up 6 minutes ago 0.0.0.0:26973->8000/tcp api-server
d2718ae04ac9 k8s.gcr.io/pause:3.5 4 minutes ago Up 4 minutes ago ef1044587dd6-infra
af205bdd8a1a localhost/kubeinit/k8scluster-credentials:latest sleep infinity 3 minutes ago Up 3 minutes ago k8scluster-credentials
[root@nyctea ~]# podman pod list
POD ID NAME STATUS CREATED INFRA ID # OF CONTAINERS
ef1044587dd6 k8scluster-service-pod Running 4 minutes ago d2718ae04ac9 2
f00f36a649b7 ara-pod Running 6 minutes ago 12e9c53c1d0b 2
[root@nyctea ~]# podman port -a
12e9c53c1d0b 8000/tcp -> 0.0.0.0:26973
964c5f7879e7 8000/tcp -> 0.0.0.0:26973
[root@nyctea ~]# podman ps -a --pod
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES POD ID PODNAME
12e9c53c1d0b k8s.gcr.io/pause:3.5 7 minutes ago Up 7 minutes ago 0.0.0.0:26973->8000/tcp f00f36a649b7-infra f00f36a649b7 ara-pod
964c5f7879e7 docker.io/recordsansible/ara-api:latest bash -c /usr/loca... 7 minutes ago Up 7 minutes ago 0.0.0.0:26973->8000/tcp api-server f00f36a649b7 ara-pod
d2718ae04ac9 k8s.gcr.io/pause:3.5 4 minutes ago Up 4 minutes ago ef1044587dd6-infra ef1044587dd6 k8scluster-service-pod
af205bdd8a1a localhost/kubeinit/k8scluster-credentials:latest sleep infinity 3 minutes ago Up 3 minutes ago k8scluster-credentials ef1044587dd6 k8scluster-service-pod
@Gl1TcH-1n-Th3-M4tR1x @gmarcy I'm not sure what is wrong here; I'll try to test F34 to see if there is something different there.
@Gl1TcH-1n-Th3-M4tR1x could you also tell us what the ansible playbook client-side environment is like? I can do the git clone on the same F35 host that I'm deploying to, similar to how the github actions run, but would like to match. Another thought would be to do the clone, run the container build, and run the install from the container, just to rule out any issues on the host running the playbook.
The Ansible playbook client is the same Fedora 35 hypervisor-01; basically, I run the ansible playbook from the hypervisor, and all I have is an alias for nyctea on my hypervisor-01. Let me try to run the build from the container.
@Gl1TcH-1n-Th3-M4tR1x are you running the ansible-playbook under root or a non-root user?
@Gl1TcH-1n-Th3-M4tR1x I found this in my Fedora kickstart; wondering if you updated Fedora to accept RSA keys?
update-crypto-policies --set DEFAULT:FEDORA32
I had forgotten about this, as I had switched from rsa to ed25519 when I added support for the KUBEINIT_COMMON_SSH_KEYTYPE environment variable.
$ export | grep SSH_KEYTYPE
declare -x KUBEINIT_COMMON_SSH_KEYTYPE="ed25519"
More info here
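As an aside, if the client is switched fully to ed25519 the crypto-policy change for RSA should not be needed at all; a minimal sketch, assuming the key does not exist yet and nyctea resolves to the hypervisor:
$ export KUBEINIT_COMMON_SSH_KEYTYPE="ed25519"
$ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
$ ssh-copy-id -i ~/.ssh/id_ed25519.pub root@nyctea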
@Gl1TcH-1n-Th3-M4tR1x are you running the ansible-playbook under root or a non-root user?
I'm running the playbook with a non-root user.
Running from the Container failed even faster:
TASK [kubeinit.kubeinit.kubeinit_prepare : Put secret values into a dictionary] *************************************************************************************************************
task path: /home/kiuser/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_prepare/tasks/gather_kubeinit_secrets.yml:39
skipping: [localhost] => (item=None) => {
"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
"changed": false
}
skipping: [localhost] => (item=None) => {
"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
"changed": false
}
skipping: [localhost] => (item=None) => {
"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
"changed": false
}
skipping: [localhost] => (item=None) => {
"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
"changed": false
}
skipping: [localhost] => (item=None) => {
"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
"changed": false
}
skipping: [localhost] => (item=None) => {
"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
"changed": false
}
skipping: [localhost] => (item=None) => {
"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
"changed": false
}
skipping: [localhost] => {
"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
"changed": false
}
TASK [kubeinit.kubeinit.kubeinit_prepare : Add secrets to kubeinit secrets] *****************************************************************************************************************
task path: /home/kiuser/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_prepare/tasks/gather_kubeinit_secrets.yml:49
fatal: [localhost]: FAILED! => {
"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"
}
NO MORE HOSTS LEFT **************************************************************************************************************************************************************************
PLAY RECAP **********************************************************************************************************************************************************************************
localhost : ok=6 changed=1 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
@Gl1TcH-1n-Th3-M4tR1x I found this in my Fedora kickstart; wondering if you updated Fedora to accept RSA keys?
update-crypto-policies --set DEFAULT:FEDORA32
I had forgotten about this, as I had switched from rsa to ed25519 when I added support for the KUBEINIT_COMMON_SSH_KEYTYPE environment variable.
$ export | grep SSH_KEYTYPE
declare -x KUBEINIT_COMMON_SSH_KEYTYPE="ed25519"
More info here
I updated the Fedora 35 crypto policies, set KUBEINIT_COMMON_SSH_KEYTYPE to ed25519, and got:
TASK [kubeinit.kubeinit.kubeinit_prepare : Confirm that we have ansible host connectivity] **************************************************************************************************
task path: /home/german/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_prepare/tasks/gather_host_facts.yml:21
Thursday 10 February 2022 04:28:00 -0500 (0:00:00.074) 0:00:01.429 *****
Using module file /usr/local/lib/python3.10/site-packages/ansible/modules/ping.py
Pipelining is enabled.
<nyctea> ESTABLISH SSH CONNECTION FOR USER: root
<nyctea> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/id_ed25519' -o 'ControlPath="/home/german/.ansible/cp/19b2ae1270"' nyctea '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
fatal: [localhost -> hypervisor-01]: UNREACHABLE! => {
"changed": false,
"msg": "Data could not be sent to remote host \"nyctea\". Make sure this host can be reached over ssh: Warning: Identity file /home/german/.ssh/id_ed25519 not accessible: No such file or directory.\nBad owner or permissions on /home/german/.ssh/config\r\n",
"unreachable": true
}
PLAY RECAP **********************************************************************************************************************************************************************************
localhost : ok=37 changed=6 unreachable=1 failed=0 skipped=11 rescued=0 ignored=0
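Both messages in that failure point at the client side rather than at nyctea: the ed25519 identity file does not exist and ~/.ssh/config has ownership or permissions ssh refuses to use. A minimal sketch of addressing both, assuming the key simply has not been generated for this user yet:
$ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
$ chown "$USER" ~/.ssh/config
$ chmod 600 ~/.ssh/config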
I reviewed from the beginning, as I'm not able to reproduce this even with the CI installed on the F35 box. I noted you are still using the old mechanism of editing /etc/hosts to add your own host. We updated the docs to use ssh config files, as described in the README, but that was more to do with not needing to change system configuration files.
I've never used the /etc/hosts approach, which was why I added the ssh config support. I hadn't mentioned it before as I know the scripts Carlos uses to create the gitlab runners still use /etc/hosts, so I assumed it was still working. I'm afraid I am grasping at straws, as I have no issues with this in my homelab environment with F35.
Running from the Container failed even faster:
I am so sorry... I've fallen behind on keeping the docs up to date with the code. We switched from volume mounts to podman secrets for ssh keys, so instead of mounting ~/.ssh keys in the container you would do
podman secret create kubeinit_ssh_key ~/.ssh/id_<keytype>
podman run --secret kubeinit_ssh_key ...
that would replace the params
-v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z \
-v ~/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:z \
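To double-check the secret before starting the container, podman 3.x can list and inspect it (the secret contents themselves are not shown):
podman secret ls
podman secret inspect kubeinit_ssh_key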
After creating nyctea as a virtual machine with CentOS8-Stream and creating a podman secret with:
podman secret create kubeinit_ssh_key ~/.ssh/id_rsa
And running:
podman run --rm -it --secret kubeinit_ssh_key -v ~/.ssh/config:/root/.ssh/config.z kubeinit/kubeinit -vvv --user root -e kubeinit_spec=okd-libvirt-3-1-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml
Kubeinit went all the way to:
TASK [kubeinit.kubeinit.kubeinit_libvirt : Wait for changes to propagate] *************************************************************************
task path: /home/kiuser/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_libvirt/tasks/cleanup_libvirt.yml:132
Using module file /home/kiuser/.local/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.
<nyctea> ESTABLISH SSH CONNECTION FOR USER: root
<nyctea> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/id_rsa' -o 'ControlPath="/home/kiuser/.ansible/cp/19b2ae1270"' nyctea '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<nyctea> (1, b'\n{"changed": true, "stdout": "", "stderr": "2022-02-10T21:37:09Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)", "rc": -14, "cmd": ["/usr/bin/ovn-nbctl", "--wait=hv", "--timeout=30", "sync"], "start": "2022-02-10 16:36:39.303190", "end": "2022-02-10 16:37:09.338535", "delta": "0:00:30.035345", "failed": true, "msg": "non-zero return code", "invocation": {"module_args": {"_raw_params": "/usr/bin/ovn-nbctl --wait=hv --timeout=30 sync", "_uses_shell": false, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "executable": null, "creates": null, "removes": null, "stdin": null}}}\n', b'')
<nyctea> Failed to connect to the host via ssh:
fatal: [localhost -> hypervisor-01(nyctea)]: FAILED! => {
"changed": false,
"cmd": [
"/usr/bin/ovn-nbctl",
"--wait=hv",
"--timeout=30",
"sync"
],
"delta": "0:00:30.035345",
"end": "2022-02-10 16:37:09.338535",
"invocation": {
"module_args": {
"_raw_params": "/usr/bin/ovn-nbctl --wait=hv --timeout=30 sync",
"_uses_shell": false,
"argv": null,
"chdir": null,
"creates": null,
"executable": null,
"removes": null,
"stdin": null,
"stdin_add_newline": true,
"strip_empty_ends": true,
"warn": false
}
},
"msg": "non-zero return code",
"rc": -14,
"start": "2022-02-10 16:36:39.303190",
"stderr": "2022-02-10T21:37:09Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)",
"stderr_lines": [
"2022-02-10T21:37:09Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)"
],
"stdout": "",
"stdout_lines": []
}
PLAY RECAP ****************************************************************************************************************************************
hypervisor-01 : ok=17 changed=3 unreachable=0 failed=0 skipped=8 rescued=0 ignored=0
hypervisor-02 : ok=0 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
localhost : ok=173 changed=40 unreachable=0 failed=1 skipped=51 rescued=0 ignored=0
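For what it's worth, ovn-nbctl --wait=hv sync timing out (signal 14 is just the 30-second alarm firing) usually means the hypervisor's ovn-controller never confirmed the northbound changes. A rough way to check, assuming OVN is running from the distro's systemd units and the commands are run where ovn-central lives:
systemctl status ovn-controller
ovn-sbctl show
If ovn-sbctl show does not list the hypervisor as a chassis, the controller is not connected to the southbound database, which would explain the sync timeout.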
@Gl1TcH-1n-Th3-M4tR1x I was able to deploy a cluster from the container built from main using the following command
podman run --rm -ti -e KUBEINIT_COMMON_SSH_KEYTYPE --secret kubeinit_ssh_key kubeinit/kubeinit -e kubeinit_hypervisor_hosts_spec='[[host=hypervisor-01,ssh_hostname=192.168.0.10]]' -vvv --user root -e kubeinit_spec=okd-libvirt-3-1-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml
The replacement for the ssh config volume mount
-v ~/.ssh/config:/root/.ssh/config.z
was
-e kubeinit_hypervisor_hosts_spec='[[host=hypervisor-01,ssh_hostname=192.168.0.10]]'
Apologies again for not keeping up with the docs on this change. When we switched to running kubeinit/kubeinit as a non-root container, volume mounts became unreliable, so they were replaced with a new mechanism.
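For anyone following along: the host spec is, as far as I can tell, standing in for what the ~/.ssh/config mount used to provide, i.e. roughly the equivalent of a stanza like
Host hypervisor-01
    HostName 192.168.0.10
and extending the spec with more [host=...,ssh_hostname=...] entries inside the outer brackets should presumably cover additional hypervisors.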
update-crypto-policies --set DEFAULT:FEDORA32
This fixed the issue on my F35 box, using the standard deployment method without containers.
OK, new scenario: I dedicated a CentOS-Stream box as the hypervisor with 64 cores, 256 GB RAM, and 4 TB HDD, created a VM (on another machine) as the deployment box, and launched both the ansible-playbook and the podman container; both failed at:
TASK [kubeinit.kubeinit.kubeinit_libvirt : Create VM definition for controller-01] **************************************************************************
task path: /home/kiuser/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_libvirt/tasks/deploy_coreos_guest.yml:31
Using module file /home/kiuser/.local/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.
<10.0.0.253> ESTABLISH SSH CONNECTION FOR USER: root
<10.0.0.253> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/okdcluster_id_rsa' -o 'ProxyCommand=ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i ~/.ssh/okdcluster_id_rsa -W %h:%p -q root@nyctea' -o 'ControlPath="/home/kiuser/.ansible/cp/2a631e4199"' 10.0.0.253 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<10.0.0.253> (1, b'\n{"changed": true, "stdout": "", "stderr": "Unable to connect to the server: EOF", "rc": 1, "cmd": "set -o pipefail\\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \\" Ready\\"\\n", "start": "2022-02-14 20:58:05.301347", "end": "2022-02-14 20:58:55.396760", "delta": "0:00:50.095413", "failed": true, "msg": "non-zero return code", "invocation": {"module_args": {"executable": "/bin/bash", "_raw_params": "set -o pipefail\\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \\" Ready\\"\\n", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "creates": null, "removes": null, "stdin": null}}}\n', b"Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.\r\n")
<10.0.0.253> Failed to connect to the host via ssh: Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.
FAILED - RETRYING: [localhost -> service]: Verify that controller nodes are ok (55 retries left).Result was: {
"attempts": 6,
"changed": false,
"cmd": "set -o pipefail\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \" Ready\"\n",
"delta": "0:00:50.095413",
"end": "2022-02-14 20:58:55.396760",
"invocation": {
"module_args": {
"_raw_params": "set -o pipefail\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \" Ready\"\n",
"_uses_shell": true,
"argv": null,
"chdir": null,
"creates": null,
"executable": "/bin/bash",
"removes": null,
"stdin": null,
"stdin_add_newline": true,
"strip_empty_ends": true,
"warn": false
}
},
"msg": "non-zero return code",
"rc": 1,
"retries": 61,
"start": "2022-02-14 20:58:05.301347",
"stderr": "Unable to connect to the server: EOF",
"stderr_lines": [
"Unable to connect to the server: EOF"
],
"stdout": "",
"stdout_lines": []
}
In order to check the connectivity, I logged into the okdcluster-provision container and ran:
export KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes
but oc get nodes does not respond.
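The "Unable to connect to the server: EOF" from oc means the connection to the API endpoint is dropped before a response comes back, which during bootstrap often just means the apiserver is not up yet. A rough way to narrow it down from the same container, assuming oc and curl are available there:
export KUBECONFIG=~/install_dir/auth/kubeconfig
oc whoami --show-server
curl -k "$(oc whoami --show-server)/healthz"
oc whoami --show-server only reads the kubeconfig, so it shows which endpoint is being contacted without needing the cluster to answer.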
update-crypto-policies --set DEFAULT:FEDORA32
This fixed the issue on my F35 box, using the standard deployment method without containers.
Carlos, could you please detail all the steps you used to get the cluster running in FC35?
Hi @Gl1TcH-1n-Th3-M4tR1x, this is what I did (it is what runs in the CI).
If you look at the script, the only 'new' thing is this command: update-crypto-policies --set DEFAULT:FEDORA32
This is an example of a successful CI job running the previous steps: https://storage.googleapis.com/kubeinit-ci/jobs/okd-libvirt-1-1-1-h-periodic-pid-weekly-u/records/1.html
@gmarcy did an amazing job putting these prepare steps in a playbook, but I didn't find time to integrate it into the CI.
After deploying a fresh F35 machine, my steps are:
$ cat /etc/os-release
NAME="Fedora Linux"
VERSION="35 (Thirty Five)"
ID=fedora
VERSION_ID=35
...
$ update-crypto-policies --show
DEFAULT:FEDORA32
$ sudo dnf install -y git podman
...
$ git --version
git version 2.35.1
$ podman --version
podman version 3.4.4
$ export KUBEINIT_COMMON_SSH_KEYTYPE="ed25519"
$ ssh-keygen -t ed25519
...
$ sudo mkdir ~root/.ssh
$ sudo chmod 700 ~root/.ssh
$ sudo cp ~/.ssh/id_ed25519.pub ~root/.ssh/authorized_keys
$ ssh root@<ip-address> python3 -V
Python 3.10.0
$ podman secret create kubeinit_ssh_key ~/.ssh/id_ed25519
...
$ git clone https://github.com/Kubeinit/kubeinit.git
...
$ cd kubeinit
$ podman build -t kubeinit/kubeinit .
...
$ podman run --rm -ti -e KUBEINIT_COMMON_SSH_KEYTYPE --secret kubeinit_ssh_key kubeinit/kubeinit -e hypervisor_hosts_spec='[[host=hypervisor-01,ssh_hostname=<ip-address>]]' -vvv --user root -e kubeinit_spec=okd-libvirt-3-1-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml
...
I would be very interested in knowing if this does not work properly for you since this is the direction we expect future updates to take.
Here is what I did: I created a brand new bare-metal FC35 Workstation with 32 cores, 256 GB RAM and 4 TB HDD, followed your instructions to the letter, and after running the podman run command, it got stuck at exactly the same step as before:
TASK [kubeinit.kubeinit.kubeinit_okd : Verify that controller nodes are ok] *******************************
task path: /home/kiuser/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_okd/tasks/main.yml:41
Using module file /home/kiuser/.local/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.
<10.0.0.253> ESTABLISH SSH CONNECTION FOR USER: root
<10.0.0.253> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/okdcluster_id_ed25519' -o 'ProxyCommand=ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i ~/.ssh/okdcluster_id_ed25519 -W %h:%p -q root@nyctea' -o 'ControlPath="/home/kiuser/.ansible/cp/2a631e4199"' 10.0.0.253 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<10.0.0.253> (1, b'\n{"changed": true, "stdout": "", "stderr": "Unable to connect to the server: EOF", "rc": 1, "cmd": "set -o pipefail\\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \\" Ready\\"\\n", "start": "2022-02-15 21:12:20.071571", "end": "2022-02-15 21:13:10.182005", "delta": "0:00:50.110434", "failed": true, "msg": "non-zero return code", "invocation": {"module_args": {"executable": "/bin/bash", "_raw_params": "set -o pipefail\\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \\" Ready\\"\\n", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "creates": null, "removes": null, "stdin": null}}}\n', b"Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.\r\n")
<10.0.0.253> Failed to connect to the host via ssh: Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.
FAILED - RETRYING: [localhost -> service]: Verify that controller nodes are ok (60 retries left).Result was: {
"attempts": 1,
"changed": false,
"cmd": "set -o pipefail\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \" Ready\"\n",
"delta": "0:00:50.110434",
"end": "2022-02-15 21:13:10.182005",
"invocation": {
"module_args": {
"_raw_params": "set -o pipefail\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \" Ready\"\n",
"_uses_shell": true,
"argv": null,
"chdir": null,
"creates": null,
"executable": "/bin/bash",
"removes": null,
"stdin": null,
"stdin_add_newline": true,
"strip_empty_ends": true,
"warn": false
}
},
"msg": "non-zero return code",
"rc": 1,
"retries": 61,
"start": "2022-02-15 21:12:20.071571",
"stderr": "Unable to connect to the server: EOF",
"stderr_lines": [
"Unable to connect to the server: EOF"
],
"stdout": "",
"stdout_lines": []
}
Created a brand new Bare-Metal FC35 Workstation
Server or Workstation?
I'm booting from
Fedora-Server-dvd-x86_64-35-1.2.iso
Minimal package install - anaconda-ks.cfg has
%packages
@^custom-environment
@standard
%end
Just trying to understand where the differences are coming from.
My output for the same task is identical to yours, but my response is
<10.0.0.253> (1, b'\n{"changed": true, "stdout": "", "stderr": "No resources found", "rc": 1, "cmd": "set -o pipefail\\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \\" Ready\\"\\n", "start": "2022-02-15 18:39:41.902631", "end": "2022-02-15 18:39:42.249062", "delta": "0:00:00.346431", "failed": true, "msg": "non-zero return code", "invocation": {"module_args": {"executable": "/bin/bash", "_raw_params": "set -o pipefail\\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \\" Ready\\"\\n", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "creates": null, "removes": null, "stdin": null}}}\n', b"Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.\r\n")
<10.0.0.253> Failed to connect to the host via ssh: Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.
FAILED - RETRYING: [localhost -> service]: Verify that controller nodes are ok (60 retries left).Result was: {
I'm going to try installing from Fedora-Workstation-Live-x86_64-35-1.2.iso to see if I can reproduce your failure.
I'm using Fedora-Workstation-Live-x86_64-35-1.2.iso
Not sure, but it looks like the CoreOS VMs are not getting IP addresses:
virsh # list
Id Name State
------------------------------------------
6 okdcluster-bootstrap running
8 okdcluster-controller-01 running
10 okdcluster-controller-02 running
12 okdcluster-controller-03 running
virsh # domiflist okdcluster-bootstrap
Interface Type Source Model MAC
-------------------------------------------------------------------
veth0-0a000005 bridge kimgtnet0 virtio 52:54:00:c0:6f:53
virsh # domiflist okdcluster-controller-01
Interface Type Source Model MAC
-------------------------------------------------------------------
veth0-0a000001 bridge kimgtnet0 virtio 52:54:00:35:a0:d6
virsh # domiflist okdcluster-controller-02
Interface Type Source Model MAC
-------------------------------------------------------------------
veth0-0a000002 bridge kimgtnet0 virtio 52:54:00:4b:00:a8
virsh # domiflist okdcluster-controller-03
Interface Type Source Model MAC
-------------------------------------------------------------------
veth0-0a000003 bridge kimgtnet0 virtio 52:54:00:9b:04:11
virsh # domifaddr okdcluster-bootstrap
Name MAC address Protocol Address
-------------------------------------------------------------------------------
virsh # domifaddr okdcluster-controller-01
Name MAC address Protocol Address
-------------------------------------------------------------------------------
virsh # domifaddr okdcluster-controller-02
Name MAC address Protocol Address
-------------------------------------------------------------------------------
virsh # domifaddr okdcluster-controller-03
Name MAC address Protocol Address
-------------------------------------------------------------------------------
virsh #
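One caveat with virsh domifaddr: the default source is the libvirt network's DHCP lease file, which will always be empty for guests attached to an OVS/OVN bridge like kimgtnet0, so the empty tables above do not by themselves prove the guests have no addresses. Querying via ARP (or the guest agent, if one is installed) may be more telling:
virsh domifaddr okdcluster-controller-01 --source arp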
Still trying to get anything working on F35 Workstation; it behaves like a different operating system than F35 Server. I'm noticing in particular that libvirtd is often inactive even when there are virtual machines running, so we don't always clean up old virtual machines.
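That inactive libvirtd is likely socket activation rather than a crash: on recent Fedora the daemon is started on demand and exits after an idle timeout while the guests keep running. A quick check, as a sketch:
systemctl status libvirtd.service libvirtd.socket
If the sockets are active while the service shows inactive (dead), that is the idle timeout (libvirtd's --timeout option) rather than a failure.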
FYI, this is my kickstart output in case anything different jumps out to you
# Generated by Anaconda 35.22.2
# Generated by pykickstart v3.34
#version=DEVEL
# Use graphical install
graphical
# Keyboard layouts
keyboard --vckeymap=us --xlayouts='us'
# System language
lang en_US.UTF-8
%packages
@^workstation-product-environment
%end
# Run the Setup Agent on first boot
firstboot --enable
# Generated using Blivet version 3.4.2
ignoredisk --only-use=nvme0n1
autopart
# Partition clearing information
clearpart --none --initlabel
# System timezone
timezone America/New_York --utc
#Root password
rootpw --lock
Where do the VMs get their IP addresses from? When created, the VMs are configured to use DHCP and get connected to the br-int bridge.
IIRC, it's a combination of things...
# ovn-nbctl show
switch 7c611c5c-be12-4382-a6a1-95da0b35882a (sw-okdcluster)
port sw-okdcluster-lr0
type: router
router-port: lr0-sw-okdcluster
port bc2788b6-44ac-5cc8-b682-2e136f346347
addresses: ["52:54:00:11:9f:95 10.0.0.253"]
port 9a1cf3eb-91a4-568e-8a83-a9571341c0d7
addresses: ["52:54:00:41:8d:6d 10.0.0.3"]
port 2468be6a-70a0-5172-bae3-3ee42af2b4a6
addresses: ["52:54:00:cb:2c:dd 10.0.0.2"]
port 011de5eb-faba-5cf3-9765-3781278fb400
addresses: ["52:54:00:92:77:d0 10.0.0.1"]
and
# ovs-vsctl list interface veth0-0a000001
...
external_ids : {attached-mac="52:54:00:92:77:d0", iface-id="011de5eb-faba-5cf3-9765-3781278fb400", iface-status=active, ovn-installed="true", ovn-installed-ts="1645218538948", vm-id="800814ee-19d5-4ea6-a291-028443ffe8ea"}
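A neutral way to see whether the guests are actually sending DHCP requests and getting replies, assuming tcpdump is installed on the hypervisor, is to watch one of the veth ports listed above while a controller boots:
tcpdump -ni veth0-0a000001 port 67 or port 68
Requests with no replies would point at the DHCP service on the cluster network; no traffic at all would point back at the guest or the OVN port wiring.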
I switched to CentOS8-Stream for the hypervisor and deployed with the quay.io image method.
Describe the bug
After cloning the kubeinit repository, and following all the steps in the README file, the deployment fails during:
Error:
To Reproduce
Steps to reproduce the behavior:
192.168.0.10 Server-TR nyctea
Last login: Tue Feb 8 15:30:34 2022 from 192.168.0.10
[root@Server-TR ~]#
Error:
Expected behavior
An OKD cluster deployed with 3 Control and 1 Compute Nodes.
Infrastructure
Deployment command
Inventory file diff
Additional context
Deploying the cluster to my local Server-TR Fedora 35 machine, which is also acting as Hypervisor-01.