Kubeinit / kubeinit

Ansible automation to have a KUBErnetes cluster INITialized as soon as possible...
https://www.kubeinit.org
Apache License 2.0

kubeinit Deployment in Fedora35 failing to complete #593

Closed Gl1TcH-1n-Th3-M4tR1x closed 2 years ago

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

Describe the bug
After cloning the kubeinit repository and following all the steps in the README file, the deployment fails during:

TASK [kubeinit.kubeinit.kubeinit_services : Wait for connection to "okdcluster-credentials" container] **************************************************************************************
task path: /home/german/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_services/tasks/prepare_credentials.yml:101
Tuesday 08 February 2022  15:53:41 -0500 (0:00:00.065)       0:02:15.112 ****** 

Error:

fatal: [localhost -> okdcluster-credentials]: FAILED! => {
    "changed": false,
    "elapsed": 305,
    "msg": "timed out waiting for ping module test: Failed to create temporary directory.In some cases, you may have been able to authenticate and did not have permissions on the target directory. Consider changing the remote tmp path in ansible.cfg to a path rooted in \"/tmp\", for more error information use -vvv. Failed command was: ( umask 77 && mkdir -p \"` echo /tmp `\"&& mkdir \"` echo /tmp/ansible-tmp-1644353921.9780402-2437089-121246699102612 `\" && echo ansible-tmp-1644353921.9780402-2437089-121246699102612=\"` echo /tmp/ansible-tmp-1644353921.9780402-2437089-121246699102612 `\" ), exited with result 125"

To Reproduce
Steps to reproduce the behavior:

1. Clone '...'
2. Create an alias for nyctea in /etc/hosts:

```
cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.0.10 Server-TR nyctea
```

3. Deploy ssh keys to nyctea:

```bash
# Generate a key pair if one does not exist yet, then copy the public key.
if [ ! -f ~/.ssh/id_rsa ]; then
  ssh-keygen
fi
ssh-copy-id -i ~/.ssh/id_rsa.pub root@nyctea
```
4. Verified root access to nyctea with no password:

```
ssh root@nyctea
Activate the web console with: systemctl enable --now cockpit.socket

Last login: Tue Feb 8 15:30:34 2022 from 192.168.0.10
[root@Server-TR ~]#
```

5. Install requirements:

```bash
# Install the requirements assuming python3/pip3 is installed
pip3 install \
        --upgrade \
        pip \
        shyaml \
        ansible \
        netaddr

cd kubeinit

# Install the Ansible collection requirements
ansible-galaxy collection install --force --requirements-file kubeinit/requirements.yml

# Build and install the collection
rm -rf ~/.ansible/collections/ansible_collections/kubeinit/kubeinit
ansible-galaxy collection build kubeinit --verbose --force --output-path releases/
ansible-galaxy collection install --force --force-with-deps releases/kubeinit-kubeinit-`cat kubeinit/galaxy.yml | shyaml get-value version`.tar.gz
```
6. Run with these variables '...':

```bash
ansible-playbook \
    -v --user root \
    -e kubeinit_spec=okd-libvirt-3-1-1 \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml
```
7. See error:

```
TASK [kubeinit.kubeinit.kubeinit_services : Wait for connection to "okdcluster-credentials" container] **************************************************************************************
task path: /home/german/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_services/tasks/prepare_credentials.yml:101
Tuesday 08 February 2022  15:53:41 -0500 (0:00:00.065)       0:02:15.112 ******

Error:

fatal: [localhost -> okdcluster-credentials]: FAILED! => {
    "changed": false,
    "elapsed": 305,
    "msg": "timed out waiting for ping module test: Failed to create temporary directory.In some cases, you may have been able to authenticate and did not have permissions on the target directory. Consider changing the remote tmp path in ansible.cfg to a path rooted in \"/tmp\", for more error information use -vvv. Failed command was: ( umask 77 && mkdir -p \"` echo /tmp `\"&& mkdir \"` echo /tmp/ansible-tmp-1644353921.9780402-2437089-121246699102612 `\" && echo ansible-tmp-1644353921.9780402-2437089-121246699102612=\"` echo /tmp/ansible-tmp-1644353921.9780402-2437089-121246699102612 `\" ), exited with result 125"
```

Expected behavior
An OKD cluster deployed with 3 controller and 1 compute nodes.


Infrastructure

Deployment command

ansible-playbook -vvv --user root \
    -e kubeinit_spec=okd-libvirt-3-1-1 \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml

Inventory file diff

Run the following command:

diff \
    <(curl https://raw.githubusercontent.com/Kubeinit/kubeinit/main/kubeinit/hosts/k8s/inventory) \
    <(curl https://raw.githubusercontent.com/Kubeinit/kubeinit/main/kubeinit/hosts/okd/inventory)

And paste the output:

here

Additional context

Deploying the cluster to my local Server-TR (Fedora 35), which is also acting as hypervisor-01.

ccamacho commented 2 years ago

I was able to reproduce this in the CI. I was only able to provision the hosts with F35, so the issue reproduces out of the box.

[root@nyctea ~]# podman ps
CONTAINER ID  IMAGE                                             COMMAND               CREATED        STATUS            PORTS                    NAMES
12e9c53c1d0b  k8s.gcr.io/pause:3.5                                                    6 minutes ago  Up 6 minutes ago  0.0.0.0:26973->8000/tcp  f00f36a649b7-infra
964c5f7879e7  docker.io/recordsansible/ara-api:latest           bash -c /usr/loca...  6 minutes ago  Up 6 minutes ago  0.0.0.0:26973->8000/tcp  api-server
d2718ae04ac9  k8s.gcr.io/pause:3.5                                                    4 minutes ago  Up 4 minutes ago                           ef1044587dd6-infra
af205bdd8a1a  localhost/kubeinit/k8scluster-credentials:latest  sleep infinity        3 minutes ago  Up 3 minutes ago                           k8scluster-credentials
[root@nyctea ~]# podman pod list
POD ID        NAME                    STATUS      CREATED        INFRA ID      # OF CONTAINERS
ef1044587dd6  k8scluster-service-pod  Running     4 minutes ago  d2718ae04ac9  2
f00f36a649b7  ara-pod                 Running     6 minutes ago  12e9c53c1d0b  2
[root@nyctea ~]# podman port -a
12e9c53c1d0b    8000/tcp -> 0.0.0.0:26973
964c5f7879e7    8000/tcp -> 0.0.0.0:26973
[root@nyctea ~]# podman ps -a --pod
CONTAINER ID  IMAGE                                             COMMAND               CREATED        STATUS            PORTS                    NAMES                   POD ID        PODNAME
12e9c53c1d0b  k8s.gcr.io/pause:3.5                                                    7 minutes ago  Up 7 minutes ago  0.0.0.0:26973->8000/tcp  f00f36a649b7-infra      f00f36a649b7  ara-pod
964c5f7879e7  docker.io/recordsansible/ara-api:latest           bash -c /usr/loca...  7 minutes ago  Up 7 minutes ago  0.0.0.0:26973->8000/tcp  api-server              f00f36a649b7  ara-pod
d2718ae04ac9  k8s.gcr.io/pause:3.5                                                    4 minutes ago  Up 4 minutes ago                           ef1044587dd6-infra      ef1044587dd6  k8scluster-service-pod
af205bdd8a1a  localhost/kubeinit/k8scluster-credentials:latest  sleep infinity        3 minutes ago  Up 3 minutes ago                           k8scluster-credentials  ef1044587dd6  k8scluster-service-pod
ccamacho commented 2 years ago

@Gl1TcH-1n-Th3-M4tR1x @gmarcy I'm not sure what is wrong here; I'll try to test F34 to see if there is something different there.

gmarcy commented 2 years ago

@Gl1TcH-1n-Th3-M4tR1x could you also tell us what the ansible-playbook client-side environment is like? I can do the git clone on the same F35 host that I'm deploying to, similar to how the GitHub Actions run, but I would like to match your setup. Another thought would be to do the clone, run the container build, and run the install from the container, just to rule out any issues in the host running the playbook.

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

The Ansible playbook client is the same Fedora 35 hypervisor-01; basically I run the ansible-playbook from the hypervisor, and all I have is an alias for nyctea on my hypervisor-01. Let me try to run the build from the container.

gmarcy commented 2 years ago

@Gl1TcH-1n-Th3-M4tR1x are you running the ansible-playbook under root or a non-root user?

gmarcy commented 2 years ago

@Gl1TcH-1n-Th3-M4tR1x found this in my fedora kickstart, wondering if you updated fedora to accept RSA keys?

update-crypto-policies --set DEFAULT:FEDORA32

I had forgotten about this as I had switched from rsa to ed25519 when I added support for the KUBEINIT_COMMON_SSH_KEYTYPE environment variable.

$ export | grep SSH_KEYTYPE
declare -x KUBEINIT_COMMON_SSH_KEYTYPE="ed25519"

More info here
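
For reference, a minimal sketch of the two options discussed here, both standard Fedora/OpenSSH tooling; the nyctea alias is the one used earlier in this thread:

```bash
# Check the active crypto policy (F35's DEFAULT rejects legacy ssh-rsa signatures).
update-crypto-policies --show

# Option A: relax the policy, as in the kickstart snippet above.
sudo update-crypto-policies --set DEFAULT:FEDORA32

# Option B: switch to an ed25519 key and tell kubeinit about it.
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ''
ssh-copy-id -i ~/.ssh/id_ed25519.pub root@nyctea
export KUBEINIT_COMMON_SSH_KEYTYPE="ed25519"
```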

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

@Gl1TcH-1n-Th3-M4tR1x are you running the ansible-playbook under root or a non-root user?

I'm running the playbook with a non-root user.

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

Running from the Container failed even faster:

TASK [kubeinit.kubeinit.kubeinit_prepare : Put secret values into a dictionary] *************************************************************************************************************
task path: /home/kiuser/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_prepare/tasks/gather_kubeinit_secrets.yml:39
skipping: [localhost] => (item=None)  => {
    "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
    "changed": false
}
skipping: [localhost] => (item=None)  => {
    "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
    "changed": false
}
skipping: [localhost] => (item=None)  => {
    "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
    "changed": false
}
skipping: [localhost] => (item=None)  => {
    "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
    "changed": false
}
skipping: [localhost] => (item=None)  => {
    "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
    "changed": false
}
skipping: [localhost] => (item=None)  => {
    "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
    "changed": false
}
skipping: [localhost] => (item=None)  => {
    "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
    "changed": false
}
skipping: [localhost] => {
    "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
    "changed": false
}

TASK [kubeinit.kubeinit.kubeinit_prepare : Add secrets to kubeinit secrets] *****************************************************************************************************************
task path: /home/kiuser/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_prepare/tasks/gather_kubeinit_secrets.yml:49
fatal: [localhost]: FAILED! => {
    "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"
}

NO MORE HOSTS LEFT **************************************************************************************************************************************************************************

PLAY RECAP **********************************************************************************************************************************************************************************
localhost                  : ok=6    changed=1    unreachable=0    failed=1    skipped=3    rescued=0    ignored=0   
Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

@Gl1TcH-1n-Th3-M4tR1x found this in my fedora kickstart, wondering if you updated fedora to accept RSA keys?

update-crypto-policies --set DEFAULT:FEDORA32

I had forgotten about this as I had switched from rsa to ed25519 when I added support for the KUBEINIT_COMMON_SSH_KEYTYPE environment variable.

$ export | grep SSH_KEYTYPE
declare -x KUBEINIT_COMMON_SSH_KEYTYPE="ed25519"

More info here

I updated the Fedora 35 crypto policies, set KUBEINIT_COMMON_SSH_KEYTYPE to ed25519, and got:

TASK [kubeinit.kubeinit.kubeinit_prepare : Confirm that we have ansible host connectivity] **************************************************************************************************
task path: /home/german/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_prepare/tasks/gather_host_facts.yml:21
Thursday 10 February 2022  04:28:00 -0500 (0:00:00.074)       0:00:01.429 ***** 
Using module file /usr/local/lib/python3.10/site-packages/ansible/modules/ping.py
Pipelining is enabled.
<nyctea> ESTABLISH SSH CONNECTION FOR USER: root
<nyctea> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/id_ed25519' -o 'ControlPath="/home/german/.ansible/cp/19b2ae1270"' nyctea '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
fatal: [localhost -> hypervisor-01]: UNREACHABLE! => {
    "changed": false,
    "msg": "Data could not be sent to remote host \"nyctea\". Make sure this host can be reached over ssh: Warning: Identity file /home/german/.ssh/id_ed25519 not accessible: No such file or directory.\nBad owner or permissions on /home/german/.ssh/config\r\n",
    "unreachable": true
}

PLAY RECAP **********************************************************************************************************************************************************************************
localhost                  : ok=37   changed=6    unreachable=1    failed=0    skipped=11   rescued=0    ignored=0   
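
The two messages in that failure point at the client side rather than at kubeinit: the ed25519 identity file does not exist yet, and ~/.ssh/config has ownership/permissions OpenSSH refuses to accept. A minimal sketch of the corresponding fixes, using the paths from the error output:

```bash
# Create the ed25519 key the playbook is now configured to use,
# and authorize it on the hypervisor alias used in this thread.
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ''
ssh-copy-id -i ~/.ssh/id_ed25519.pub root@nyctea

# "Bad owner or permissions on ~/.ssh/config": the file must be owned
# by the user and not writable by group/others.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/config
```
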
gmarcy commented 2 years ago

I reviewed from the beginning, as I'm not able to reproduce this, even with the CI installed on the F35 box. I noted you are still using the old edit-/etc/hosts mechanism for adding your own host. We updated the docs to use ssh config files, as described here in the README, but that was more to do with not needing to change system configuration files.

I've never used the /etc/hosts approach, which is why I added the ssh config support. I hadn't mentioned it before because I know the scripts Carlos uses to create the GitLab runners still use /etc/hosts, so I assumed it was still working. I'm afraid I am grasping at straws, as I have no issues with this in my homelab environment with F35.
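
For anyone following along, the ssh config alternative to editing /etc/hosts is just a host alias entry; a minimal sketch using the address and user from earlier in this thread:

```bash
# Equivalent of the /etc/hosts alias, kept in the user's ssh config.
cat >> ~/.ssh/config <<'EOF'
Host nyctea
    HostName 192.168.0.10
    User root
EOF
chmod 600 ~/.ssh/config
```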

gmarcy commented 2 years ago

Running from the Container failed even faster:

I am so sorry... I've fallen behind on keeping the docs up to date with the code. We switched from volume mounts to podman secrets for ssh keys, so instead of mounting ~/.ssh keys in the container you would do

podman secret create kubeinit_ssh_key ~/.ssh/id_<keytype>
podman run --secret kubeinit_ssh_key ...

that would replace the params

    -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z \
    -v ~/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:z \
Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

Running from the Container failed even faster:

I am so sorry... I've fallen behind on keeping the docs up to date with the code. We switched from volume mounts to podman secrets for ssh keys, so instead of mounting ~/.ssh keys in the container you would do

podman secret create kubeinit_ssh_key ~/.ssh/id_<keytype>
podman run --secret kubeinit_ssh_key ...

that would replace the params

    -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z \
    -v ~/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:z \

After creating nyctea as a virtual machine with CentOS 8 Stream and a podman secret with:

podman secret create kubeinit_ssh_key ~/.ssh/id_rsa

And running:

podman run --rm -it --secret kubeinit_ssh_key -v ~/.ssh/config:/root/.ssh/config.z kubeinit/kubeinit -vvv --user root -e kubeinit_spec=okd-libvirt-3-1-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml

Kubeinit went all the way to:

TASK [kubeinit.kubeinit.kubeinit_libvirt : Wait for changes to propagate] *************************************************************************
task path: /home/kiuser/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_libvirt/tasks/cleanup_libvirt.yml:132
Using module file /home/kiuser/.local/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.
<nyctea> ESTABLISH SSH CONNECTION FOR USER: root
<nyctea> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/id_rsa' -o 'ControlPath="/home/kiuser/.ansible/cp/19b2ae1270"' nyctea '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<nyctea> (1, b'\n{"changed": true, "stdout": "", "stderr": "2022-02-10T21:37:09Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)", "rc": -14, "cmd": ["/usr/bin/ovn-nbctl", "--wait=hv", "--timeout=30", "sync"], "start": "2022-02-10 16:36:39.303190", "end": "2022-02-10 16:37:09.338535", "delta": "0:00:30.035345", "failed": true, "msg": "non-zero return code", "invocation": {"module_args": {"_raw_params": "/usr/bin/ovn-nbctl --wait=hv --timeout=30 sync", "_uses_shell": false, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "executable": null, "creates": null, "removes": null, "stdin": null}}}\n', b'')
<nyctea> Failed to connect to the host via ssh: 
fatal: [localhost -> hypervisor-01(nyctea)]: FAILED! => {
    "changed": false,
    "cmd": [
        "/usr/bin/ovn-nbctl",
        "--wait=hv",
        "--timeout=30",
        "sync"
    ],
    "delta": "0:00:30.035345",
    "end": "2022-02-10 16:37:09.338535",
    "invocation": {
        "module_args": {
            "_raw_params": "/usr/bin/ovn-nbctl --wait=hv --timeout=30 sync",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": false
        }
    },
    "msg": "non-zero return code",
    "rc": -14,
    "start": "2022-02-10 16:36:39.303190",
    "stderr": "2022-02-10T21:37:09Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)",
    "stderr_lines": [
        "2022-02-10T21:37:09Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)"
    ],
    "stdout": "",
    "stdout_lines": []
}

PLAY RECAP ****************************************************************************************************************************************
hypervisor-01              : ok=17   changed=3    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0   
hypervisor-02              : ok=0    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
localhost                  : ok=173  changed=40   unreachable=0    failed=1    skipped=51   rescued=0    ignored=0   
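
A note on the `ovn-nbctl --wait=hv --timeout=30 sync` timeout: `--wait=hv` only returns once the hypervisor's ovn-controller has caught up, so a hang here usually means the chassis never registered. A hedged check to run on the hypervisor before retrying (service names assume the stock Open vSwitch/OVN packages kubeinit installs):

```bash
# Are the Open vSwitch / OVN daemons running on the hypervisor?
systemctl status openvswitch ovn-northd ovn-controller --no-pager

# Is this host registered as a chassis in the OVN southbound DB?
# An empty chassis list would explain --wait=hv never completing.
ovn-sbctl show

# Compare with the logical switches/ports kubeinit created.
ovn-nbctl show
```
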
gmarcy commented 2 years ago

@Gl1TcH-1n-Th3-M4tR1x I was able to deploy a cluster from the container built from main using the following command

podman run --rm -ti -e KUBEINIT_COMMON_SSH_KEYTYPE --secret kubeinit_ssh_key kubeinit/kubeinit -e kubeinit_hypervisor_hosts_spec='[[host=hypervisor-01,ssh_hostname=192.168.0.10]]' -vvv --user root -e kubeinit_spec=okd-libvirt-3-1-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml

The replacement for the ssh config volume mount

-v ~/.ssh/config:/root/.ssh/config.z

was

-e kubeinit_hypervisor_hosts_spec='[[host=hypervisor-01,ssh_hostname=192.168.0.10]]'

Apologies again for not keeping up with the docs on this change. When we switched to running kubeinit/kubeinit as a non-root container it made volume mounts unreliable and so they were replaced with a new mechanism.

ccamacho commented 2 years ago

update-crypto-policies --set DEFAULT:FEDORA32

This fixed the issue on my F35 box, using the standard deployment method without containers.

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

OK, new scenario: I dedicated a CentOS Stream box as the hypervisor (64 cores, 256 GB RAM, 4 TB HDD), created a VM on another machine as the deployment box, and launched both the ansible-playbook and the podman container deployments. Both failed at:

TASK [kubeinit.kubeinit.kubeinit_libvirt : Create VM definition for controller-01] **************************************************************************
task path: /home/kiuser/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_libvirt/tasks/deploy_coreos_guest.yml:31
Using module file /home/kiuser/.local/lib/python3.9/site-packages/ansible/modules/command.py
Using module file /home/kiuser/.local/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.
<10.0.0.253> ESTABLISH SSH CONNECTION FOR USER: root
<10.0.0.253> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthention=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/okdcluster_id_rsa' -o 'ProxyCommanssh/okdcluster_id_rsa -W %h:%p -q root@nyctea' -o 'ControlPath="/home/kiuser/.ansible/cp/2a631e4199"' 10.0.0.253 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0
<10.0.0.253> (1, b'\n{"changed": true, "stdout": "", "stderr": "Unable to connect to the server: EOF", "rc": 1, "cmd": "set -o pipefail\\nexport KUBECONFIG=~, "start": "2022-02-14 20:58:05.301347", "end": "2022-02-14 20:58:55.396760", "delta": "0:00:50.095413", "failed": true, "msg": "non-zero return code", "invofail\\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \\" Ready\\"\\n", "_uses_shell": true, "warn": false, "stdin_add_nenull, "removes": null, "stdin": null}}}\n', b"Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.\r\n")
<10.0.0.253> Failed to connect to the host via ssh: Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.
FAILED - RETRYING: [localhost -> service]: Verify that controller nodes are ok (55 retries left).Result was: {
    "attempts": 6,
    "changed": false,
    "cmd": "set -o pipefail\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \" Ready\"\n",
    "delta": "0:00:50.095413",
    "end": "2022-02-14 20:58:55.396760",
    "invocation": {
        "module_args": {
            "_raw_params": "set -o pipefail\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \" Ready\"\n",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": "/bin/bash",
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": false
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "retries": 61,
    "start": "2022-02-14 20:58:05.301347",
    "stderr": "Unable to connect to the server: EOF",
    "stderr_lines": [
        "Unable to connect to the server: EOF"
    ],
    "stdout": "",
    "stdout_lines": []
}

To check the connectivity, I logged into the okdcluster-provision container and ran:

export KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes

oc get nodes does not respond
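
When `oc get nodes` just hangs, standard kubectl/oc flags can at least surface what it is waiting on; a small sketch from inside the provision container:

```bash
# Fail fast instead of hanging, and log the API requests being attempted.
export KUBECONFIG=~/install_dir/auth/kubeconfig
oc get nodes --request-timeout=10s -v=6
```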

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

update-crypto-policies --set DEFAULT:FEDORA32

This fixed the issue in my F35 box, from the standard deployment methods without containers.

Carlos, could you please detail all the steps you used to get the cluster running on FC35?

ccamacho commented 2 years ago

Hi @Gl1TcH-1n-Th3-M4tR1x this is what I did (what is running in the CI).

  1. Deploy a fresh F35 machine.
  2. Install the dependencies using the install node script.
  3. Run the deployment from what is in the readme (I didn't do any other additional change).

If you look at the script, the only 'new' thing is the command update-crypto-policies --set DEFAULT:FEDORA32.

This is an example of a successful CI job running the previous steps: https://storage.googleapis.com/kubeinit-ci/jobs/okd-libvirt-1-1-1-h-periodic-pid-weekly-u/records/1.html

@gmarcy did an amazing job putting these prepare steps in a playbook, but I didn't find time to integrate it into the CI.

gmarcy commented 2 years ago

After "Deploy a fresh F35 machine", my steps are:

```
$ cat /etc/os-release
NAME="Fedora Linux"
VERSION="35 (Thirty Five)"
ID=fedora
VERSION_ID=35
...
$ update-crypto-policies --show
DEFAULT:FEDORA32
$ sudo dnf install -y git podman
...
$ git --version
git version 2.35.1
$ podman --version
podman version 3.4.4
$ export KUBEINIT_COMMON_SSH_KEYTYPE="ed25519"
$ ssh-keygen -t ed25519
...
$ sudo mkdir ~root/.ssh
$ sudo chmod 700 ~root/.ssh
$ sudo cp ~/.ssh/id_ed25519.pub ~root/.ssh/authorized_keys
$ ssh root@<ip-address> python3 -V
Python 3.10.0
$ podman secret create kubeinit_ssh_key ~/.ssh/id_ed25519
...
$ git clone https://github.com/Kubeinit/kubeinit.git
...
$ cd kubeinit
$ podman build -t kubeinit/kubeinit .
...
$ podman run --rm -ti -e KUBEINIT_COMMON_SSH_KEYTYPE --secret kubeinit_ssh_key kubeinit/kubeinit -e hypervisor_hosts_spec='[[host=hypervisor-01,ssh_hostname=<ip-address>]]' -vvv --user root -e kubeinit_spec=okd-libvirt-3-1-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml
...
```

I would be very interested in knowing if this does not work properly for you since this is the direction we expect future updates to take.

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

After "Deploy a fresh F35 machine" my steps are: [...]

I would be very interested in knowing if this does not work properly for you since this is the direction we expect future updates to take.

Here is what I did: created a brand new bare-metal FC35 Workstation with 32 cores, 256 GB RAM and 4 TB HDD, followed your instructions to the letter, and after running the podman run command it got stuck at exactly the same step as before:

TASK [kubeinit.kubeinit.kubeinit_okd : Verify that controller nodes are ok] *******************************
task path: /home/kiuser/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_okd/tasks/main.yml:41
Using module file /home/kiuser/.local/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.
<10.0.0.253> ESTABLISH SSH CONNECTION FOR USER: root
<10.0.0.253> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/okdcluster_id_ed25519' -o 'ProxyCommand=ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i ~/.ssh/okdcluster_id_ed25519 -W %h:%p -q root@nyctea' -o 'ControlPath="/home/kiuser/.ansible/cp/2a631e4199"' 10.0.0.253 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<10.0.0.253> (1, b'\n{"changed": true, "stdout": "", "stderr": "Unable to connect to the server: EOF", "rc": 1, "cmd": "set -o pipefail\\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \\" Ready\\"\\n", "start": "2022-02-15 21:12:20.071571", "end": "2022-02-15 21:13:10.182005", "delta": "0:00:50.110434", "failed": true, "msg": "non-zero return code", "invocation": {"module_args": {"executable": "/bin/bash", "_raw_params": "set -o pipefail\\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \\" Ready\\"\\n", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "creates": null, "removes": null, "stdin": null}}}\n', b"Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.\r\n")
<10.0.0.253> Failed to connect to the host via ssh: Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.
FAILED - RETRYING: [localhost -> service]: Verify that controller nodes are ok (60 retries left).Result was: {
    "attempts": 1,
    "changed": false,
    "cmd": "set -o pipefail\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \" Ready\"\n",
    "delta": "0:00:50.110434",
    "end": "2022-02-15 21:13:10.182005",
    "invocation": {
        "module_args": {
            "_raw_params": "set -o pipefail\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \" Ready\"\n",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": "/bin/bash",
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": false
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "retries": 61,
    "start": "2022-02-15 21:12:20.071571",
    "stderr": "Unable to connect to the server: EOF",
    "stderr_lines": [
        "Unable to connect to the server: EOF"
    ],
    "stdout": "",
    "stdout_lines": []
}
gmarcy commented 2 years ago

Created a brand new Bare-Metal FC35 Workstation

Server or Workstation?

I'm booting from

Fedora-Server-dvd-x86_64-35-1.2.iso

Minimal package install - anaconda-ks.cfg has

%packages
@^custom-environment
@standard

%end

Just trying to understand where the differences are coming from.

My output for the same task is identical to yours, but my response is

<10.0.0.253> (1, b'\n{"changed": true, "stdout": "", "stderr": "No resources found", "rc": 1, "cmd": "set -o pipefail\\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \\" Ready\\"\\n", "start": "2022-02-15 18:39:41.902631", "end": "2022-02-15 18:39:42.249062", "delta": "0:00:00.346431", "failed": true, "msg": "non-zero return code", "invocation": {"module_args": {"executable": "/bin/bash", "_raw_params": "set -o pipefail\\nexport KUBECONFIG=~/install_dir/auth/kubeconfig; oc get nodes | grep master | grep \\" Ready\\"\\n", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "creates": null, "removes": null, "stdin": null}}}\n', b"Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.\r\n")
<10.0.0.253> Failed to connect to the host via ssh: Warning: Permanently added '10.0.0.253' (ECDSA) to the list of known hosts.
FAILED - RETRYING: [localhost -> service]: Verify that controller nodes are ok (60 retries left).Result was: {

I'm going to try installing from Fedora-Workstation-Live-x86_64-35-1.2.iso to see if I can reproduce your failure.

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

I'm using Fedora-Workstation-Live-x86_64-35-1.2.iso

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

Not sure, but it looks like the CoreOS VMs are not getting IP addresses:

virsh # list
 Id   Name                       State
------------------------------------------
 6    okdcluster-bootstrap       running
 8    okdcluster-controller-01   running
 10   okdcluster-controller-02   running
 12   okdcluster-controller-03   running

virsh # domiflist okdcluster-bootstrap
 Interface        Type     Source      Model    MAC
-------------------------------------------------------------------
 veth0-0a000005   bridge   kimgtnet0   virtio   52:54:00:c0:6f:53

 virsh # domiflist okdcluster-controller-01
 Interface        Type     Source      Model    MAC
-------------------------------------------------------------------
 veth0-0a000001   bridge   kimgtnet0   virtio   52:54:00:35:a0:d6

virsh # domiflist okdcluster-controller-02
 Interface        Type     Source      Model    MAC
-------------------------------------------------------------------
 veth0-0a000002   bridge   kimgtnet0   virtio   52:54:00:4b:00:a8

virsh # domiflist okdcluster-controller-03
 Interface        Type     Source      Model    MAC
-------------------------------------------------------------------
 veth0-0a000003   bridge   kimgtnet0   virtio   52:54:00:9b:04:11

virsh # domifaddr okdcluster-bootstrap
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------

virsh # domifaddr okdcluster-controller-01
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------

virsh # domifaddr okdcluster-controller-02
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------

virsh # domifaddr okdcluster-controller-03
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------

virsh # 
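
A note for this check: `virsh domifaddr` only reports addresses learned from libvirt's own DHCP leases or the QEMU guest agent, and these guests sit on the OVN-managed kimgtnet0 bridge, so an empty result is not conclusive on its own. A hedged cross-check from the hypervisor:

```bash
# Ask libvirt to read addresses from the ARP cache instead of DHCP leases.
virsh domifaddr okdcluster-controller-01 --source arp

# See which MAC/IP pairs OVN has assigned to the logical ports.
ovn-nbctl show

# Inspect the interface that backs the controller's vNIC.
ovs-vsctl list interface veth0-0a000001
```
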
gmarcy commented 2 years ago

Still trying to get anything working on F35 Workstation; it behaves like a different operating system than F35 Server. I'm noticing in particular that libvirtd is often inactive even when there are virtual machines running, so we don't always clean up old virtual machines.
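
On recent Fedora releases libvirtd is socket-activated and exits after an idle timeout, which may be what makes it look inactive while VMs keep running; a hedged check/workaround, assuming the stock libvirtd systemd units:

```bash
# Is libvirtd only being started on demand via its sockets?
systemctl status libvirtd.service libvirtd.socket --no-pager

# Keep the daemon running instead of letting it exit when idle.
sudo systemctl enable --now libvirtd.service
```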

FYI, this is my kickstart output in case anything different jumps out to you

# Generated by Anaconda 35.22.2
# Generated by pykickstart v3.34
#version=DEVEL
# Use graphical install
graphical

# Keyboard layouts
keyboard --vckeymap=us --xlayouts='us'
# System language
lang en_US.UTF-8

%packages
@^workstation-product-environment

%end

# Run the Setup Agent on first boot
firstboot --enable

# Generated using Blivet version 3.4.2
ignoredisk --only-use=nvme0n1
autopart
# Partition clearing information
clearpart --none --initlabel

# System timezone
timezone America/New_York --utc

#Root password
rootpw --lock
Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

Where do the VMs get their IP addresses from? When created, the VMs are configured to use DHCP and are connected to the br-int bridge.

gmarcy commented 2 years ago

IIRC, it's a combination of things...

# ovn-nbctl show
switch 7c611c5c-be12-4382-a6a1-95da0b35882a (sw-okdcluster)
    port sw-okdcluster-lr0
        type: router
        router-port: lr0-sw-okdcluster
    port bc2788b6-44ac-5cc8-b682-2e136f346347
        addresses: ["52:54:00:11:9f:95 10.0.0.253"]
    port 9a1cf3eb-91a4-568e-8a83-a9571341c0d7
        addresses: ["52:54:00:41:8d:6d 10.0.0.3"]
    port 2468be6a-70a0-5172-bae3-3ee42af2b4a6
        addresses: ["52:54:00:cb:2c:dd 10.0.0.2"]
    port 011de5eb-faba-5cf3-9765-3781278fb400
        addresses: ["52:54:00:92:77:d0 10.0.0.1"]

and

# ovs-vsctl list interface veth0-0a000001
...
external_ids        : {attached-mac="52:54:00:92:77:d0", iface-id="011de5eb-faba-5cf3-9765-3781278fb400", iface-status=active, ovn-installed="true", ovn-installed-ts="1645218538948", vm-id="800814ee-19d5-4ea6-a291-028443ffe8ea"}
Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

Switched to CentOS 8 Stream for the hypervisor and deployed with the quay image method.
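
For readers landing here, that quay image method is the same podman invocation used earlier in the thread, just pulling a prebuilt image instead of building locally; a minimal sketch, assuming the published image name quay.io/kubeinit/kubeinit:latest, an existing ~/.ssh/id_rsa key, and a hypervisor address placeholder:

```bash
# Store the private key as a podman secret, as described above.
podman secret create kubeinit_ssh_key ~/.ssh/id_rsa

# Run the playbook from the published image instead of a local build.
podman run --rm -ti \
    -e KUBEINIT_COMMON_SSH_KEYTYPE \
    --secret kubeinit_ssh_key \
    quay.io/kubeinit/kubeinit:latest \
    -v --user root \
    -e kubeinit_hypervisor_hosts_spec='[[host=hypervisor-01,ssh_hostname=<ip-address>]]' \
    -e kubeinit_spec=okd-libvirt-3-1-1 \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml
```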