Kubeinit / kubeinit

Ansible automation to have a KUBErnetes cluster INITialized as soon as possible...
https://www.kubeinit.org
Apache License 2.0

OKD cluster failed to deploy - "container already in use" error is seen in bootstrap #642

Closed by logeshwaris 2 years ago

logeshwaris commented 2 years ago

Describe the bug
I am trying to deploy an OKD cluster with 1 master and 2 worker nodes. While running the Ansible playbook, the controller nodes do not reach the Ready state even after 60 tries. When I log into the bootstrap node, I see the error "container already in use". If I remove the container, the cluster comes up fine. The error does not occur on every run; I have seen it at least 2 times out of 5.
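
In practice the workaround amounts to removing the leftover container on the bootstrap node so bootkube can recreate it under the same name. A rough sketch (the container name comes from the bootstrap logs below; whether bootkube.service needs an explicit restart afterwards depends on its restart policy):

    # On the bootstrap node, as the core user (sketch of the manual workaround)
    sudo podman rm -f kube-apiserver-render
    # optional: kick the service instead of waiting for an automatic retry
    sudo systemctl restart bootkube.service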

To Reproduce
Steps to reproduce the behavior:

  1. Clone kubeinit

  2. Run the command

    ansible-playbook \
        -v --user root \
        -e kubeinit_spec=okd-libvirt-1-2-1 \
        -i ./kubeinit/inventory \
        ./kubeinit/playbook.yml

  3. See the error. Bootstrap logs:

    Apr 06 10:59:08 bootstrap podman[8121]: 2022-04-06 10:59:08.285934624 +0000 UTC m=+1.377781571 container cleanup 6581fb6d4ff11c4d91635217c4a27d80453>
    Apr 06 10:59:18 bootstrap podman[8219]: 2022-04-06 10:59:07.13491223 +0000 UTC m=+0.082398628 image pull quay.io/openshift/okd-content@sha256:be5eb>
    Apr 06 10:59:19 bootstrap podman[8219]: 2022-04-06 10:59:19.016405592 +0000 UTC m=+11.963891960 container create 92aa4efe11dc0d1e4e99c182b209b9dc6b4>
    Apr 06 10:59:19 bootstrap podman[8219]: 2022-04-06 10:59:19.641219899 +0000 UTC m=+12.588706277 container init 92aa4efe11dc0d1e4e99c182b209b9dc6b468>
    Apr 06 10:59:19 bootstrap podman[8219]: 2022-04-06 10:59:19.674477098 +0000 UTC m=+12.621963466 container start 92aa4efe11dc0d1e4e99c182b209b9dc6b46>
    Apr 06 10:59:19 bootstrap podman[8219]: 2022-04-06 10:59:19.674690723 +0000 UTC m=+12.622177121 container attach 92aa4efe11dc0d1e4e99c182b209b9dc6b4>
    Apr 06 10:59:20 bootstrap systemd[1]: Stopping Bootstrap a Kubernetes cluster...
    Apr 06 10:59:20 bootstrap bootkube.sh[9514]: open pidfd: No such process
    Apr 06 10:59:20 bootstrap bootkube.sh[8219]: time="2022-04-06T10:59:20Z" level=error msg="Error forwarding signal 15 to container 92aa4efe11dc0d1e4e>
    Apr 06 10:59:20 bootstrap bootkube.sh[2056]: Terminated
    Apr 06 10:59:20 bootstrap podman[9521]: 2022-04-06 10:59:20.28349949 +0000 UTC m=+0.040186130 container died 92aa4efe11dc0d1e4e99c182b209b9dc6b46848>
    Apr 06 10:59:20 bootstrap systemd[1]: bootkube.service: Deactivated successfully.
    Apr 06 10:59:20 bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
    Apr 06 10:59:20 bootstrap systemd[1]: bootkube.service: Consumed 33.650s CPU time.
    Apr 06 10:59:20 bootstrap systemd[1]: release-image.service: Deactivated successfully.
    Apr 06 10:59:20 bootstrap systemd[1]: Stopped Download the OpenShift Release Image.
    Apr 06 10:59:20 bootstrap systemd[1]: release-image.service: Consumed 12.351s CPU time.
    -- Boot c93e0d5bc8b44038b0d5d265ed467c93 --
    Apr 06 10:59:31 bootstrap systemd[1]: Starting Download the OpenShift Release Image...
    Apr 06 10:59:31 bootstrap release-image-download.sh[966]: Pulling service.okdcluster.kubeinit.local:5000/okd@sha256:7d8356245fc3a75fe11d1832ce9fef17>
    Apr 06 10:59:32 bootstrap podman[1015]: 2022-04-06 10:59:32.079196063 +0000 UTC m=+0.961207467 system refresh
    Apr 06 10:59:32 bootstrap release-image-download.sh[1015]: 5c93a0adf473e01f1bd88d3e539dbbe6de5bcfb74eace85038a63490f9603143
    Apr 06 10:59:32 bootstrap podman[1015]: 2022-04-06 10:59:32.080829538 +0000 UTC m=+0.962840932 image pull service.okdcluster.kubeinit.local:5000/ok>
    Apr 06 10:59:33 bootstrap systemd[1]: Finished Download the OpenShift Release Image.
    Apr 06 10:59:41 bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
    . . . . . .
    Apr 06 11:35:46 bootstrap podman[308085]: 2022-04-06 11:35:46.277425197 +0000 UTC m=+0.499459314 container remove d5440565f1b94e5a176c11750c60d4d45861976990b4f5f1aa56bdace09eb412 (image=service.okdcluster.kubeinit.local:5000/okd@sha256:7d8356245fc3a75fe11d1832ce9fef17f3dd0f2ea6f38271319c95918416b9d9, name=quizzical_ellis, io.openshift.release=4.9.0-0.okd-2021-11-28-035710, io.openshift.release.base-image-digest=sha256:24a6759ce7d34123ae68ee14ee2a7c52ec3b2c7a5ae65cf87651176661e55e58)
    Apr 06 11:35:46 bootstrap bootkube.sh[306030]: Rendering Kubernetes API server core manifests...
    Apr 06 11:35:46 bootstrap bootkube.sh[308213]: Error: error creating container storage: the container name "kube-apiserver-render" is already in use by "92aa4efe11dc0d1e4e99c182b209b9dc6b468483438865d8a2bcef825b22c65b". You have to remove that container to be able to reuse that name.: that name is already in use
    Apr 06 11:35:46 bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=125/n/a
    Apr 06 11:35:46 bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
    Apr 06 11:35:46 bootstrap systemd[1]: bootkube.service: Consumed 4.452s CPU time.

[core@bootstrap ~]$ sudo podman ps -a
CONTAINER ID  IMAGE                                                                                                               COMMAND               CREATED         STATUS                     PORTS  NAMES
ed027262a5fa  service.okdcluster.kubeinit.local:5000/okd@sha256:7d8356245fc3a75fe11d1832ce9fef17f3dd0f2ea6f38271319c95918416b9d9  render --output-d...  38 minutes ago  Exited (0) 38 minutes ago         cvo-render
95b964e69d58  quay.io/openshift/okd-content@sha256:8c24b5ca67f5cd7763dbcb1586cfcfcff2083eae137acfea6f9b0468fcd2e8e6               /usr/bin/cluster-...  37 minutes ago  Exited (0) 37 minutes ago         etcd-render
6581fb6d4ff1  quay.io/openshift/okd-content@sha256:5a262a1ca5b05a174286494220a1f583ed1fcb2fb60114aae25f6d2670699746               /usr/bin/cluster-...  37 minutes ago  Exited (0) 37 minutes ago         config-render
92aa4efe11dc  quay.io/openshift/okd-content@sha256:be5eb9ef4a8c26ce7e5827285a4e65620aa7b31c9fb203e046c900a45b095764               /usr/bin/cluster-...  36 minutes ago  Created                           kube-apiserver-render
[core@bootstrap ~]$
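
The logs and the podman ps output above point to a name collision: the kube-apiserver-render container created before the reboot is still present (in Created state), so when bootkube runs the render step again it cannot create a new container with the same fixed name. As a hedged sketch only (this is not the actual bootkube.sh code, which ships with the OKD release image; the image reference and arguments are placeholders), a retrying script can avoid this in two common ways:

    # 1) Remove any leftover container with the fixed name before creating a new one.
    podman rm -f kube-apiserver-render 2>/dev/null || true
    podman run --name kube-apiserver-render "$RELEASE_IMAGE" render ...

    # 2) Let podman replace an existing container that already uses the name (recent podman versions).
    podman run --replace --name kube-apiserver-render "$RELEASE_IMAGE" render ...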

Expected behavior
A running OKD cluster with 1 master and 2 worker nodes.
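
As a quick illustration of that end state (hedged; where the cluster kubeconfig lives depends on the setup), all three nodes should eventually report Ready:

    # Illustration only: check node readiness once the cluster kubeconfig is available.
    oc --kubeconfig <path-to-cluster-kubeconfig> get nodes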

Infrastructure
Hypervisors OS: CentOS Stream 8
CPUs: 32 cores
Memory: 128 GB
HDD: 1 TB

Deployment command

ansible-playbook \
    -v --user root \
    -e kubeinit_spec=okd-libvirt-1-2-1 \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml

Inventory file diff

diff --git a/kubeinit/inventory b/kubeinit/inventory
index bbb380d..d862b0e 100644
--- a/kubeinit/inventory
+++ b/kubeinit/inventory
@@ -72,8 +72,8 @@ kubeinit_inventory_network_name=kimgtnet0

 [hypervisor_hosts]
 hypervisor-01 ansible_host=nyctea
-hypervisor-02 ansible_host=tyto
-# hypervisor-01 ansible_host=nyctea ssh_hostname=server1.example.com
+#hypervisor-02 ansible_host=tyto
+# hypervisor-01 ansible_host=nyctea
.
.
.
 [controller_nodes:vars]
 os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
-disk=25G
+disk=150G
 ram=25165824
 vcpus=8
 maxvcpus=16
@@ -152,8 +152,8 @@ target_order=hypervisor-01

 [compute_nodes:vars]
 os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
-disk=30G
-ram=8388608
+disk=100G
+ram=16777216
 vcpus=8
 maxvcpus=16
 type=virtual

ccamacho commented 2 years ago

Hi @logeshwaris, from what I was able to see, the error looks like something specific to OKD rather than the automation that deploys it. Currently we deploy 4.9, but there are newer versions available; let me check whether updating the version makes the problem go away.
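
For context, bumping the release usually means overriding the OKD release variables that the kubeinit_okd role defaults to. The snippet below only illustrates passing such an override on the command line; the variable name kubeinit_okd_registry_release_tag and the tag value are assumptions, not the actual change made for this issue:

    # Hypothetical override of the OKD release at deploy time; variable name and tag are assumptions.
    ansible-playbook \
        -v --user root \
        -e kubeinit_spec=okd-libvirt-1-2-1 \
        -e kubeinit_okd_registry_release_tag=<newer-okd-release-tag> \
        -i ./kubeinit/inventory \
        ./kubeinit/playbook.yml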

ccamacho commented 2 years ago

Let's see how it goes here https://github.com/Kubeinit/kubeinit/pull/643

logeshwaris commented 2 years ago

Hi @ccamacho, I tried using the latest version and I am seeing the error below. Am I missing something?

Command:

ansible-playbook \
    -v --user root \
    -e kubeinit_spec=okd-libvirt-1-2-1 \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml

Logs:

TASK [kubeinit.kubeinit.kubeinit_prepare : Create ssh config file from template] **
task path: /home/slogeshw/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_prepare/tasks/create_host_ssh_config.yml:53
Monday 11 April 2022  11:23:31 +0530 (0:00:00.209)       0:00:16.327 **
<127.0.0.1> ESTABLISH LOCAL CONNECTION FOR USER: slogeshw
<127.0.0.1> EXEC /bin/sh -c 'echo ~slogeshw && sleep 0'
<127.0.0.1> EXEC /bin/sh -c '( umask 77 && mkdir -p "echo /home/slogeshw/.ansible/tmp"&& mkdir "echo /home/slogeshw/.ansible/tmp/ansible-tmp-1649656411.4216487-1386120-60959960057760" && echo ansible-tmp-1649656411.4216487-1386120-60959960057760="echo /home/slogeshw/.ansible/tmp/ansible-tmp-1649656411.4216487-1386120-60959960057760" ) && sleep 0'
<127.0.0.1> EXEC /bin/sh -c 'rm -f -r /home/slogeshw/.ansible/tmp/ansible-tmp-1649656411.4216487-1386120-60959960057760/ > /dev/null 2>&1 && sleep 0'
The full traceback is:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/ansible/template/__init__.py", line 1100, in do_template
    res = j2_concat(rf)
  File "