Kubeinit / kubeinit

Ansible automation to have a KUBErnetes cluster INITialized as soon as possible...
https://www.kubeinit.org
Apache License 2.0

Not able to complete a deployment - failed during the sampleapp validation #633

Closed Gl1TcH-1n-Th3-M4tR1x closed 2 years ago

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

**Describe the bug**
After running a deployment using the container image, it fails during the sampleapp validation.

**To Reproduce**
Steps to reproduce the behavior:

  1. Clone kubeinit
  2. Run from the container with:
    podman run --rm -it -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z -v ~/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:z -v ~/.ssh/config:/root/.ssh/cong:z -v ./kubeinit/inventory:/kubeinit/kubeinit/inventory quay.io/kubeinit/kubeinit:2.0.1 -vvv --user root -e kubeinit_spec=okd-libvirt-1-3-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml
  3. Error:
    
    TASK [kubeinit.kubeinit.kubeinit_apps : Wait until pods are created] ******************************************************************************
    task path: /root/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_apps/tasks/sampleapp.yml:35
    Using module file /usr/lib/python3.9/site-packages/ansible/modules/command.py
    Pipelining is enabled.
    <10.0.0.253> ESTABLISH SSH CONNECTION FOR USER: root
    <10.0.0.253> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/okdcluster_id_rsa' -o 'ProxyCommand=ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i ~/.ssh/okdcluster_id_rsa -W %h:%p -q root@nyctea' -o ControlPath=/root/.ansible/cp/2a631e4199 10.0.0.253 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
    <10.0.0.253> (0, b'\n{"changed": true, "stdout": "sampleapp-6684887657-dspc5   0/1     Pending   0          0s\\nsampleapp-6684887657-fr8r5   0/1     Pending   0          0s\\nsampleapp-6684887657-t69wp   0/1     Pending   0          0s\\nsampleapp-6684887657-zl59s   0/1     Pending   0          0s", "stderr": "", "rc": 0, "cmd": "set -o pipefail\\nkubectl get pods --namespace=sampleapp | grep sampleapp\\n", "start": "2022-04-01 14:55:34.298003", "end": "2022-04-01 14:55:34.371286", "delta": "0:00:00.073283", "msg": "", "invocation": {"module_args": {"executable": "/bin/bash", "_raw_params": "set -o pipefail\\nkubectl get pods --namespace=sampleapp | grep sampleapp\\n", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "creates": null, "removes": null, "stdin": null}}}\n', b'')
    changed: [localhost -> service(10.0.0.253)] => {
    "attempts": 1,
    "changed": true,
    "cmd": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep sampleapp\n",
    "delta": "0:00:00.073283",
    "end": "2022-04-01 14:55:34.371286",
    "invocation": {
        "module_args": {
            "_raw_params": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep sampleapp\n",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": "/bin/bash",
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": false
        }
    },
    "msg": "",
    "rc": 0,
    "start": "2022-04-01 14:55:34.298003",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "sampleapp-6684887657-dspc5   0/1     Pending   0          0s\nsampleapp-6684887657-fr8r5   0/1     Pending   0          0s\nsampleapp-6684887657-t69wp   0/1     Pending   0          0s\nsampleapp-6684887657-zl59s   0/1     Pending   0          0s",
    "stdout_lines": [
        "sampleapp-6684887657-dspc5   0/1     Pending   0          0s",
        "sampleapp-6684887657-fr8r5   0/1     Pending   0          0s",
        "sampleapp-6684887657-t69wp   0/1     Pending   0          0s",
        "sampleapp-6684887657-zl59s   0/1     Pending   0          0s"
    ]
    }

    TASK [kubeinit.kubeinit.kubeinit_apps : Wait until pods are running] ******************************************************************************
    task path: /root/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_apps/tasks/sampleapp.yml:48
    Using module file /usr/lib/python3.9/site-packages/ansible/modules/command.py
    Pipelining is enabled.
    <10.0.0.253> ESTABLISH SSH CONNECTION FOR USER: root
    <10.0.0.253> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/okdcluster_id_rsa' -o 'ProxyCommand=ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i ~/.ssh/okdcluster_id_rsa -W %h:%p -q root@nyctea' -o ControlPath=/root/.ansible/cp/2a631e4199 10.0.0.253 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
    <10.0.0.253> (1, b'\n{"changed": true, "stdout": "", "stderr": "", "rc": 1, "cmd": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n", "start": "2022-04-01 14:55:34.527107", "end": "2022-04-01 14:55:34.597990", "delta": "0:00:00.070883", "failed": true, "msg": "non-zero return code", "invocation": {"module_args": {"executable": "/bin/bash", "_raw_params": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "creates": null, "removes": null, "stdin": null}}}\n', b'')
    <10.0.0.253> Failed to connect to the host via ssh:
    FAILED - RETRYING: [localhost -> service]: Wait until pods are running (60 retries left). Result was: {
    "attempts": 1,
    "changed": false,
    "cmd": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
    "delta": "0:00:00.070883",
    "end": "2022-04-01 14:55:34.597990",
    "invocation": {
        "module_args": {
            "_raw_params": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": "/bin/bash",
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": false
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "retries": 61,
    "start": "2022-04-01 14:55:34.527107",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "",
    "stdout_lines": []
    }

    < 59 attempts later >
    fatal: [localhost -> service(10.0.0.253)]: FAILED! => {
    "attempts": 60,
    "changed": false,
    "cmd": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
    "delta": "0:00:00.071734",
    "end": "2022-04-01 15:00:52.670565",
    "invocation": {
        "module_args": {
            "_raw_params": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": "/bin/bash",
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": false
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2022-04-01 15:00:52.598831",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "",
    "stdout_lines": []
    }

  4. Troubleshooting

    oc get deployments -n sampleapp
    NAME        READY   UP-TO-DATE   AVAILABLE   AGE
    sampleapp   0/4     4            0           3h32m

    sh-4.4# oc describe pod sampleapp-6684887657-vqnfw -n sampleapp
    Name:         sampleapp-6684887657-vqnfw
    Namespace:    sampleapp
    Priority:     0
    Node:         worker1/10.0.0.3
    Start Time:   Fri, 01 Apr 2022 18:30:23 +0000
    Labels:       app=sampleapp
                  pod-template-hash=6684887657
    Annotations:  k8s.v1.cni.cncf.io/network-status: [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.102.0.9" ], "default": true, "dns": {} }]
                  k8s.v1.cni.cncf.io/networks-status: [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.102.0.9" ], "default": true, "dns": {} }]
                  openshift.io/scc: restricted
    Status:       Pending
    IP:           10.102.0.9
    IPs:
      IP:           10.102.0.9
    Controlled By:  ReplicaSet/sampleapp-6684887657
    Containers:
      nginx:
        Container ID:
        Image:          quay.io/bitnami/nginx:latest
        Image ID:
        Port:           80/TCP
        Host Port:      0/TCP
        State:          Waiting
          Reason:       ImagePullBackOff
        Ready:          False
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nll5f (ro)
    Conditions:
      Type              Status
      Initialized       True
      Ready             False
      ContainersReady   False
      PodScheduled      True
    Volumes:
      kube-api-access-nll5f:
        Type:                    Projected (a volume that contains injected data from multiple sources)
        TokenExpirationSeconds:  3607
        ConfigMapName:           kube-root-ca.crt
        ConfigMapOptional:       <nil>
        DownwardAPI:             true
        ConfigMapName:           openshift-service-ca.crt
        ConfigMapOptional:       <nil>
    QoS Class:                   BestEffort
    Node-Selectors:              <none>
    Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:
      Type     Reason          Age                     From               Message
      ----     ------          ----                    ----               -------
      Normal   Scheduled       8m27s                   default-scheduler  Successfully assigned sampleapp/sampleapp-6684887657-vqnfw to worker1
      Normal   AddedInterface  8m25s                   multus             Add eth0 [10.102.0.9/23] from openshift-sdn
      Warning  Failed          7m24s                   kubelet            Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 34.225.41.113:443: connect: no route to host
      Normal   Pulling         6m34s (x4 over 8m25s)   kubelet            Pulling image "quay.io/bitnami/nginx:latest"
      Warning  Failed          6m28s (x3 over 8m19s)   kubelet            Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 54.144.203.57:443: connect: no route to host
      Warning  Failed          6m28s (x4 over 8m19s)   kubelet            Error: ErrImagePull
      Warning  Failed          6m15s (x6 over 8m18s)   kubelet            Error: ImagePullBackOff
      Normal   BackOff         3m20s (x18 over 8m18s)  kubelet            Back-off pulling image "quay.io/bitnami/nginx:latest"
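
The events point at worker1 not being able to reach quay.io at all. A quick way to double-check that from the cluster side is standard OKD node debugging (nothing kubeinit-specific; `worker1` and the image name are taken from the output above):

```
# Run the same reachability check the kubelet is failing on, directly on the node
oc debug node/worker1 -- chroot /host curl -sI https://quay.io/v2/
# Try the pull by hand on the node to compare with the kubelet error
oc debug node/worker1 -- chroot /host podman pull quay.io/bitnami/nginx:latest
```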


**Expected behavior**
A running OKD cluster with 1 Master and 3 Workers

**Infrastructure**
 - Hypervisors OS: CentOS-Stream 8
 - CPUs: 32 cores
 - Memory: 128 GB
 - HDD: 1TB

**Deployment command**

podman run --rm -it -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z -v ~/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:z -v ~/.ssh/config:/root/.ssh/cong:z -v ./kubeinit/inventory:/kubeinit/kubeinit/inventory quay.io/kubeinit/kubeinit:2.0.1 -vvv --user root -e kubeinit_spec=okd-libvirt-1-3-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml


**Inventory file**

```
#
# Common variables for the inventory
#

[all:vars]

#
# Internal variables
#

ansible_python_interpreter=/usr/bin/python3
ansible_ssh_pipelining=True
ansible_ssh_common_args='-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new'

#
# Inventory variables
#

#
# The default for the cluster name is {{ kubeinit_cluster_distro + 'cluster' }}
# You can override this by setting a specific value in kubeinit_inventory_cluster_name

# kubeinit_inventory_cluster_name=mycluster
kubeinit_inventory_cluster_domain=kubeinit.local

kubeinit_inventory_network_name=kimgtnet0

kubeinit_inventory_network=10.0.0.0/24
kubeinit_inventory_gateway_offset=-2
kubeinit_inventory_nameserver_offset=-3
kubeinit_inventory_dhcp_start_offset=1
kubeinit_inventory_dhcp_end_offset=-4

kubeinit_inventory_controller_name_pattern=controller-%02d
kubeinit_inventory_compute_name_pattern=compute-%02d

kubeinit_inventory_post_deployment_services="none"

#
# Cluster definitions
#

# The networks you will use for your kubeinit clusters.  The network name will be used
# to create a libvirt network for the cluster guest vms.  The network cidr will set
# the range of addresses reserved for the cluster nodes.  The gateway offset will be
# used to select the gateway address within the range, a negative offset starts at the
# end of the range, so for network=10.0.0.0/24, gateway_offset=-2 will select 10.0.0.254
# and gateway_offset=1 will select 10.0.0.1 as the address.  Other offset attributes
# follow the same convention.
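#
# Worked out from the offset convention above (assuming the default network=10.0.0.0/24),
# the inventory defaults resolve to:
#   gateway_offset=-2     -> 10.0.0.254
#   nameserver_offset=-3  -> 10.0.0.253
#   dhcp_start_offset=1   -> 10.0.0.1
#   dhcp_end_offset=-4    -> 10.0.0.252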

[kubeinit_networks]
# kimgtnet0 network=10.0.0.0/24 gateway_offset=-2 nameserver_offset=-3 dhcp_start_offset=1 dhcp_end_offset=-4
# kimgtnet1 network=10.0.1.0/24 gateway_offset=-2 nameserver_offset=-3 dhcp_start_offset=1 dhcp_end_offset=-4

# The clusters you are deploying using kubeinit.  If there are no clusters defined here
# then kubeinit will assume you are only using one cluster at a time and will use the
# network defined by kubeinit_inventory_network.

[kubeinit_clusters]
# cluster0 network_name=kimgtnet0
# cluster1 network_name=kimgtnet1
#
# If variables are defined in this section, they will take precedence when setting
# kubeinit_inventory_post_deployment_services and kubeinit_inventory_network_name
#
# clusterXXX network_name=kimgtnetXXX post_deployment_services="none"
# clusterYYY network_name=kimgtnetYYY post_deployment_services="none"

#
# Hosts definitions
#

# The cluster's guest machines can be distributed across multiple hosts. By default they
# will be deployed on the first hypervisor. These hypervisors are activated and used
# depending on how they are referenced in the kubeinit spec string.

[hypervisor_hosts]
hypervisor-01 ansible_host=nyctea
hypervisor-02 ansible_host=tyto

# The inventory will have one host identified as the bastion host. By default, this role will
# be assumed by the first hypervisor, which is the same behavior as the first commented out
# line. The second commented out line would set the second hypervisor to be the bastion host.
# The final commented out line would set the bastion host to be a different host that is not
# being used as a hypervisor for the guest VMs for the clusters using this inventory.

[bastion_host]
# bastion target=hypervisor-01
# bastion target=hypervisor-02
# bastion ansible_host=bastion

# The inventory will have one host identified as the ovn-central host.  By default, this role
# will be assumed by the first hypervisor, which is the same behavior as the first commented
# out line.  The second commented out line would set the second hypervisor to be the ovn-central
# host.

[ovn_central_host]
# ovn-central target=hypervisor-01
# ovn-central target=hypervisor-02

#
# Cluster node definitions
#

# Controller, compute, and extra nodes can be configured as virtual machines or using the
# manually provisioned baremetal machines for the deployment.

# Only use an odd number of controller nodes, which means enabling only 1, 3, or 5 controller nodes
# at a time.

[controller_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
disk=120G
ram=25165824
vcpus=8
maxvcpus=16
type=virtual
target_order=hypervisor-01

[controller_nodes]

[compute_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
disk=120G
ram=25165824
vcpus=8
maxvcpus=16
type=virtual
target_order="hypervisor-02,hypervisor-01"

[compute_nodes]

[extra_nodes:vars]
os={'cdk': 'ubuntu', 'okd': 'coreos'}
disk=20G
ram={'cdk': '8388608', 'okd': '16777216'}
vcpus=8
maxvcpus=16
type=virtual
target_order="hypervisor-02,hypervisor-01"

[extra_nodes]
juju-controller distro=cdk
bootstrap distro=okd

# Service nodes are a set of service containers sharing the same pod network.
# There is an implicit 'provision' service container which will use a base os
# container image based upon the service_nodes:vars os attribute.

[service_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'centos', 'rke': 'ubuntu'}
target_order=hypervisor-01

[service_nodes]
service services="bind,dnsmasq,haproxy,apache,registry"

```

ccamacho commented 2 years ago

Looks like a quay issue??

kubelet            Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 54.144.203.57:443: connect: no route to host
Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

Yeah, it looks like the sampleapp Pod can't reach Quay... but strangely enough okd-cluster-provision can... any tips for troubleshooting this scenario?

ccamacho commented 2 years ago

> Yeah, it looks like the sampleapp Pod can't reach Quay... but strangely enough okd-cluster-provision can... any tips for troubleshooting this scenario?

Yeah, that is mostly because for some things we pull from docker (provisioning) and for the infra steps (and the sample app) we pull from quay; maybe we could converge on pulling everything from the same source.
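
In the meantime, a possible workaround (just a sketch, not a kubeinit fix) is to repoint the sample app at Docker Hub, since `bitnami/nginx` is also published there and the container in the deployment is named `nginx` per the `oc describe` output above. If the worker has no egress route at all, this of course won't help either:

```
# Pull the sample app image from Docker Hub instead of quay.io
kubectl -n sampleapp set image deployment/sampleapp nginx=docker.io/bitnami/nginx:latest
# Watch whether the new pods manage to pull the image
kubectl -n sampleapp rollout status deployment/sampleapp
```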

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

Hmm, it looks more like a routing issue; the worker can't reach quay.io to pull the image.

kubelet            Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 34.225.41.113:443: connect: no route to host

After logging into worker1:

[core@worker1 ~]$ ssh test@8.8.8.8
ssh: connect to host 8.8.8.8 port 22: No route to host
[core@worker1 ~]$ ssh -p 443 quay.io
ssh: connect to host quay.io port 443: No route to host
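
A few more checks from worker1 could narrow down whether the default route itself is missing (a rough sketch; the gateway and service addresses below assume the default inventory offsets, i.e. 10.0.0.254 and 10.0.0.253):

```
ip route show                     # is there a default route, and via which gateway?
ping -c 1 10.0.0.254              # the cluster gateway (gateway_offset=-2)
ping -c 1 10.0.0.253              # the service node running bind/dnsmasq/haproxy/registry
curl -sI https://quay.io/v2/      # the same endpoint the kubelet cannot reach
```
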
ccamacho commented 2 years ago

@Gl1TcH-1n-Th3-M4tR1x I have been experiencing similar issues across all the distros... This PR fixed things for the CI: https://github.com/Kubeinit/kubeinit/pull/655
Did you manage to get past that problem?

ccamacho commented 2 years ago

@Gl1TcH-1n-Th3-M4tR1x there was a major breakage because the podman versions were not consistent across the different components that are deployed; after #666 I haven't reproduced this anymore.

Gl1TcH-1n-Th3-M4tR1x commented 2 years ago

I can successfully deploy the cluster now.