crc-org / crc

CRC is a tool to help you run containers. It manages a local OpenShift 4.x cluster, MicroShift, or a Podman VM optimized for testing and development purposes.
https://crc.dev
Apache License 2.0

[BUG] All pods failing to pull images #3373

Closed IBMRob closed 11 months ago

IBMRob commented 1 year ago

General information

CRC version

CRC version: 2.9.0+589ab2cd
OpenShift version: 4.11.3
Podman version: 4.2.0

CRC status

DEBU CRC version: 2.9.0+589ab2cd                  
DEBU OpenShift version: 4.11.3                    
DEBU Podman version: 4.2.0                        
DEBU Running 'crc status'                         
DEBU Checking file: /Users/convery/.crc/machines/crc/.crc-exist 
DEBU Checking file: /Users/convery/.crc/machines/crc/.crc-exist 
DEBU Running SSH command: df -B1 --output=size,used,target /sysroot | tail -1 
DEBU Using ssh private keys: [/Users/convery/.crc/machines/crc/id_ecdsa /Users/convery/.crc/cache/crc_vfkit_4.11.3_arm64/id_ecdsa_crc] 
DEBU SSH command results: err: <nil>, output: 32737570816 13377323008 /sysroot 
CRC VM:          Running
OpenShift:       Running (v4.11.3)
Podman:          
Disk Usage:      13.38GB of 32.74GB (Inside the CRC VM)
Cache Usage:     36.85GB
Cache Directory: /Users/convery/.crc/cache

CRC config

- consent-telemetry                     : no

Host Operating System

ProductName:    macOS
ProductVersion: 12.6
BuildVersion:   21G115

Steps to reproduce

  1. Start crc
  2. Go to the cluster UI and view the pods in the openshift-marketplace project

Expected

All pods to be running

Actual

All pods except one are failing; they time out trying to reach Red Hat.


Looks like a networking issue given Failed to pull image "registry.redhat.io/redhat/certified-operator-index:v4.11": rpc error: code = Unknown desc = Get "https://registry.redhat.io/auth/realms/rhcc/protocol/redhat-docker-v2/auth?scope=repository%3Aredhat%2Fcertified-operator-index%3Apull&service=docker-registry": dial tcp: lookup registry.redhat.io on 192.168.127.1:53: read udp 192.168.127.2:60694->192.168.127.1:53: i/o timeout

If I try and pull a random image it also fails i.e. Failed to pull image "nginx": rpc error: code = Unknown desc = pinging container registry registry-1.docker.io: Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io on 192.168.127.1:53: read udp 192.168.127.2:55975->192.168.127.1:53: i/o timeout
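Both failures point at the same resolver: the lookup goes to 192.168.127.1:53 (the user-mode network stack inside the CRC VM) and times out. As a quick triage aid, a small helper like the following (hypothetical, not part of crc itself) can pull the failing resolver address out of such error strings:

```shell
# Extract the DNS resolver address from a CRI-O/podman pull error string.
# Hypothetical triage helper -- not part of crc.
extract_resolver() {
  # The error reads "... lookup <host> on <ip>:53: ..."; print <ip>.
  printf '%s\n' "$1" | sed -n 's/.* on \([0-9.]*\):53.*/\1/p'
}

err='dial tcp: lookup registry.redhat.io on 192.168.127.1:53: read udp 192.168.127.2:60694->192.168.127.1:53: i/o timeout'
extract_resolver "$err"   # prints 192.168.127.1
```

If the address is the 192.168.127.1 gateway, querying it directly from inside the VM (for example dig @192.168.127.1 registry.redhat.io versus dig @8.8.8.8 registry.redhat.io, assuming dig is installed there) shows whether the user-mode resolver itself is dropping the query.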

Logs

Before gathering the logs, try the following to see if it fixes your issue:

$ crc delete -f
$ crc cleanup
$ crc setup
$ crc start --log-level debug

Please consider posting the output of crc start --log-level debug on http://gist.github.com/ and linking it in the issue.

praveenkumar commented 1 year ago

@IBMRob Can you ssh into the VM using https://github.com/code-ready/crc/wiki/Debugging-guide and try to pull an image with podman, to determine whether the network issue is inside the VM or on the cluster side?

IBMRob commented 1 year ago

Inside the VM I was able to curl both quay.io and registry.access.redhat.com, so the VM itself appears to have network access, but when attempting to use podman we see the error:

curl

curl -s -o /dev/null -w "%{http_code}" https://quay.io 
200
curl -s -o /dev/null -w "%{http_code}" https://registry.access.redhat.com
301

podman

[core@crc-j2d48-master-0 ~]$ podman pull registry.redhat.io/redhat/certified-operator-index:v4.11
Trying to pull registry.redhat.io/redhat/certified-operator-index:v4.11...
Error: initializing source docker://registry.redhat.io/redhat/certified-operator-index:v4.11: pinging container registry registry.redhat.io: Get "https://registry.redhat.io/v2/": dial tcp: lookup registry.redhat.io on 192.168.127.1:53: read udp 192.168.127.2:46800->192.168.127.1:53: i/o timeout

I did notice, when listing all pods, that it lists

openshift-dns-operator                       dns-operator-dbb946c7b-t4c6r                            1/2     CrashLoopBackOff   8 (28s ago)       25d

I wonder if this is causing networking issues within the cluster?
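The previous logs of a crash-looping operator pod usually show why it is restarting. A sketch of the commands to check (the dns-operator container name is an assumption based on the 1/2 READY column):

```shell
# Inspect the crashing operator and its previous container logs.
oc -n openshift-dns-operator get pods
oc -n openshift-dns-operator logs deployment/dns-operator -c dns-operator --previous

# The operand pods live in openshift-dns; check those too.
oc -n openshift-dns get pods -o wide
```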

praveenkumar commented 1 year ago

This looks like #2597 can you provide details like mentioned in https://github.com/code-ready/crc/issues/2597#issuecomment-884763233 ?

mhcastro commented 1 year ago

@IBMRob I get the same issue but the dns pods are running ... so, this should be unrelated. I've opened a similar issue https://github.com/code-ready/crc/issues/3377.

oc get pods -n openshift-dns-operator
NAME                           READY   STATUS    RESTARTS   AGE
dns-operator-85cf76d46-hwzqk   2/2     Running   0          11d

oc get pods -n openshift-dns
NAME                  READY   STATUS    RESTARTS   AGE
dns-default-7m6jv     2/2     Running   0          12d
node-resolver-kkvnz   1/1     Running   0          12d
mhcastro commented 1 year ago

Hi @IBMRob - are you running on Apple M1 Max silicon? I didn't have issues with CRC on an Intel-based Mac; I only got this on the new Apple M1 Max.

IBMRob commented 1 year ago

@mhcastro - Yes this only happens on my M1 Max version. I didn't have any problems on my previous intel Mac

mhcastro commented 1 year ago

@mhcastro - Yes this only happens on my M1 Max version. I didn't have any problems on my previous intel Mac

Same as me.

praveenkumar commented 1 year ago

By any chance, do you also have podman-machine running in parallel?

IBMRob commented 1 year ago

No I don't have podman-machine installed

mhcastro commented 1 year ago

Hi @praveenkumar - any plans to fix this issue?

cfergeau commented 1 year ago

I've just tried this with the latest crc release (2.10.1) on an M1 MacBook and I don't see this issue. The openshift-marketplace pods are all running. Connecting to registry.redhat.io from within the VM "works" (it connects but fails with 'invalid username/password', which is expected).

mhcastro commented 1 year ago

I tried again.

$ crc version
CRC version: 2.10.1+7e7f6b2d
OpenShift version: 4.11.7
Podman version: 4.2.0

... and I still get the same issue. This is what happens when I run crc setup and crc start.

$ crc setup
INFO Using bundle path /Users/xxx/.crc/cache/crc_vfkit_4.11.7_arm64.crcbundle
INFO Checking if running as non-root
INFO Checking if crc-admin-helper executable is cached
INFO Checking for obsolete admin-helper executable
INFO Checking if running on a supported CPU architecture
INFO Checking minimum RAM requirements
INFO Checking if crc executable symlink exists
INFO Creating symlink for crc executable
INFO Checking if running emulated on a M1 CPU
INFO Checking if vfkit is installed
INFO Checking if CRC bundle is extracted in '$HOME/.crc'
INFO Checking if /Users/xxx/.crc/cache/crc_vfkit_4.11.7_arm64.crcbundle exists
INFO Checking if old launchd config for tray and/or daemon exists
INFO Checking if crc daemon plist file is present and loaded
INFO Adding crc daemon plist file and loading it
Your system is correctly setup for using CRC. Use 'crc start' to start the instance

$ crc start
INFO Checking if running as non-root
INFO Checking if crc-admin-helper executable is cached
INFO Checking for obsolete admin-helper executable
INFO Checking if running on a supported CPU architecture
INFO Checking minimum RAM requirements
INFO Checking if crc executable symlink exists
INFO Checking if running emulated on a M1 CPU
INFO Checking if vfkit is installed
INFO Checking if old launchd config for tray and/or daemon exists
INFO Checking if crc daemon plist file is present and loaded
INFO Loading bundle: crc_vfkit_4.11.7_arm64...
CRC requires a pull secret to download content from Red Hat.
You can copy it from the Pull Secret section of https://console.redhat.com/openshift/create/local.
? Please enter the pull secret ************************

INFO Creating CRC VM for openshift 4.11.7...
INFO Generating new SSH key pair...
INFO Generating new password for the kubeadmin user
INFO Starting CRC VM for openshift 4.11.7...
INFO CRC instance is running with IP 127.0.0.1
INFO CRC VM is running
INFO Updating authorized keys...
INFO Configuring shared directories
INFO Check internal and public DNS query...
INFO Check DNS query from host...
INFO Verifying validity of the kubelet certificates...
INFO Starting kubelet service
INFO Waiting for kube-apiserver availability... [takes around 2min]
INFO Adding user's pull secret to the cluster...
INFO Updating SSH key to machine config resource...
INFO Waiting for user's pull secret part of instance disk...
INFO Changing the password for the kubeadmin user
INFO Updating cluster ID...
INFO Updating root CA cert to admin-kubeconfig-client-ca configmap...
INFO Starting openshift instance... [waiting for the cluster to stabilize]
INFO 4 operators are progressing: image-registry, network, openshift-controller-manager, service-ca
INFO 4 operators are progressing: image-registry, network, openshift-controller-manager, service-ca
INFO 4 operators are progressing: image-registry, network, openshift-controller-manager, service-ca
INFO 4 operators are progressing: image-registry, network, openshift-controller-manager, service-ca
INFO 4 operators are progressing: image-registry, network, openshift-controller-manager, service-ca
INFO 4 operators are progressing: image-registry, network, openshift-controller-manager, service-ca
INFO 2 operators are progressing: image-registry, network
INFO Operator image-registry is progressing
INFO All operators are available. Ensuring stability...
INFO Operators are stable (2/3)...
INFO Operators are stable (3/3)...
INFO Adding crc-admin and crc-developer contexts to kubeconfig...
Started the OpenShift cluster.

Are you also running a crc_vfkit_4.11.7_arm64 bundle?

mhcastro commented 1 year ago

Perhaps this also helps to understand the issue.

CRC is able to successfully pull images from "quay.io", but fails to pull from other registries, including "registry.redhat.io".

This is evidence of a successful pull in the same CRC.

Events:
  Type     Reason        Age                From               Message
  ----     ------        ----               ----               -------
  Normal   Scheduled     20d                default-scheduler  Successfully assigned openshift-multus/multus-98qmd to crc-m9jbq-master-0 by crc-m9jbq-bootstrap
  Normal   Pulling       20d                kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:050edf40d065b60c642be113092d2ebc157fcab62345324398bc81673794ecf7af"
  Normal   Pulled        20d                kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:050edf40d065b60c642be113092d2ebc157fcab6242342342c81673794ecf7af" in 2.7922208s

It is only working with quay.io.

cfergeau commented 1 year ago

What does registry.redhat.io resolve to in the VM? host registry.redhat.io

mhcastro commented 1 year ago

Can't get into the VM to check.

oc get nodes
NAME                 STATUS   ROLES           AGE   VERSION
crc-m9jbq-master-0   Ready    master,worker   20d   v1.24.0+3882f8f

$ oc debug node/crc-m9jbq-master-0
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Starting pod/crc-m9jbq-master-0-debug ...

To use host binaries, run `chroot /host`

warning: Container container-00 is unable to start due to an error: Back-off pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b303c164dcb97cff2e317f1f9a297ba42916858e498e8xxxx671171b103915da"

But I executed it from another running pod, e.g. "multus", which was able to pull from "quay.io":

$ oc exec -it multus-98qmd -n openshift-multus -- /bin/bash
[root@crc-m9jbq-master-0 /]# host quay.io
quay.io has address 50.19.184.112
quay.io has address 52.5.230.17
quay.io has address 52.70.198.51
quay.io has address 52.205.244.25
quay.io has address 75.101.245.134
quay.io has address 50.16.37.250

[root@crc-m9jbq-master-0 /]# host registry.redhat.io
registry.redhat.io has address 92.122.161.165

Then pinged from the "multus" pod...

[root@crc-m9jbq-master-0 /]# ping quay.io
PING quay.io (52.5.230.17) 56(84) bytes of data.
64 bytes from 52.5.230.17 (52.5.230.17): icmp_seq=1 ttl=64 time=0.365 ms
64 bytes from 52.5.230.17 (52.5.230.17): icmp_seq=2 ttl=64 time=0.770 ms
64 bytes from 52.5.230.17 (52.5.230.17): icmp_seq=3 ttl=64 time=0.489 ms

[root@crc-m9jbq-master-0 /]# ping registry.redhat.io
PING registry.redhat.io (92.122.161.165) 56(84) bytes of data.
64 bytes from 92.122.161.165 (92.122.161.165): icmp_seq=1 ttl=64 time=0.312 ms
64 bytes from 92.122.161.165 (92.122.161.165): icmp_seq=2 ttl=64 time=1.69 ms
64 bytes from 92.122.161.165 (92.122.161.165): icmp_seq=3 ttl=64 time=0.324 ms
64 bytes from 92.122.161.165 (92.122.161.165): icmp_seq=4 ttl=64 time=3.74 ms
mhcastro commented 1 year ago

Progress.

  1. This is how the openshift-marketplace namespace looked before the fix.
myuser@mymac ~ % oc get pods
NAME                                    READY   STATUS             RESTARTS   AGE
certified-operators-h2lr5               0/1     ImagePullBackOff   0          20d
certified-operators-zj64b               0/1     ImagePullBackOff   3          21d
community-operators-lwzds               0/1     ImagePullBackOff   0          20d
community-operators-tr5xs               0/1     ImagePullBackOff   3          21d
marketplace-operator-8485c7444b-95prb   1/1     Running            0          20d
redhat-marketplace-hgwv8                0/1     ImagePullBackOff   0          20d
redhat-marketplace-wqdpg                0/1     ImagePullBackOff   3          21d
redhat-operators-87h27                  0/1     ImagePullBackOff   0          5m14s
redhat-operators-pf5gn                  0/1     ImagePullBackOff   0          3m24s
  2. Then I completely cleared CRC.
$ crc delete -f
$ crc delete --clear-cache
$ crc cleanup
$ ps -ef | grep crc (kill any hanging pid)
$ crc setup
$ crc start 
  3. Collected the IPv4 nameserver from /etc/resolv.conf on my M1 Mac.

  4. SSH'ed into the VM, commented out any nameserver entries present in the file, and added the IPv4 address collected from my Mac to the VM's /etc/resolv.conf.

ssh -i .crc/machines/crc/id_ecdsa core@127.0.0.1 -p 2222

sudo vi /etc/resolv.conf

  5. Exited the VM and deleted the redhat-operators pod in openshift-marketplace. Then all the pods in this namespace started to be recreated. This is how it looks now.

Note: the connection to the cluster is lost while the marketplace-operator restarts, but it automatically reconnects.

oc get pods -n openshift-marketplace
NAME                                    READY   STATUS              RESTARTS      AGE
certified-operators-crkvd               0/1     ContainerCreating   0             20m
certified-operators-zj64b               1/1     Running             0             21d
community-operators-r89sz               0/1     ContainerCreating   0             20m
community-operators-tr5xs               1/1     Running             0             21d
marketplace-operator-8485c7444b-95prb   1/1     Running             2 (22m ago)   20d
redhat-marketplace-68l7p                0/1     ContainerCreating   0             20m
redhat-marketplace-wqdpg                1/1     Running             0             21d
redhat-operators-87h27                  1/1     Running             0             32m
redhat-operators-kxtfh                  0/1     ContainerCreating   0             15m

But now I have a different error:

Warning FailedCreatePodSandBox 38s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_certified-operators-pkjtz_openshift-marketplace_0133d97d-ac62-49ed-9111-c6552cd7a58b_0(a5d98752bca56ff26909a2f5abe94b3c4987def2ae8009cc4909433d906f4885): error adding pod openshift-marketplace_certified-operators-pkjtz to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): Multus: [openshift-marketplace/certified-operators-pkjtz/0133d97d-ac62-49ed-9111-c6552cd7a58b]: error getting pod: Get "https://[api-int.crc.testing]:6443/api/v1/namespaces/openshift-marketplace/pods/certified-operators-pkjtz?timeout=1m0s": dial tcp: lookup api-int.crc.testing on 192.168.0.1:53: read udp 192.168.127.2:34211->192.168.0.1:53: i/o timeout

I added api-int.crc.testing to my Mac's /etc/hosts and restarted the pods, with no success. It is awkward, as the other respective pod is running.

cfergeau commented 1 year ago

SSH'ed to the VM, commented any nameserver entry present in the file and added the IPV4 collected from my Mac into the VM /etc/resolv.conf.

Do you know what was wrong with the DNS the VM had?

mhcastro commented 1 year ago

It didn't include the actual DNS entry of the hosting machine.

mhcastro commented 1 year ago

@cfergeau, but now I have this issue with the CNI network. So I'm unsure if my DNS change in the VM fixed the issue or introduced another problem.

praveenkumar commented 1 year ago

@mhcastro The VM's /etc/resolv.conf has the user-mode network stack's nameserver, which is supposed to handle requests like api-int.crc.testing; you shouldn't remove those entries.
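A workaround consistent with that constraint would keep the user-stack resolver first and only append the host's resolver as a fallback. A sketch of the VM's /etc/resolv.conf under that assumption (the 192.168.0.1 address is hypothetical; use whatever the host actually uses):

```
# /etc/resolv.conf inside the CRC VM -- unverified sketch
search crc.testing
nameserver 192.168.127.1   # user-mode stack resolver; resolves *.crc.testing -- keep it
nameserver 192.168.0.1     # host LAN resolver as a fallback (hypothetical address)
```

With glibc's default resolver behavior, a query that times out against the first server is retried against the second, so cluster-internal names keep working while external lookups gain a fallback.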

mhcastro commented 1 year ago

Something is really wrong with the DNS setup in the crc VM, but I couldn't find a way to solve it permanently.

For quay.io it works just fine.

One additional piece of information:

From the CRC VM:

[core@crc-xxx-master-0 ~]$ host quay.io
quay.io has address 52.205.244.25
quay.io has address 3.230.30.87
quay.io has address 3.232.106.42
quay.io has address 18.215.228.51
quay.io has address 50.16.37.250
quay.io has address 52.70.198.51

[core@crc-xxx-master-0 ~]$ host registry.redhat.io
registry.redhat.io has address 92.122.161.165

From the host M1 Mac:

myuser@mym1mac ~ % host quay.io
quay.io has address 3.230.30.87
quay.io has address 3.232.106.42
quay.io has address 18.215.228.51
quay.io has address 50.16.37.250
quay.io has address 52.70.198.51
quay.io has address 52.205.244.25
quay.io has IPv6 address 2600:1f18:483:cf00:73b2:5198:e32b:d790
quay.io has IPv6 address 2600:1f18:483:cf00:f053:17f8:315:f7eb
quay.io has IPv6 address 2600:1f18:483:cf01:1e95:f72a:23b:23ff
quay.io has IPv6 address 2600:1f18:483:cf01:f33f:9173:c6d7:dcdd
quay.io has IPv6 address 2600:1f18:483:cf02:ab11:6617:79b0:cf33
quay.io has IPv6 address 2600:1f18:483:cf02:eec9:4ec2:8f41:f7a2
quay.io mail is handled by 1 us-smtp-inbound-1.mimecast.com.

myuser@mym1mac ~ % host registry.redhat.io
registry.redhat.io is an alias for registry.redhat.io.edgekey.net.
registry.redhat.io.edgekey.net is an alias for e14353.g.akamaiedge.net.
e14353.g.akamaiedge.net has address 92.122.161.165

It also fails for registry.docker.io.

From the CRC VM:

$ host registry.docker.io
registry.docker.io has address 44.205.64.79
registry.docker.io has address 3.216.34.172
registry.docker.io has address 34.205.13.154

From the host M1 Mac:

host registry.docker.io
registry.docker.io is an alias for registry-1.docker.io.
registry-1.docker.io has address 3.216.34.172
registry-1.docker.io has address 34.205.13.154
registry-1.docker.io has address 44.205.64.79

The CRC VM is not properly configured to handle aliases: it works for quay.io, which doesn't have an alias, and fails in the same way for both registry.redhat.io and registry.docker.io, which are aliases (CNAMEs).

whitfiea commented 1 year ago

@mhcastro @IBMRob I had this same issue on an M1 Max and resolved it by making sure 8.8.8.8 was at the top of the stack of DNS server entries in my Network preferences. I didn't restart anything else, and it started to pull from registry.redhat.io. I previously had 8.8.8.8 at the bottom of the stack, so it looks like one of the other internal IBM DNS servers (I see you are both from IBM) is getting in the way for some reason.
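For reference, that DNS ordering change can be scripted on macOS with networksetup ("Wi-Fi" is an assumed service name; check networksetup -listallnetworkservices for yours):

```shell
# Show the current DNS servers for the Wi-Fi service ("Wi-Fi" is assumed;
# list available services with: networksetup -listallnetworkservices).
networksetup -getdnsservers Wi-Fi
# Put 8.8.8.8 first, then re-add your previous entries after it.
sudo networksetup -setdnsservers Wi-Fi 8.8.8.8
```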

mhcastro commented 1 year ago

This can be closed. It is working now after upgrading to macOS Ventura 13.3 on the M1 chip, without any change to DNS.

oc get pods -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-zj64b               1/1     Running   0          194d
community-operators-tr5xs               1/1     Running   0          194d
marketplace-operator-8485c7444b-95prb   1/1     Running   0          193d
redhat-marketplace-wqdpg                1/1     Running   0          194d
redhat-operators-r7zk5                  1/1     Running   0          194d
praveenkumar commented 11 months ago

@IBMRob I am closing this; please create a new issue if the problem still persists with the latest version of OpenShift Local.