crc-org / crc

CRC is a tool to help you run containers. It manages a local OpenShift 4.x cluster, MicroShift, or a Podman VM, optimized for testing and development purposes.
https://crc.dev
Apache License 2.0

[BUG] crc start fails on `Waiting until the user's pull secret is written to the instance disk...` step #4110

Closed lavocatt closed 7 months ago

lavocatt commented 7 months ago

General information

CRC version

CRC version: 2.34.1+b470b5
OpenShift version: 4.15.3
Podman version: 4.4.4

CRC status

crc status --log-level debug
DEBU CRC version: 2.34.1+b470b5                   
DEBU OpenShift version: 4.15.3                    
DEBU Podman version: 4.4.4                        
DEBU Running 'crc status'                         
CRC VM:          Running
OpenShift:       Unreachable (v4.15.3)
RAM Usage:       1.932GB of 10.96GB
Disk Usage:      19.05GB of 32.68GB (Inside the CRC VM)
Cache Usage:     26.27GB
Cache Directory: /home/tlavocat/.crc/cache

CRC config

crc config view
- consent-telemetry                     : yes

Host Operating System

NAME="Fedora Linux"
VERSION="39 (Workstation Edition)"
ID=fedora
VERSION_ID=39
VERSION_CODENAME=""
PLATFORM_ID="platform:f39"
PRETTY_NAME="Fedora Linux 39 (Workstation Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:39"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f39/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=39
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=39
SUPPORT_END=2024-11-12
VARIANT="Workstation Edition"
VARIANT_ID=workstation

Steps to reproduce

  1. crc setup
  2. crc start -p ~/Documents/pull-secret

Expected

I expect a normal startup.

Actual

 Error waiting for apiserver: Temporary error: ssh command error:
command : timeout 5s oc get nodes --context admin --cluster crc --kubeconfig /opt/kubeconfig
err     : Process exited with status 1
 (x222)

Logs

Before gathering the logs, try the following to see if it fixes your issue:

$ crc delete -f
$ crc cleanup
$ crc setup
$ crc start --log-level debug

Please consider posting the output of crc start --log-level debug to https://gist.github.com/ and linking to it in the issue.

See: https://gist.github.com/lavocatt/3f5d8fca0ac3ad52ddae982b906c3080

lavocatt commented 7 months ago

Update: after restarting the command, it got past the Waiting for kube-apiserver availability step, but then failed on Waiting until the user's pull secret is written to the instance disk... afterward. Could it be linked to https://github.com/crc-org/crc/issues/4036?

lavocatt commented 7 months ago
DEBU Running SSH command: timeout 5s oc get secret --context admin --cluster crc --kubeconfig /opt/kubeconfig 
DEBU SSH command results: err: <nil>, output: NAME                       TYPE                                  DATA   AGE
builder-dockercfg-z4hls    kubernetes.io/dockercfg               1      13d
builder-token-58b2c        kubernetes.io/service-account-token   4      14d
default-dockercfg-mwkzf    kubernetes.io/dockercfg               1      13d
default-token-snsbf        kubernetes.io/service-account-token   4      14d
deployer-dockercfg-qr92t   kubernetes.io/dockercfg               1      13d
deployer-token-2wc2f       kubernetes.io/service-account-token   4      14d 
DEBU NAME                       TYPE                                  DATA   AGE
builder-dockercfg-z4hls    kubernetes.io/dockercfg               1      13d
builder-token-58b2c        kubernetes.io/service-account-token   4      14d
default-dockercfg-mwkzf    kubernetes.io/dockercfg               1      13d
default-token-snsbf        kubernetes.io/service-account-token   4      14d
deployer-dockercfg-qr92t   kubernetes.io/dockercfg               1      13d
deployer-token-2wc2f       kubernetes.io/service-account-token   4      14d 
DEBU Running SSH command: <hidden>                
DEBU SSH command succeeded                        
DEBU Waiting for availability of resource type 'machineconfigs' 
DEBU retry loop: attempt 0                        
DEBU Running SSH command: timeout 5s oc get machineconfigs --context admin --cluster crc --kubeconfig /opt/kubeconfig 
DEBU SSH command results: err: <nil>, output: NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
00-worker                                          b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-master-container-runtime                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-master-kubelet                                  b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-worker-container-runtime                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-worker-kubelet                                  b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
97-master-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
97-worker-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
98-master-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
98-worker-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-master-generated-registries                     b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-master-ssh                                                                                 3.2.0             14d
99-node-sizing-for-crc                                                                        3.2.0             14d
99-openshift-machineconfig-master-dummy-networks                                              3.2.0             14d
99-worker-generated-registries                     b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-worker-ssh                                                                                 3.2.0             14d
rendered-master-01227552b0ddd4e8983d4ee6c0c8dcef   b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             13d
rendered-worker-ebd0dd73a3bb2c3974af877858d1f047   b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             13d 
DEBU NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
00-worker                                          b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-master-container-runtime                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-master-kubelet                                  b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-worker-container-runtime                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-worker-kubelet                                  b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
97-master-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
97-worker-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
98-master-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
98-worker-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-master-generated-registries                     b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-master-ssh                                                                                 3.2.0             14d
99-node-sizing-for-crc                                                                        3.2.0             14d
99-openshift-machineconfig-master-dummy-networks                                              3.2.0             14d
99-worker-generated-registries                     b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-worker-ssh                                                                                 3.2.0             14d
rendered-master-01227552b0ddd4e8983d4ee6c0c8dcef   b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             13d
rendered-worker-ebd0dd73a3bb2c3974af877858d1f047   b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             13d 
DEBU Running SSH command: timeout 30s oc get machineconfigs 99-master-ssh -o jsonpath='{.spec.config.passwd.users[0].sshAuthorizedKeys[0]}' --context admin --cluster crc --kubeconfig /opt/kubeconfig 
DEBU SSH command results: err: <nil>, output: ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAG+WEPCXfkqRWah/UPVqZx9DMMVGc9k99WO+hZ8lKQXUeKZ7DxES0PJs6SZquOGUBlUkvJ7o5H9Jn1PpuNJy6QMawAboAY6p+tq9nNwjGiVThXhqKxV+fwxCfh6LBPcdjtvuwbbrij6MtOoi0qGAUCIbRVexzD+UuhJEk+CrPEbi5XlbA== 
INFO Waiting until the user's pull secret is written to the instance disk... 
DEBU retry loop: attempt 0                        
DEBU Running SSH command: <hidden>                
DEBU SSH command succeeded                        
DEBU error: Temporary error: pull secret not updated to disk - sleeping 2s 
DEBU retry loop: attempt 1                        
DEBU Running SSH command: <hidden>                
lavocatt commented 7 months ago

After discussing with @praveenkumar and @cfergeau, we found out that the issue was linked to DNS resolution. This is the working configuration:

crc config view
- consent-telemetry                     : no
- network-mode                          : user
- pull-secret-file                      : /home/tlavocat/Documents/pull-secret

Thanks for your help!

cfergeau commented 7 months ago

Would still be nice to fix the DNS resolution problem you were having :)

lavocatt commented 7 months ago

Would still be nice to fix the DNS resolution problem you were having :)

I can reopen the issue if you want

jmazzitelli commented 7 months ago

I'm getting this same error. I can't tell - is there a solution?

I do have these set, which I thought would do it, but they do not:

crc config set network-mode user
crc config set host-network-access true

This is with v2.34.1

gbraad commented 7 months ago

Would still be nice to fix the DNS resolution problem you were having :)

DNS resolution is a network setup issue... What was the DNS server set to? Any specifics would help to determine what causes the lookup failure.

network-mode user forces the user-mode network stack and resolves DNS on the host. This does not guarantee success.

More info about the DNS used is necessary to determine the problem.
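
For example, these standard commands (nothing CRC-specific; resolvectl only applies if systemd-resolved is in use) show which resolver the host is actually using:

$ ls -l /etc/resolv.conf        # the symlink target shows who manages the file
$ cat /etc/resolv.conf          # the nameserver entries currently in effect
$ resolvectl status             # per-link DNS servers, when systemd-resolved runs
$ nmcli dev show | grep DNS     # the DNS servers NetworkManager configured
$ dig api.crc.testing           # can the host resolve the CRC API endpoint?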

jmazzitelli commented 7 months ago

Well, that's interesting. I turned off my corporate VPN and it started fine.

What would that indicate? Is there something I can do for this to work if I am on the VPN? I would like to remain on VPN if possible.

Note: when I just got this to work, I did have network-mode user and host-network-access true set explicitly (they must have been configured previously; I did not mean to set them for this run, but I had set them for a prior one). This is what crc config view shows:

$ crc config view
- consent-telemetry                     : no
- cpus                                  : 6
- disable-update-check                  : true
- disk-size                             : 48
- host-network-access                   : true
- kubeadmin-password                    : kiali
- memory                                : 32768
- network-mode                          : user

I can try it without those settings just to see what happens. But at minimum, the above works for me when I am off the VPN.

jmazzitelli commented 7 months ago

OK, this is interesting. It is hung again. This is what I did.

I completely removed my $HOME/.crc directory. I then started crc v2.34.1 (passing in my pull secret via the cmdline option). I am NOT connected to the VPN. My config is:

$ crc config view
- consent-telemetry                     : no
- cpus                                  : 6
- disable-update-check                  : true
- disk-size                             : 48
- kubeadmin-password                    : kiali
- memory                                : 32768
- pull-secret-file                      : /home/jmazzite/bin/downloads/openshift-pull-secret.txt

So it seems like those network-mode/host-network-access settings had some positive effect when I am not on the VPN.

I am going to try this test again. I will completely remove $HOME/.crc again, only this time I will explicitly set those two network config options while remaining off the VPN. I'll see if that works.

(and I would like to repeat - none of these problems occur with v2.32.0 - so something changed between that version of CRC and v2.34.1 that is causing this problem)

jmazzitelli commented 7 months ago

It failed when I turned off those network settings, and now, even worse, after Ctrl-C'ing the hung crc process, I can't restart anything. I continue to get the following:

Is 'crc daemon' running? Cannot reach daemon API: Get "http://unix/api/version": dial unix /home/jmazzite/.crc/crc-http.sock: connect: no such file or directory

I see this in my crc output logs:

DEBU Running 'systemctl --user status crc-daemon.service' 
DEBU Command failed: exit status 3                
DEBU stdout: * crc-daemon.service - CRC daemon
     Loaded: loaded (/home/jmazzite/.config/systemd/user/crc-daemon.service; static)
    Drop-In: /usr/lib/systemd/user/service.d
             `-10-timeout-abort.conf
     Active: inactive (dead)
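
One thing that may be worth trying at this point (a guess based on the unit name in the status output above, not a documented recovery path; crc setup recreates these units anyway) is to start the user unit by hand and check whether the API socket reappears:

$ systemctl --user start crc-daemon.service
$ ls -l ~/.crc/crc-http.sock    # the socket the daemon API error says is missing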

I ran crc cleanup and I think it might be getting further along, but I can't tell. With debug on, I see this continually scrolling:

DEBU Running SSH command: timeout 5s oc get nodes --context admin --cluster crc --kubeconfig /opt/kubeconfig 
DEBU SSH command results: err: Process exited with status 1, output:  
DEBU E0426 16:51:24.979560    8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.980018    8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.981353    8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.981675    8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.982956    8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
Unable to connect to the server: dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host 
DEBU error: Temporary error: ssh command error:
command : timeout 5s oc get nodes --context admin --cluster crc --kubeconfig /opt/kubeconfig
err     : Process exited with status 1
jmazzitelli commented 7 months ago

I'm able to start again (but this time I went back to 2.32.0). I'm staying on that version for now :) I am on Fedora 39 (Red Hat CSB machine), and I know one other person who sees the same problem as I do with 2.34.1 on the same operating system. I can likely help debug this with someone on Monday, but for now I'm going to keep using 2.32.0 and skip upgrading.

jmazzitelli commented 7 months ago

This is the log with the "hang" problem as it tries to store the pull secret.

crc-problem.txt

If you want to see the full series of calls I make to set up crc, they are wrapped in this hack script.

I was NOT on the VPN, and I did NOT have the network configs set.

$ crc config view

- consent-telemetry                     : no
- cpus                                  : 6
- disable-update-check                  : true
- disk-size                             : 48
- kubeadmin-password                    : kiali
- memory                                : 32768
- pull-secret-file                      : /home/jmazzite/bin/downloads/openshift-pull-secret.txt
jmazzitelli commented 7 months ago

And here is the CRC logs when I can start it successfully:

crc-success.txt

$ crc version
CRC version: 2.34.1+b470b5
OpenShift version: 4.15.3
Podman version: 4.4.4

$ oc get nodes
NAME   STATUS   ROLES                         AGE   VERSION
crc    Ready    control-plane,master,worker   32d   v1.28.7+6e2789b

$ oc whoami --show-server
https://api.crc.testing:6443

I was NOT on the VPN but I DID have the network configs set:

$ crc config view
- consent-telemetry                     : no
- cpus                                  : 6
- disable-update-check                  : true
- disk-size                             : 48
- host-network-access                   : true
- kubeadmin-password                    : kiali
- memory                                : 32768
- network-mode                          : user
- pull-secret-file                      : /home/jmazzite/bin/downloads/openshift-pull-secret.txt

(There is one other, less important, thing that is interesting. Connecting via ssh to the CRC VM doesn't work like it does when I start it with 2.32. This command:

ssh -y -i ~/.crc/machines/crc/id_ecdsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@$(crc ip) ls

would normally just work without asking for a password, but now it prompts for one. The -i identity should let me in, though, so I don't know what is different. Maybe something changed in 2.34.1 w.r.t. ssh access?)
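
One way to see why the key is being refused (plain OpenSSH debugging, nothing CRC-specific) is to add -v to the same command and watch which identities and authentication methods are tried:

$ ssh -v -i ~/.crc/machines/crc/id_ecdsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@$(crc ip) ls
# look for "Offering public key" and the server's response in the debug output
# to see whether the ecdsa identity is offered and why it is not accepted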

praveenkumar commented 6 months ago

After investigating on Fedora-CSB laptops, it looks like the systemd-resolved service is running but not consumed by NetworkManager, so crc's preflight checks add the dispatcher file, which doesn't work since /etc/resolv.conf is not managed by systemd-resolved; in this case, DNS resolution inside the VM fails. We have 2 different workarounds (a quick check of which situation you are in is sketched below).
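
To check which situation a machine is in (a generic systemd check, nothing CRC-specific): /etc/resolv.conf is managed by systemd-resolved only when it is a symlink into its runtime directory:

$ ls -l /etc/resolv.conf
# managed:   /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
# unmanaged: a plain file, or a symlink to something else
$ systemctl is-active systemd-resolved.service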

Workaround-1: Disable/mask systemd-resolved and follow these steps:

$ sudo systemctl stop systemd-resolved.service
$ sudo systemctl mask systemd-resolved.service
$ crc delete && crc cleanup && crc setup && crc start

Workaround-2: Make sure /etc/resolv.conf is managed by systemd-resolved:

$ sudo rm -fr /etc/resolv.conf
$ sudo ln -s ../run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
$ sudo systemctl restart NetworkManager
$ crc delete && crc cleanup && crc setup && crc start
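
If you go with Workaround-2, you can verify the result before running crc again (assuming systemd-resolved came back up cleanly after the NetworkManager restart):

$ ls -l /etc/resolv.conf    # should now be the stub-resolv.conf symlink
$ resolvectl status         # should list your DNS servers per link
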
jmazzitelli commented 6 months ago

Workaround-1: Disable/mask systemd-resolved and follow these steps:

$ sudo systemctl stop systemd-resolved.service
$ sudo systemctl mask systemd-resolved.service
$ crc delete && crc cleanup && crc setup && crc start

I can confirm the above workaround works for me.

jmazzitelli commented 6 months ago

@praveenkumar just so I'm clear - are the workarounds you mentioned going to be the final solution for those with this problem? Or is the plan for CRC to change in a future version so that the workarounds will not be needed?

phantomjinx commented 6 months ago

The workaround does not apply to my setup, as I routinely disable systemd-resolved and run my own DNS server through Docker.

[phantomjinx@microraptor:~/.crc/cache] 4 $ service systemd-resolved status
Redirecting to /bin/systemctl status systemd-resolved.service
○ systemd-resolved.service - Network Name Resolution
     Loaded: loaded (/usr/lib/systemd/system/systemd-resolved.service; disabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: inactive (dead)
       Docs: man:systemd-resolved.service(8)
             man:org.freedesktop.resolve1(5)
             https://www.freedesktop.org/wiki/Software/systemd/writing-network-configuration-managers
             https://www.freedesktop.org/wiki/Software/systemd/writing-resolver-clients

See #4144 for details of my configuration.

ferhoyos commented 6 months ago

By applying the workaround, I can run OCP 4.15 on my Fedora 39, but as a side effect I am unable to run a Minikube cluster:

😄  minikube v1.32.0 on Fedora 39
✨  Using the kvm2 driver based on existing profile
👍  Starting control plane node minikube in cluster minikube
🔄  Restarting existing kvm2 VM for "minikube" ...
🤦  StartHost failed, but will try again: driver start: ensuring active networks: starting network default: virError(Code=1, Domain=0, Message='internal error: Network is already in use by interface virbr0')
🔄  Restarting existing kvm2 VM for "minikube" ...
😿  Failed to start kvm2 VM. Running "minikube delete" may fix it: driver start: ensuring active networks: starting network default: virError(Code=1, Domain=0, Message='internal error: Network is already in use by interface virbr0')
❌  Exiting due to GUEST_PROVISION: error provisioning guest: Failed to start host: driver start: ensuring active networks: starting network default: virError(Code=1, Domain=0, Message='internal error: Network is already in use by interface virbr0')

If this happens, you have to unmask and then re-enable the systemd-resolved service to fix the Minikube issue:

sudo systemctl unmask systemd-resolved
sudo systemctl enable systemd-resolved
sudo systemctl start systemd-resolved
jmazzitelli commented 6 months ago

Just want to close the loop on this - all my problems seem to have gone away with the new CRC v2.36.0. I can run CRC without messing with the systemd-resolved service, and I am still able to start minikube successfully (I no longer get the Network is already in use by interface virbr0 error).