Closed: lavocatt closed this issue 7 months ago
Update: restarting the command successfully got past the "Waiting for kube-apiserver availability" step, but then failed afterward on "Waiting until the user's pull secret is written to the instance disk...".
Could it be linked to https://github.com/crc-org/crc/issues/4036?
DEBU Running SSH command: timeout 5s oc get secret --context admin --cluster crc --kubeconfig /opt/kubeconfig
DEBU SSH command results: err: <nil>, output: NAME TYPE DATA AGE
builder-dockercfg-z4hls kubernetes.io/dockercfg 1 13d
builder-token-58b2c kubernetes.io/service-account-token 4 14d
default-dockercfg-mwkzf kubernetes.io/dockercfg 1 13d
default-token-snsbf kubernetes.io/service-account-token 4 14d
deployer-dockercfg-qr92t kubernetes.io/dockercfg 1 13d
deployer-token-2wc2f kubernetes.io/service-account-token 4 14d
DEBU NAME TYPE DATA AGE
builder-dockercfg-z4hls kubernetes.io/dockercfg 1 13d
builder-token-58b2c kubernetes.io/service-account-token 4 14d
default-dockercfg-mwkzf kubernetes.io/dockercfg 1 13d
default-token-snsbf kubernetes.io/service-account-token 4 14d
deployer-dockercfg-qr92t kubernetes.io/dockercfg 1 13d
deployer-token-2wc2f kubernetes.io/service-account-token 4 14d
DEBU Running SSH command: <hidden>
DEBU SSH command succeeded
DEBU Waiting for availability of resource type 'machineconfigs'
DEBU retry loop: attempt 0
DEBU Running SSH command: timeout 5s oc get machineconfigs --context admin --cluster crc --kubeconfig /opt/kubeconfig
DEBU SSH command results: err: <nil>, output: NAME GENERATEDBYCONTROLLER IGNITIONVERSION AGE
00-master b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
00-worker b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
01-master-container-runtime b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
01-master-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
01-worker-container-runtime b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
01-worker-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
97-master-generated-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
97-worker-generated-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
98-master-generated-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
98-worker-generated-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
99-master-generated-registries b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
99-master-ssh 3.2.0 14d
99-node-sizing-for-crc 3.2.0 14d
99-openshift-machineconfig-master-dummy-networks 3.2.0 14d
99-worker-generated-registries b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
99-worker-ssh 3.2.0 14d
rendered-master-01227552b0ddd4e8983d4ee6c0c8dcef b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 13d
rendered-worker-ebd0dd73a3bb2c3974af877858d1f047 b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 13d
DEBU NAME GENERATEDBYCONTROLLER IGNITIONVERSION AGE
00-master b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
00-worker b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
01-master-container-runtime b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
01-master-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
01-worker-container-runtime b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
01-worker-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
97-master-generated-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
97-worker-generated-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
98-master-generated-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
98-worker-generated-kubelet b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
99-master-generated-registries b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
99-master-ssh 3.2.0 14d
99-node-sizing-for-crc 3.2.0 14d
99-openshift-machineconfig-master-dummy-networks 3.2.0 14d
99-worker-generated-registries b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 14d
99-worker-ssh 3.2.0 14d
rendered-master-01227552b0ddd4e8983d4ee6c0c8dcef b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 13d
rendered-worker-ebd0dd73a3bb2c3974af877858d1f047 b693d2ffbac7ddab712901662fabffe2fda81562 3.4.0 13d
DEBU Running SSH command: timeout 30s oc get machineconfigs 99-master-ssh -o jsonpath='{.spec.config.passwd.users[0].sshAuthorizedKeys[0]}' --context admin --cluster crc --kubeconfig /opt/kubeconfig
DEBU SSH command results: err: <nil>, output: ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAG+WEPCXfkqRWah/UPVqZx9DMMVGc9k99WO+hZ8lKQXUeKZ7DxES0PJs6SZquOGUBlUkvJ7o5H9Jn1PpuNJy6QMawAboAY6p+tq9nNwjGiVThXhqKxV+fwxCfh6LBPcdjtvuwbbrij6MtOoi0qGAUCIbRVexzD+UuhJEk+CrPEbi5XlbA==
INFO Waiting until the user's pull secret is written to the instance disk...
DEBU retry loop: attempt 0
DEBU Running SSH command: <hidden>
DEBU SSH command succeeded
DEBU error: Temporary error: pull secret not updated to disk - sleeping 2s
DEBU retry loop: attempt 1
DEBU Running SSH command: <hidden>
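For what it's worth, on OpenShift nodes the cluster pull secret normally ends up in /var/lib/kubelet/config.json, so (assuming that is the file this step waits on; I haven't checked crc's source) it can be watched for from the host:
$ ssh -i ~/.crc/machines/crc/id_ecdsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
      core@$(crc ip) sudo ls -l /var/lib/kubelet/config.json
# the file appearing (and being non-empty) is presumably the condition the retry loop above is waiting for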
After a discussion with @praveenkumar and @cfergeau, we found out that the issue was linked to DNS resolution. This is the working configuration:
crc config view
- consent-telemetry : no
- network-mode : user
- pull-secret-file : /home/tlavocat/Documents/pull-secret
Thanks for your help!
Would still be nice to fix the DNS resolution problem you were having :)
I can reopen the issue if you want
I'm getting this same error. I can't tell - is there a solution?
I do have these set, which I thought would fix it, but they do not:
crc config set network-mode user
crc config set host-network-access true
This is with v2.34.1
Would still be nice to fix the DNS resolution problem you were having :)
DNS resolution is a network setup issue, ... What was the DNS server set to? Any specifics would help to determine what causes the lookup failure.
Setting network-mode to user will force the user-mode network stack and resolve DNS on the host, but this does not guarantee success.
More information about the DNS server in use is necessary to determine the problem.
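For anyone gathering that information, something like this on the host should show which DNS servers are in play (a minimal sketch for a systemd/NetworkManager-based host such as Fedora; resolvectl and nmcli are assumed to be installed):
$ cat /etc/resolv.conf            # the resolver configuration applications actually see
$ readlink -f /etc/resolv.conf    # whether it is a symlink to the systemd-resolved stub
$ resolvectl status               # per-link DNS servers when systemd-resolved is active
$ nmcli dev show | grep -i dns    # DNS servers NetworkManager received per device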
Well, that's interesting. I turned off my corporate VPN and it started fine.
What would that indicate? Is there something I can do for this to work if I am on the VPN? I would like to remain on VPN if possible.
Note, when I just got this to work, I did have network-mode user and host-network-access true set explicitly (they must have been configured previously; I did not mean to set them for this run, but they were left over from a prior one). This is what crc config view shows:
$ crc config view
- consent-telemetry : no
- cpus : 6
- disable-update-check : true
- disk-size : 48
- host-network-access : true
- kubeadmin-password : kiali
- memory : 32768
- network-mode : user
I can try it without those settings just to see what happens. But at minimum, the above works for me when I am off the VPN.
OK, this is interesting. It is hung again. This is what I did.
I completely removed my $HOME/.crc directory.
I then started crc v2.34.1 (passing in my pull secret via the cmdline option).
I am NOT connected to the VPN.
My config is:
$ crc config view
- consent-telemetry : no
- cpus : 6
- disable-update-check : true
- disk-size : 48
- kubeadmin-password : kiali
- memory : 32768
- pull-secret-file : /home/jmazzite/bin/downloads/openshift-pull-secret.txt
So it seems like those network-mode/host-network-access settings had some positive effect when I am not on the VPN.
I am going to try this test again. I will completely remove $HOME/.crc again, only this time I will explicitly set those two network config options while remaining off the VPN. I'll see if that works.
(and I would like to repeat - none of these problems occur with v2.32.0 - so something changed between that version of CRC and v2.34.1 that is causing this problem)
It failed when I turned off those network settings, but now, even worse, after control-c'ing the hung crc process, I can't restart anything. I continue to get the following:
Is 'crc daemon' running? Cannot reach daemon API: Get "http://unix/api/version": dial unix /home/jmazzite/.crc/crc-http.sock: connect: no such file or directory
I see in my crc output logs:
DEBU Running 'systemctl --user status crc-daemon.service'
DEBU Command failed: exit status 3
DEBU stdout: * crc-daemon.service - CRC daemon
Loaded: loaded (/home/jmazzite/.config/systemd/user/crc-daemon.service; static)
Drop-In: /usr/lib/systemd/user/service.d
└─10-timeout-abort.conf
Active: inactive (dead)
I ran crc cleanup and I think it might be getting further along, but I can't tell. With debug on, I see this continually scrolling:
DEBU Running SSH command: timeout 5s oc get nodes --context admin --cluster crc --kubeconfig /opt/kubeconfig
DEBU SSH command results: err: Process exited with status 1, output:
DEBU E0426 16:51:24.979560 8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.980018 8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.981353 8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.981675 8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.982956 8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
Unable to connect to the server: dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
DEBU error: Temporary error: ssh command error:
command : timeout 5s oc get nodes --context admin --cluster crc --kubeconfig /opt/kubeconfig
err : Process exited with status 1
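To narrow down where the lookup dies, it may help to query the resolver from the error message directly (a sketch run from the host; dig comes from bind-utils, and 192.168.130.1 is simply the address shown in the errors above, i.e. the host side of the crc network):
$ getent hosts api.crc.testing                 # what the host's normal resolver chain returns
$ dig +short @192.168.130.1 api.crc.testing    # ask the resolver the VM is pointed at directly
If the second query comes back empty, the resolver on 192.168.130.1 never learned the *.crc.testing records, which matches the "no such host" errors above.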
I'm able to start again (but this time I went back to 2.32.0). I'm staying on that version for now :) I am on Fedora 39 (RedHat CSB machine), and I know one other person who sees the same problem as I do with 2.34.1 on the same operating system. I can likely help work on debugging this with someone on Monday, but for now I'm going to keep using 2.32.0 and skip the upgrade.
This is the log with the "hang" problem as it tries to store the pull secret.
If you want to see the full series of calls I made to set up crc, I wrap it in this hack script.
I was NOT on the VPN, and I did NOT have the network configs set.
$ crc config view
- consent-telemetry : no
- cpus : 6
- disable-update-check : true
- disk-size : 48
- kubeadmin-password : kiali
- memory : 32768
- pull-secret-file : /home/jmazzite/bin/downloads/openshift-pull-secret.txt
And here are the CRC logs from when I can start it successfully:
$ crc version
CRC version: 2.34.1+b470b5
OpenShift version: 4.15.3
Podman version: 4.4.4
$ oc get nodes
NAME STATUS ROLES AGE VERSION
crc Ready control-plane,master,worker 32d v1.28.7+6e2789b
$ oc whoami --show-server
https://api.crc.testing:6443
I was NOT on the VPN but I DID have the network configs set:
$ crc config view
- consent-telemetry : no
- cpus : 6
- disable-update-check : true
- disk-size : 48
- host-network-access : true
- kubeadmin-password : kiali
- memory : 32768
- network-mode : user
- pull-secret-file : /home/jmazzite/bin/downloads/openshift-pull-secret.txt
(There is one other less important thing that is interesting. Trying to connect via ssh to the CRC cluster doesn't work like it does when I can start it on 2.32. This command:
ssh -y -i ~/.crc/machines/crc/id_ecdsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@$(crc ip) ls
would normally just work without prompting for a password, but now it asks for one. The -i identity should let me in, though, so I don't know what is different. Maybe something changed in 2.34.1 wrt ssh access?)
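If it helps to debug the ssh part, forcing key-only authentication with verbose output should show why the key is being rejected (these are plain OpenSSH options, nothing crc-specific):
$ ssh -v -i ~/.crc/machines/crc/id_ecdsa \
      -o IdentitiesOnly=yes -o PreferredAuthentications=publickey \
      -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
      core@$(crc ip) ls
# PreferredAuthentications=publickey makes the connection fail instead of falling back
# to a password prompt, and -v shows whether the server ever accepted the offered key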
After investigating on Fedora-CSB laptops, it looks like the systemd-resolved service is running but is not consumed by NetworkManager, so crc's preflight checks add the dispatcher file, which doesn't work because /etc/resolv.conf is not managed by systemd-resolved; in that case DNS resolution inside the VM fails. We have two different workarounds (a quick check of which situation applies is sketched just below).
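To check whether your machine is in this state (a sketch using the standard systemd-resolved/NetworkManager paths; adjust for your distribution):
$ readlink -f /etc/resolv.conf
#   /run/systemd/resolve/stub-resolv.conf  -> managed by systemd-resolved
#   anything else, or a plain file         -> not managed by systemd-resolved
$ systemctl is-active systemd-resolved.service
$ grep -rs '^dns=' /etc/NetworkManager/NetworkManager.conf /etc/NetworkManager/conf.d/
#   shows whether NetworkManager has been told explicitly which DNS backend to use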
Workaround-1: Disable/mask the systemd-resolved service and follow these steps:
$ sudo systemctl stop systemd-resolved.service
$ sudo systemctl mask systemd-resolved.service
$ crc delete && crc cleanup && crc setup && crc start
Workaround-2: Make sure /etc/resolv.conf is managed by systemd-resolved:
$ sudo rm -fr /etc/resolv.conf
$ sudo ln -s ../run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
$ sudo systemctl restart NetworkManager
$ crc delete && crc cleanup && crc setup && crc start
I can confirm that Workaround-1 (disabling/masking systemd-resolved) works for me.
@praveenkumar just so I'm clear - the workarounds you mentioned, are those going to be the final solution for those with this problem? Or is the plan for CRC to change in a future version so the workarounds will not be needed?
The workaround does not apply to my setup, as I routinely disable systemd-resolved and run my own DNS server through docker.
[phantomjinx@microraptor:~/.crc/cache] 4 $ service systemd-resolved status
Redirecting to /bin/systemctl status systemd-resolved.service
○ systemd-resolved.service - Network Name Resolution
Loaded: loaded (/usr/lib/systemd/system/systemd-resolved.service; disabled; preset: enabled)
Drop-In: /usr/lib/systemd/system/service.d
└─10-timeout-abort.conf
Active: inactive (dead)
Docs: man:systemd-resolved.service(8)
man:org.freedesktop.resolve1(5)
https://www.freedesktop.org/wiki/Software/systemd/writing-network-configuration-managers
https://www.freedesktop.org/wiki/Software/systemd/writing-resolver-clients
see #4144 for details of my configuration.
By applying the workaround, I can execute OCP 4.15 on my Fedora 39, but as a side effect, I am unable to run a Minikube cluster:
😄 minikube v1.32.0 on Fedora 39
✨ Using the kvm2 driver based on existing profile
👍 Starting control plane node minikube in cluster minikube
🔄 Restarting existing kvm2 VM for "minikube" ...
:face_palm: StartHost failed, but will try again: driver start: ensuring active networks: starting network default: virError(Code=1, Domain=0, Message='internal error: Network is already in use by interface virbr0')
🔄 Restarting existing kvm2 VM for "minikube" ...
😿 Failed to start kvm2 VM. Running "minikube delete" may fix it: driver start: ensuring active networks: starting network default: virError(Code=1, Domain=0, Message='internal error: Network is already in use by interface virbr0')
❌ Exiting due to GUEST_PROVISION: error provisioning guest: Failed to start host: driver start: ensuring active networks: starting network default: virError(Code=1, Domain=0, Message='internal error: Network is already in use by interface virbr0')
If this happens, you have to unmask and then enable the systemd-resolved service again to fix the minikube issue:
sudo systemctl unmask systemd-resolved
sudo systemctl enable systemd-resolved
sudo systemctl start systemd-resolved
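If the virbr0 conflict shows up again, it can help to see which libvirt network currently owns the bridge before changing anything (standard libvirt/iproute2 tooling, nothing minikube-specific):
$ sudo virsh net-list --all    # libvirt networks (default, crc, ...) and whether they are active
$ ip link show virbr0          # whether the bridge interface already exists on the host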
Just want to close the loop on this - all my problems seem to have gone away with the new CRC v2.36.0. I can run CRC without messing with the systemd-resolved service, and I am still able to start minikube successfully (I no longer get the "Network is already in use by interface virbr0" error).
General information
Did you run crc setup before starting it (Yes/No)? Yes
CRC version
CRC status
CRC config
Host Operating System
Steps to reproduce
Expected
I expect a normal startup.
Actual
Logs
Before gathering the logs, try the following to see if it fixes your issue.
Please consider posting the output of crc start --log-level debug on http://gist.github.com/ and post the link in the issue.
See: https://gist.github.com/lavocatt/3f5d8fca0ac3ad52ddae982b906c3080