crc-org / crc

CRC is a tool to help you run containers. It manages a local OpenShift 4.x cluster, MicroShift, or a Podman VM optimized for testing and development purposes
https://crc.dev
Apache License 2.0
1.25k stars 237 forks

[BUG][Ubuntu][Bare metal] OpenShift cluster becomes Unreachable #3794

Open dabcoder opened 1 year ago

dabcoder commented 1 year ago

General information

CRC version

CRC version: 2.20.0+f3a947

CRC status

DEBU CRC version: 2.20.0+f3a947
DEBU OpenShift version: 4.13.0
DEBU Podman version: 4.4.4
DEBU Running 'crc status'
CRC VM:          Running
OpenShift:       Unreachable (v4.13.0)
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     26.29GB
Cache Directory: /home/ubuntu/.crc/cache

CRC config

- consent-telemetry                     : no
- cpus                                  : 12
- memory                                : 27648
- network-mode                          : user

Host Operating System

Output of `cat /etc/os-release`:
PRETTY_NAME="Ubuntu 22.04.2 LTS"

Steps to reproduce

  1. Follow the instructions listed in https://github.com/crc-org/crc/wiki/Debugging-guide
  2. crc ip returns 127.0.0.1
  3. ssh -i ~/.crc/machines/crc/id_ecdsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@127.0.0.1 returns core@127.0.0.1: Permission denied (publickey).

Expected

SSHing into the VM should be possible. Could the failure be due to crc ip returning the address of the loopback interface, when it should return a 192.168.x.x address instead?

ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 74:56:3c:56:f1:1e brd ff:ff:ff:ff:ff:ff
    inet 162.19.88.183/24 metric 100 brd 162.19.88.255 scope global dynamic enp3s0f0
       valid_lft 62233sec preferred_lft 62233sec
    inet6 2001:41d0:306:2bb7::/56 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::7656:3cff:fe56:f11e/64 scope link
       valid_lft forever preferred_lft forever
4: enp3s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 74:56:3c:56:f1:1f brd ff:ff:ff:ff:ff:ff
5: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:d5:4d:5a brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever

Worth noting that what prompted me to try to SSH into the VM is the fact that the OpenShift cluster is appearing as Unreachable (I've read through this previous issue by the way: https://github.com/crc-org/crc/issues/1888).

Logs

Before gathering the logs, try the following and check whether it fixes your issue:

$ crc delete -f
$ crc cleanup
$ crc setup
$ crc start --log-level debug

The cluster starts without visible issues, without needing to delete the existing install.

cfergeau commented 1 year ago

With user-mode networking (crc config get network-mode), 127.0.0.1 is expected. In this configuration, the VM's SSH port is remapped to 2222; connecting to port 22 will reach the host's SSH daemon, not the VM.
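The remapping described above can be exercised directly; a minimal sketch, assuming the key path used earlier in this report and a sh-compatible shell:

```shell
# SSH into the CRC VM under user-mode networking: the guest's port 22 is
# forwarded to 127.0.0.1:2222 on the host. Key path as used in this thread.
CRC_KEY="$HOME/.crc/machines/crc/id_ecdsa"
SSH_CMD="ssh -i $CRC_KEY -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -p 2222 core@127.0.0.1"
echo "$SSH_CMD"
# Only attempt the connection when a crc machine actually exists here.
if [ -f "$CRC_KEY" ]; then
  $SSH_CMD true
fi
```

Connecting to port 22 instead would yield the `Permission denied (publickey)` seen in the report, since that reaches the host's own sshd.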

dabcoder commented 1 year ago

Thanks, I added port 2222 to the ssh config (which I was convinced I had already done, but apparently not). I am now able to SSH into the VM. For the record:

crc config get network-mode
network-mode : user
gbraad commented 1 year ago

@cfergeau on a side note, does it make sense to add documentation describing this usage with an ssh config entry, and perhaps even set the correct key (or add configuration to use the user's ssh key)? WDYT?

cfergeau commented 1 year ago

This is already documented in https://github.com/crc-org/crc/wiki/Debugging-guide

dabcoder commented 1 year ago

Running into another issue now:

crc status
CRC VM:          Running
OpenShift:       Unreachable (v4.13.0)
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     26.29GB
Cache Directory: /home/ubuntu/.crc/cache

So the VM is Running, but I am getting a ssh: connect to host 127.0.0.1 port 2222: Connection refused error. Verbose output:

OpenSSH_8.9p1 Ubuntu-3ubuntu0.3, OpenSSL 3.0.2 15 Mar 2022
debug1: Reading configuration data /home/ubuntu/.ssh/config
debug1: /home/ubuntu/.ssh/config line 1: Applying options for crc
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug2: resolve_canonicalize: hostname 127.0.0.1 is address
debug3: ssh_connect_direct: entering
debug1: Connecting to 127.0.0.1 [127.0.0.1] port 2222.
debug3: set_sock_tos: set socket 3 IP_TOS 0x10
debug1: connect to address 127.0.0.1 port 2222: Connection refused
ssh: connect to host 127.0.0.1 port 2222: Connection refused

Here's the ssh config:

Host crc
    Hostname 127.0.0.1
    User core
    Port 2222
    IdentityFile ~/.crc/machines/crc/id_ecdsa
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null

sudo lsof -i:2222 returns no output. Have I missed anything here?

gbraad commented 1 year ago

It shows "unreachable" so something is afoot/weird; the status command itself uses SSH to check the disk usage. This returns 0, so it feels like something crashed inside the VM (or the daemon process). Can you check what the console says from the hypervisor management tools?

gbraad commented 1 year ago

https://github.com/crc-org/crc/wiki/Debugging-guide

This is the wiki, not documentation.

dabcoder commented 1 year ago

It shows "unreachable" so something is afoot/weird; the status command itself uses SSH to check the disk usage. This returns 0, so it feels like something crashed inside the VM (or the daemon process). Can you check what the console says from the hypervisor management tools?

I am seeing frequent occurrences of these log lines produced by the libvirtd.service systemd unit:

Aug 22 13:01:18 ns3220936 libvirtd[226934]: 226934: error : virNetSocketReadWire:1793 : End of file while reading data: Input/output error

However, they also occur when the OpenShift cluster is in a Running state, so I don't think this is the root cause of the issue. Since we're using QEMU/libvirt on this host, is there anything else I could check?

cfergeau commented 1 year ago

sudo virsh list --all will show the state of the crc VM. At this point, I'd try crc stop; crc start to see if it helps, or crc delete; crc start if you don't mind the cluster data loss. Another thing to try is to kill the crc daemon, and stop/start crc.
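The recovery options above can be tried in escalating order; a sketch, with the destructive variant left commented out (the `crc-daemon.service` unit name is the one that appears later in this thread):

```shell
# Escalating recovery steps for an Unreachable cluster, per the advice above.
if command -v virsh >/dev/null 2>&1; then
  sudo -n virsh list --all || true        # confirm the 'crc' domain state
fi
if command -v crc >/dev/null 2>&1; then
  systemctl --user stop crc-daemon.service 2>/dev/null || true  # kill the daemon
  crc stop && crc start                   # restart, keeping cluster data
  # crc delete -f && crc start            # last resort: discards cluster data
fi
ok=1
```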

gbraad commented 1 year ago

Is the crc daemon process still active? Because if that disappeared, the VM becomes unreachable.

gbraad commented 1 year ago

When the daemon process is 'stopped' on Windows, it responds back with:

PS> crc status
crc does not seem to be setup correctly, have you run 'crc setup'?

but when the process is restarted it shows what you show us:

PS> crc status
CRC VM:          Running
MicroShift:      Running (v4.13.0)
RAM Usage:       1.139GB of 3.902GB
Disk Usage:      8.216GB of 16.1GB (Inside the CRC VM)
Cache Usage:     20.72GB
Cache Directory: C:\Users\gbraad\.crc\cache
PS> crc status
crc does not seem to be setup correctly, have you run 'crc setup'?
PS> crc setup
INFO Using bundle path C:\Users\gbraad\.crc\cache\crc_microshift_hyperv_4.13.0_amd64.crcbundle
INFO Checking minimum RAM requirements
INFO Checking if current user is in crc-users and Hyper-V admins group
INFO Checking if CRC bundle is extracted in '$HOME/.crc'
INFO Checking if C:\Users\gbraad\.crc\cache\crc_microshift_hyperv_4.13.0_amd64.crcbundle exists
INFO Checking if the daemon task is installed
INFO Checking if the daemon task is running
INFO Running the daemon task
INFO Checking admin helper service is running
Your system is correctly setup for using CRC. Use 'crc start' to start the instance
PS> ssh core@localhost -p 2222 -I C:\Users\gbraad\.crc\machines\crc\id_ecdsa.pub
ssh: connect to host localhost port 2222: Connection refused
PS> crc status
CRC VM:          Running
MicroShift:      Unreachable (v4.13.0)
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     20.72GB
Cache Directory: C:\Users\gbraad\.crc\cache

Note: while tested on Windows, this should have similar behaviour on Linux

gbraad commented 1 year ago

On Linux the daemon is automagically restarted through socket activation. This means that when the process is gone (crashed), we just restart it and provide a new instance. However, in that case no forwards have been set up; that state has been lost, as we do not record it for re-establishment.
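The daemon's liveness can be probed straight through the socket-activated HTTP socket; a sketch, using the socket path from the unit files and the /api/status endpoint visible in the service logs elsewhere in this thread:

```shell
# Probe the socket-activated daemon via its unix socket. Hitting the socket
# will (re)start the service, but previously lost port forwards stay lost.
SOCK="$HOME/.crc/crc-http.sock"
if [ -S "$SOCK" ] && command -v curl >/dev/null 2>&1; then
  curl --silent --unix-socket "$SOCK" http://localhost/api/status
else
  echo "daemon socket not present: $SOCK"
fi
```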


Functionality can be restored by:

PS> crc stop
PS> crc start
PS> crc stop
INFO Stopping the instance, this may take a few minutes...
Stopped the instance
PS> crc start
WARN A new version (2.25.0) has been published on https://developers.redhat.com/content-gateway/file/pub/openshift-v4/clients/crc/2.25.0/crc-windows-installer.zip
INFO Checking minimum RAM requirements
INFO Checking if running in a shell with administrator rights
INFO Checking Windows release
INFO Checking Windows edition
INFO Checking if Hyper-V is installed and operational
INFO Checking if Hyper-V service is enabled
INFO Checking if crc-users group exists
INFO Checking if current user is in crc-users and Hyper-V admins group
INFO Checking if vsock is correctly configured
INFO Checking if the daemon task is installed
INFO Checking if the daemon task is running
INFO Checking admin helper service is running
INFO Loading bundle: crc_microshift_hyperv_4.13.0_amd64...
INFO Starting CRC VM for microshift 4.13.0...
INFO CRC instance is running with IP 127.0.0.1
INFO CRC VM is running
INFO Check internal and public DNS query...
INFO Check DNS query from host...
INFO Starting Microshift service... [takes around 1min]
INFO Waiting for kube-apiserver availability... [takes around 2min]
INFO Adding microshift context to kubeconfig...
Started the MicroShift cluster.

Use the 'oc' command line interface:
  PS> & crc oc-env | Invoke-Expression
  PS> oc COMMAND

PS> ssh core@localhost -p 2222 -I C:\Users\gbraad\.crc\machines\crc\id_ecdsa.pub
dlopen C:\\Users\\gbraad\\.crc\\machines\\crc\\id_ecdsa.pub failed: Access is denied.
core@localhost's password:   <!---- responds

I would like to understand why the daemon stopped working/restarted before we address the 'self-healing' (as that would hide the issue instead of addressing it directly).

dabcoder commented 1 year ago

Thanks for following up on that one. So the cluster has become Unreachable again, not sure exactly when that happened:

crc status --log-level debug
DEBU CRC version: 2.20.0+f3a947
DEBU OpenShift version: 4.13.0
DEBU Podman version: 4.4.4
DEBU Running 'crc status'
CRC VM:          Running
OpenShift:       Unreachable (v4.13.0)
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     26.29GB
Cache Directory: /home/ubuntu/.crc/cache

The virsh output shows that the crc VM is running:

sudo virsh list --all
 Id   Name   State
----------------------
 1    crc    running

I can also confirm that the cluster is back to a Running state if I run crc stop && crc start. I am also still trying to figure out why the disk usage shows as 0B of 0B, but I haven't found any additional hints at this point.

praveenkumar commented 1 year ago

why the disk usage shows as 0B of 0B but I haven't found any additional hints at this point

@dabcoder Disk usage is reported from the VM's perspective, which requires the SSH connection to be working; when the cluster is in a Running state it will also show the correct disk usage.
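The disk-usage probe can be reproduced by hand; a sketch using the exact df command that appears in the crcd.log excerpt elsewhere in this thread, and SSH details that assume user-mode networking:

```shell
# Reproduce crc's disk-usage probe manually over SSH. The command string is
# the one visible in crcd.log; a refused connection here is exactly why
# 'crc status' falls back to reporting 0B of 0B.
DF_CMD='df -B1 --output=size,used,target /sysroot | tail -1'
KEY="$HOME/.crc/machines/crc/id_ecdsa"
if [ -f "$KEY" ]; then
  ssh -i "$KEY" -p 2222 -o StrictHostKeyChecking=no \
      -o UserKnownHostsFile=/dev/null core@127.0.0.1 "$DF_CMD"
fi
```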

dabcoder commented 1 year ago

... when it is in Running state it will also show correct disk usage.

EDIT: didn't mean to close this issue. And yes indeed, it does show the current disk usage when the cluster is back in a Running state:

CRC VM:          Running
OpenShift:       Running (v4.13.0)
RAM Usage:       11.03GB of 30.79GB
Disk Usage:      27.53GB of 32.68GB (Inside the CRC VM)
Cache Usage:     26.29GB
Cache Directory: /home/ubuntu/.crc/cache
cfergeau commented 1 year ago

Could you check the daemon state when this happens? systemctl status --user crc-daemon.service

gbraad commented 1 year ago

This confirms it is related to the daemon process crashing or stopping.

Is there anything in the journalctl we can check, or the crcd.log ? Or systemctl --user show crc-daemon.service -p NRestarts to see the restart count.

dabcoder commented 1 year ago

OK, so the daemon seems to have crashed:

crc status
CRC VM:          Running
OpenShift:       Unreachable (v4.13.0)
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     26.29GB
Cache Directory: /home/ubuntu/.crc/cache

VM still marked as running:

sudo virsh list --all
 Id   Name   State
----------------------
 2    crc    running

The daemon isn't restarting apparently:

systemctl --user show crc-daemon.service -p NRestarts
NRestarts=0

The daemon service is still running, though we can see some errors:

systemctl status --user crc-daemon.service
● crc-daemon.service - CRC daemon
     Loaded: loaded (/home/ubuntu/.config/systemd/user/crc-daemon.service; static)
     Active: active (running) since Mon 2023-08-28 08:15:17 UTC; 1min 57s ago
...
Aug 28 08:15:24 ns3220936 crc[294020]: W0828 08:15:24.584550  294020 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Aug 28 08:15:24 ns3220936 crc[294020]: W0828 08:15:24.761787  294020 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Aug 28 08:15:30 ns3220936 crc[294020]: @ - - [28/Aug/2023:08:15:30 +0000] "GET /api/status HTTP/1.1" 200 154

Perhaps some interesting hints in the crcd.log file?

time="2023-08-28T08:15:30Z" level=debug msg="Cannot get root partition usage: ssh command error:\ncommand : df -B1 --output=size,used,target /sysroot | tail -1\nerr     : dial tcp 127.0.0.1:2222: connect: connection refused\n"
time="2023-08-28T08:15:30Z" level=debug msg="Cannot get RAM usage: ssh command error:\ncommand : awk '/^Mem/ {print $2,$3}' <(free -b)\nerr     : dial tcp 127.0.0.1:2222: connect: connection refused\n"

Hope that it's useful. Thanks.

gbraad commented 1 year ago

NRestarts=0

Either the service was 'stopped' (e.g. socket activation has a timeout), or a crash does not get recorded as a restart under socket activation:

As it looks clear that

     Active: active (running) since Mon 2023-08-28 08:15:17 UTC; 1min 57s ago

... with an uptime of mere minutes, the cluster could not have fully started otherwise; this service was restarted.


The error messages describe the effects: 'connection refused' makes sense when there is no forwarding anymore.

dabcoder commented 1 year ago

Either the service was 'stopped' (like socket activation has a timeout) or a crash does not cause a socket activation to record this as a restart

So this is what's set in the daemon service systemd unit file:

cat /home/ubuntu/.config/systemd/user/crc-daemon.service

[Unit]
Description=CRC daemon
Requires=crc-http.socket
Requires=crc-vsock.socket

[Service]
# This allows systemd to know when startup is not complete (for example, because of a preflight failure)
# daemon.SdNotify(false, daemon.SdNotifyReady) must be called before the startup is successful
Type=notify
ExecStart=/home/ubuntu/.crc/bin/crc daemon

And this is what we have for each unit: crc-http.socket and crc-vsock.socket :

cat /home/ubuntu/.config/systemd/user/crc-http.socket
[Unit]
Description=CRC HTTP socket

[Socket]
ListenStream=%h/.crc/crc-http.sock
Service=crc-daemon.service

[Install]
WantedBy=default.target
cat /home/ubuntu/.config/systemd/user/crc-vsock.socket
[Unit]
Description=CRC vsock socket

[Socket]
ListenStream=vsock::1024
Service=crc-daemon.service

[Install]
WantedBy=default.target

Is there anything specific we could do there to avoid the daemon being stopped?
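One hedged mitigation (an assumption on my part, not something the project ships): a user-level systemd drop-in asking for an automatic restart on failure. Per the earlier comments, this would not restore lost port forwards by itself; it only brings the process back, and it masks rather than fixes the crash.

```shell
# Hypothetical drop-in: restart crc-daemon.service automatically on failure.
# Note: lost port forwards are NOT restored by a bare restart.
DROPIN='[Service]
Restart=on-failure
RestartSec=2'
UNIT_DIR="$HOME/.config/systemd/user/crc-daemon.service.d"
if [ -d "$HOME/.config/systemd/user" ] && command -v systemctl >/dev/null 2>&1; then
  mkdir -p "$UNIT_DIR"
  printf '%s\n' "$DROPIN" > "$UNIT_DIR/10-restart.conf"
  systemctl --user daemon-reload
fi
```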

praveenkumar commented 1 year ago

@dabcoder what version of systemd does Ubuntu have? Also, can you provide the journal logs for the service: journalctl --user -u crc-daemon?

dabcoder commented 1 year ago

@dabcoder what version of systemd does Ubuntu have? Also, can you provide the journal logs for the service: journalctl --user -u crc-daemon?

systemd --version
systemd 249 (249.11-0ubuntu3.7)

journalctl logs attached (last 500 lines). crc-daemon.log

praveenkumar commented 1 year ago

@dabcoder As per the daemon logs, there are continuous "endpoint not connected" warnings coming from https://github.com/google/gvisor/blob/master/pkg/tcpip/errors.go#L448; I will inspect what is happening and ask for more info.

Sep 04 13:20:53 ns3220936 crc[356229]: level=info msg="new connection from vm(3):2383898004 to vm(4294967295):1024"
Sep 04 13:21:56 ns3220936 crc[356229]: W0904 13:21:56.762835  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:56 ns3220936 crc[356229]: W0904 13:21:56.853472  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.309752  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.346149  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.804982  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.831189  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.850099  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.921616  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:58 ns3220936 crc[356229]: W0904 13:21:58.009339  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
gbraad commented 1 year ago

The error messages describe the effects: 'connection refused' makes sense when there is no forwarding anymore. https://github.com/crc-org/crc/issues/3794#issuecomment-1698522621

We do not yet understand why the daemon restarts/crashes. This is what we need to figure out. It is not systemd-version related, as the behaviour of the service is as expected: it (re)starts on a socket request.

endpoint not connected

This is understandable as the forward is gone... and every status call will do this (as you can see, every 2 secs).

You also run Podman Desktop with the extension, right?

cfergeau commented 1 year ago

It would be interesting to see if there are logs from around the time it crashed, though there also seem to be 'legitimate' restarts of the daemon. Can you try something like journalctl --user -u crc-daemon.service | grep -B 50 "Starting CRC daemon"?

gbraad commented 1 year ago

Reading part of the log does raise eyebrows:

Aug 30 16:04:38 ns3220936 crc[327745]: W0830 16:04:38.708015  327745 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Aug 30 16:12:56 ns3220936 systemd[327429]: Stopping CRC daemon...
Aug 30 16:12:56 ns3220936 systemd[327429]: Stopped CRC daemon.
Aug 30 16:12:56 ns3220936 systemd[327429]: crc-daemon.service: Consumed 2min 6.169s CPU time.
Sep 04 13:20:53 ns3220936 systemd[356017]: Starting CRC daemon...
Sep 04 13:20:53 ns3220936 crc[356229]: level=info msg="using socket provided by crc-vsock.socket"
Sep 04 13:20:53 ns3220936 crc[356229]: level=info msg="using socket provided by crc-http.socket"
Sep 04 13:20:53 ns3220936 systemd[356017]: Started CRC daemon.
Sep 04 13:20:53 ns3220936 crc[356229]: level=info msg="new connection from vm(3):2383898004 to vm(4294967295):1024"
Sep 04 13:21:56 ns3220936 crc[356229]: W0904 13:21:56.762835  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected

It shows the errors and a clean 'stopped' about 8 minutes later. Might this be user interaction?

But the one after the restart fails very quickly, like within a minute?

dabcoder commented 1 year ago

You also run podman desktop with the extension, right?

We're using a recent version of podman for Ubuntu, which we installed following these instructions: https://podman.io/docs/installation#ubuntu (so not Podman Desktop).

Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.18.1
Built:        Thu Jan  1 00:00:00 1970
OS/Arch:      linux/amd64

Can you try something like journalctl --user -u crc-daemon.service |grep -B 50 "Starting CRC daemon"?

Just produced the output and we can indeed see:

Sep 04 17:24:51 ns3220936 crc[356229]: W0904 17:24:51.075500  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 17:24:57 ns3220936 crc[356229]: W0904 17:24:57.018924  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 17:24:57 ns3220936 crc[356229]: W0904 17:24:57.449783  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 17:41:06 ns3220936 crc[356229]: W0904 17:41:06.743005  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 17:41:07 ns3220936 crc[356229]: W0904 17:41:07.262890  356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 17:44:55 ns3220936 systemd[356017]: Stopping CRC daemon...
Sep 04 17:44:55 ns3220936 systemd[356017]: Stopped CRC daemon.
Sep 04 17:44:55 ns3220936 systemd[356017]: crc-daemon.service: Consumed 1min 31.444s CPU time.
Sep 06 08:17:01 ns3220936 systemd[376139]: Starting CRC daemon...

This "Stopping CRC daemon" action on the 4th of September at 5.44pm UTC wasn't manual or user-initiated. It seems the daemon stopped and only restarted on the 6th of September (without user interaction, since I didn't start it earlier today).

dabcoder commented 1 year ago

it shows the errors and a clean 'stopped' about 8 mins later. This might be user interaction?

Aug 30 16:12:56 ns3220936 systemd[327429]: Stopping CRC daemon...
Aug 30 16:12:56 ns3220936 systemd[327429]: Stopped CRC daemon.
Aug 30 16:12:56 ns3220936 systemd[327429]: crc-daemon.service: Consumed 2min 6.169s CPU time.
Sep 04 13:20:53 ns3220936 systemd[356017]: Starting CRC daemon...

Those ones were user related, yes, manual stop and start.

But the one after the restart fails very quickly, like within a minute?

Unfortunately, I didn't check the OpenShift state at that time, i.e. whether it was Running or Unreachable.

praveenkumar commented 1 year ago

Is there any Ubuntu security mechanism that could be causing the daemon failure or the endpoint connection issue? Tomorrow I will try to set up an Ubuntu system and see if I am able to reproduce this.

dabcoder commented 1 year ago

Is there any Ubuntu security mechanism that could be causing the daemon failure or the endpoint connection issue? Tomorrow I will try to set up an Ubuntu system and see if I am able to reproduce this.

No I don't think so. I couldn't find anything that would cause this or interfere with any (local) connection.

dabcoder commented 1 year ago

For the record, we're spinning up crc on another VM/instance (CentOS this time). I will report findings tomorrow (whether we notice the same crashes/unavailability of the cluster).

dabcoder commented 1 year ago

... and so far the OpenShift cluster seems stable on that CentOS host. Except that crc isn't reporting any status:

 crc status --log-level debug
DEBU CRC version: 2.26.0+233df0
DEBU OpenShift version: 4.13.9
DEBU Podman version: 4.4.4
DEBU Running 'crc status'
crc does not seem to be setup correctly, have you run 'crc setup'?

Yet I did; I then also ran crc start to bootstrap the cluster, the Kubernetes API is responding, and I'm able to use the oc CLI to query it as expected. So that is a different issue.

So on the Ubuntu host we have:

virsh version
Compiled against library: libvirt 8.0.0
Using library: libvirt 8.0.0
Using API: QEMU 8.0.0
Running hypervisor: QEMU 6.2.0

And on the CentOS:

virsh version
Compiled against library: libvirt 9.5.0
Using library: libvirt 9.5.0
Using API: QEMU 9.5.0
Running hypervisor: QEMU 8.0.0

(a more recent libvirt version as well as QEMU API). Might that play a role in the cluster instability issue?

gbraad commented 1 year ago

We have seen some inconsistencies before with Ubuntu and our cluster, but that concerned a Ryzen/Epyc CPU.


xiormeesh commented 4 months ago

Same thing trying out crc on Ubuntu 22.04.4 while creating the instance:

INFO Starting CRC VM for openshift 4.15.10...     
INFO CRC instance is running with IP 192.168.130.11 
INFO CRC VM is running                            
INFO Updating authorized keys...                  
INFO Configuring shared directories               
INFO Check internal and public DNS query...       
INFO Check DNS query from host...                 
WARN Wildcard DNS resolution for apps-crc.testing does not appear to be working 
INFO Verifying validity of the kubelet certificates... 
INFO Starting kubelet service                     
INFO Waiting for kube-apiserver availability... [takes around 2min] 
Error waiting for apiserver: Temporary error: ssh command error:
command : timeout 5s oc get nodes --context admin --cluster crc --kubeconfig /opt/kubeconfig
err     : Process exited with status 1
 (x221)

When checking the status, it shows Running but Unreachable:

crc status --log-level debug
DEBU CRC version: 2.35.0+3956e8                   
DEBU OpenShift version: 4.15.10                   
DEBU Podman version: 4.4.4                        
DEBU Running 'crc status'                         
CRC VM:          Running
OpenShift:       Unreachable (v4.15.10)
RAM Usage:       2.12GB of 14.66GB
Disk Usage:      19.89GB of 32.68GB (Inside the CRC VM)
Cache Usage:     27.27GB

In crc.log:

time="2024-05-14T13:00:32+02:00" level=debug msg="retry loop: attempt 220"
time="2024-05-14T13:00:32+02:00" level=debug msg="Running SSH command: timeout 5s oc get nodes --context admin --cluster crc --kubeconfig /opt/kubeconfig"
time="2024-05-14T13:00:32+02:00" level=debug msg="SSH command results: err: Process exited with status 1, output: "
time="2024-05-14T13:00:32+02:00" level=debug msg="E0514 11:00:43.534563    9219 memcache.go:265] couldn't get current server API group list: Get \"https://api.crc.testing:6443/api?timeout=32s\": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host\nE0514 11:00:43.534956    9219 memcache.go:265] couldn't get current server API group list: Get \"https://api.crc.testing:6443/api?timeout=32s\": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host\nE0514 11:00:43.536383    9219 memcache.go:265] couldn't get current server API group list: Get \"https://api.crc.testing:6443/api?timeout=32s\": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host\nE0514 11:00:43.536667    9219 memcache.go:265] couldn't get current server API group list: Get \"https://api.crc.testing:6443/api?timeout=32s\": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host\nE0514 11:00:43.538035    9219 memcache.go:265] couldn't get current server API group list: Get \"https://api.crc.testing:6443/api?timeout=32s\": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host\nUnable to connect to the server: dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host\n"
time="2024-05-14T13:00:32+02:00" level=debug msg="error: Temporary error: ssh command error:\ncommand : timeout 5s oc get nodes --context admin --cluster crc --kubeconfig /opt/kubeconfig\nerr     : Process exited with status 1\n - sleeping 1s"

When SSHing into the VM:

[systemd]
Failed Units: 1
  qemu-guest-agent.service

Trying to run a kubectl command:

KUBECONFIG=/opt/kubeconfig kubectl get co
E0514 11:58:40.523111   11099 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0514 11:58:40.524245   11099 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0514 11:58:40.525767   11099 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0514 11:58:40.526177   11099 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0514 11:58:40.527603   11099 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
Unable to connect to the server: dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host

api.crc.testing resolves fine from the host but not from within the crc instance:

dig -ta api.crc.testing @192.168.130.11 +short
192.168.130.11
[core@crc ~]$ ping api.crc.testing
ping: api.crc.testing: Name or service not known
[core@crc ~]$ ping google.com
PING google.com (172.217.18.14) 56(84) bytes of data.
64 bytes from fra02s19-in-f14.1e100.net (172.217.18.14): icmp_seq=1 ttl=113 time=39.1 ms
64 bytes from fra02s19-in-f14.1e100.net (172.217.18.14): icmp_seq=2 ttl=113 time=55.1 ms
[core@crc ~]$ cat /etc/resolv.conf 
# Generated by NetworkManager
nameserver 192.168.130.1
[core@crc ~]$ dig -ta api.crc.testing

; <<>> DiG 9.16.23-RH <<>> -ta api.crc.testing
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 30775
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;api.crc.testing.       IN  A

;; AUTHORITY SECTION:
.           9866    IN  SOA a.root-servers.net. nstld.verisign-grs.com. 2024051400 1800 900 604800 86400

;; Query time: 2 msec
;; SERVER: 192.168.130.1#53(192.168.130.1)
;; WHEN: Tue May 14 12:07:49 UTC 2024
;; MSG SIZE  rcvd: 119
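The asymmetry above (the resolver at the VM's IP answers for api.crc.testing, while the gateway resolver the guest's /etc/resolv.conf points at returns NXDOMAIN) can be narrowed down by querying both resolvers explicitly; a sketch, with IPs taken from this report:

```shell
# Compare DNS answers for api.crc.testing from the two resolvers involved.
# IPs copied from this report; substitute your own 'crc ip' output.
VM_IP=192.168.130.11   # the CRC VM itself, which answers for *.crc.testing
GW_IP=192.168.130.1    # the gateway the guest's /etc/resolv.conf points at
if command -v dig >/dev/null 2>&1; then
  dig +short +time=1 +tries=1 api.crc.testing @"$VM_IP" || true  # expected: VM IP
  dig +short +time=1 +tries=1 api.crc.testing @"$GW_IP" || true  # NXDOMAIN in this bug
fi
```

If the second query fails while the first succeeds, the forwarding of *.crc.testing queries from the guest-side resolver is broken, matching the "no such host" errors in the kubectl output above.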