crc-org / crc

CRC is a tool to help you run containers. It manages a local OpenShift 4.x cluster, Microshift or a Podman VM optimized for testing and development purposes
https://crc.dev
Apache License 2.0

Code Ready Containers error on restart #1786

Closed mgohashi closed 3 years ago

mgohashi commented 3 years ago

General information

CRC version

# Put the output of `crc version`
CodeReady Containers version: 1.20.0+ef3f80d
OpenShift version: 4.6.6 (embedded in executable)

CRC status

# Put the output of `crc status`
CRC VM:          Running
OpenShift:       Not Reachable (v4.6.6)
Disk Usage:      30.93GB of 267.9GB (Inside the CRC VM)
Cache Usage:     13.69GB
Cache Directory: /home/mohashi/.crc/cache

CRC config

# Put the output of `crc config view`
- cpus                                  : 8
- disk-size                             : 250
- memory                                : 28672
- pull-secret-file                      : /home/mohashi/Documents/pull-secret

Host Operating System

# Put the output of `cat /etc/os-release` in case of Linux
NAME=Fedora
VERSION="32 (Workstation Edition)"
ID=fedora
VERSION_ID=32
VERSION_CODENAME=""
PLATFORM_ID="platform:f32"
PRETTY_NAME="Fedora 32 (Workstation Edition)"
ANSI_COLOR="0;34"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:32"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f32/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=32
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=32
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Workstation Edition"
VARIANT_ID=workstation

Steps to reproduce

  1. crc setup
  2. crc start
  3. Wait until OpenShift becomes not reachable (this may take days)
  4. crc status (will start returning the following output)
    CRC VM:          Running
    OpenShift:       Not Reachable (v4.6.6)
    Disk Usage:      30.93GB of 267.9GB (Inside the CRC VM)
    Cache Usage:     13.69GB
    Cache Directory: /home/mohashi/.crc/cache
  5. crc stop
  6. crc start
    Making call to close connection to plugin binary
    Failed to update cluster pull secret: Temporary error: ssh command error:
    command : oc get secret --context admin --cluster crc --kubeconfig /opt/kubeconfig
    err     : Process exited with status 1
    output  :  (x20)

Expected

CRC is expected to start up normally, just as it did with the previous version of crc.

Actual

Logs

Debug logs were collected with `crc start --log-level debug` and are posted here: https://gist.github.com/mgohashi/34f26722ac2ca934832d4ace689b123f

gbraad commented 3 years ago

Clearly, OpenShift does not start. Have you tried to delete and start again? You assigned an unusually large amount of disk space, so I would like to remind you that CRC is intended to be an ephemeral OpenShift development environment.

timroster commented 3 years ago

I've seen this as well with releases of CRC based on 4.6, and only if the cluster is allowed to operate for an extended period (like 1-3 days). I've seen this with Ubuntu 18.04.5 LTS and CentOS 8.2, so also with the KVM hypervisor.

When restarting crc without using delete, the etcd pod fails at the environment-variable-checking init container with the following log message (after ssh-ing into the crc VM):

# cat etcd-crc-fdm75-master-0_openshift-etcd_etcd-ensure-env-vars-d5249245463ac1eadc44c03d0bd8217efd6a24f4f598c8e93c31831e5394be45.log
2020-12-22T00:35:21.651476376+00:00 stderr F Expected node IP to be 192.168.126.11 got 192.168.130.11

If crc is restarted with stop then start before the API becomes unresponsive, it comes up OK. This message only appears after a crc status command shows that OpenShift is "Not Reachable".
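
For anyone trying to locate that log themselves, a sketch of one way to find the failing init container's output from inside the VM, assuming the node's CRI-O tooling (crictl) is available:

(inside the VM) $ sudo crictl ps -a | grep ensure-env-vars    # find the exited init container
(inside the VM) $ sudo crictl logs <container-id>             # prints the "Expected node IP ..." message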

mgohashi commented 3 years ago

@gbraad I take your point. I try to automate everything I can, but sometimes it is important to have some state stored somewhere. I still have things that require a lot of manual steps to configure.

@timroster It seems that what you described is exactly what is happening here. How do you ssh into the VM? Is there a known user and ssh key to use?

Thank you all guys.
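
For reference, a minimal sketch of how the CRC VM is typically reached, assuming the machine key CRC generates under ~/.crc/machines/crc/ and the default RHCOS user core (the key file name may differ between CRC releases; `crc ip` prints the VM address):

$ ssh -i ~/.crc/machines/crc/id_rsa core@$(crc ip)    # key may be named id_ecdsa on newer CRC releases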

gbraad commented 3 years ago

Etcd reporting the wrong IP is not OK. This suggests the network stack might not have the dummy device?

@praveenkumar @guillaumerose
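
A quick way to check for the dummy device from inside the VM, assuming it is named eth10 as in the comments below:

(inside the VM) $ ip -d link show eth10    # detailed output should include "dummy" if the device exists
(inside the VM) $ ip addr show eth10       # should carry the 192.168.126.11/24 address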

timroster commented 3 years ago

@gbraad - I believe that the eth10 dummy device was present on the VM I was investigating. I'm setting up another instance to reproduce; it may take some time to come to a state where it has failed.

guillaumerose commented 3 years ago

Looking at the etcd operator code, I see that the IP used in the pod is the one you can see in kubectl describe nodes under InternalIP.

https://github.com/openshift/cluster-etcd-operator/blob/3ece45158cb6373a4f66e54f6806648cd5405b09/pkg/etcdenvvar/etcd_env.go#L180

(inside the VM) $ KUBECONFIG=/opt/kubeconfig kubectl describe nodes
...
Addresses:
  InternalIP:  192.168.126.11
  Hostname:    crc-x2wnj-master-0
...

(inside the VM) $ sudo cat /etc/systemd/system/kubelet.service
...
ExecStart=/usr/bin/hyperkube \
    kubelet \
      --node-ip=192.168.126.11 \
      --config=/etc/kubernetes/kubelet.conf \
...

If I'm not mistaken, this IP is reported by the kubelet. That would mean something updated the systemd kubelet service with a different IP.

If it happens again, can you check this IP and the kubelet service?
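
A compact way to run both checks from inside the VM, assuming /opt/kubeconfig exists as shown above:

(inside the VM) $ KUBECONFIG=/opt/kubeconfig kubectl get nodes -o wide           # INTERNAL-IP column
(inside the VM) $ sudo grep -- --node-ip /etc/systemd/system/kubelet.service     # IP passed to the kubelet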

timroster commented 3 years ago

@guillaumerose this time it took a week to get the VM to this state. Note that the API server is unresponsive, so one of the suggested checks fails.

Before stopping / restarting the VM, status:

dummy device (from /sbin/ip)
7: eth10: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether ea:de:48:29:c6:cd brd ff:ff:ff:ff:ff:ff
    inet 192.168.126.11/24 brd 192.168.126.255 scope global noprefixroute eth10
       valid_lft forever preferred_lft forever
    inet6 fe80::dadd:a3bf:c2b9:813c/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

ping 192.168.126.11
PING 192.168.126.11 (192.168.126.11) 56(84) bytes of data.
64 bytes from 192.168.126.11: icmp_seq=1 ttl=64 time=0.301 ms
64 bytes from 192.168.126.11: icmp_seq=2 ttl=64 time=0.046 ms

hostname -I
192.168.130.11 192.168.126.11 10.88.0.1 10.116.0.1 

KUBECONFIG=/opt/kubeconfig kubectl describe nodes
The connection to the server api.crc.testing:6443 was refused - did you specify the right host or port?

sudo cat /etc/systemd/system/kubelet.service
...
ExecStart=/usr/bin/hyperkube \
    kubelet \
      --node-ip=192.168.126.11 \
      --config=/etc/kubernetes/kubelet.conf \
...

The kubelet process is still running in the VM. It looks like the etcd process has stopped.

It's not restarting due to failing checks in the ensure-env-vars init container:

cat etcd-crc-fdm75-master-0_openshift-etcd_etcd-ensure-env-vars-e45f926772bf1f7fb39c230aa602968ebce4fe42a94eb84924b68e8c0abd7d92.log
2021-01-10T13:54:32.058994266+00:00 stderr F Expected node IP to be 192.168.126.11 got 192.168.130.11

I've not yet done a crc stop; crc start sequence, in case there are other things to check in the running VM.
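
Purely a guess, but the address the init container "got" (192.168.130.11) is the VM's address on the crc/libvirt network rather than the dummy interface, so it may simply be whichever source address the default route selects. That can be checked from inside the VM with something like:

(inside the VM) $ ip route get 1.1.1.1    # the "src" field is the address chosen for outbound traffic
(inside the VM) $ hostname -I             # all addresses, as shown above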

rscher commented 3 years ago

I did more investigation and have reason to believe that this issue is related to the host's hypervisor. See the analysis comparing working and non-working environments, showing lscpu and dmidecode command outputs: https://github.com/code-ready/crc/issues/1830#issuecomment-757555803
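
For anyone wanting to capture the same comparison on their own host, a rough sketch of the usual commands (dmidecode generally needs root):

$ lscpu | grep -iE 'model name|hypervisor|virtualization'
$ sudo dmidecode -s system-manufacturer
$ sudo dmidecode -s system-product-name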

timroster commented 3 years ago

I've observed this failure on Ubuntu 18.04.5 LTS running natively on a Ryzen 1700X processor, so in that case there is no host hypervisor. However, the last time the crc instance on that host went unresponsive was about the time this issue was opened.

gbraad commented 3 years ago

Ryzen CPUs aren't officially supported: https://access.redhat.com/support/policy/amd And the Castle (Fyre) environment uses Ubuntu as the host for nested virtualization: https://access.redhat.com/solutions/58204 (in reply to rscher)

This has been reported to the IBM Cloud team several times as a cause of issues. When the same tests were performed on a RHEL host, none of the issues showed up.

timroster commented 3 years ago

@gbraad I believe the OP is not using Castle, nor am I. Like the OP and many others have experienced, for some reason etcd in CRC is failing and not successfully restarting. I added lower-level symptoms to this issue, namely that the check in the etcd init container is getting information that conflicts with what the CRC VM expects.

I have observed these failures with the KVM hypervisor on a non-nested configuration with an AMD Ryzen 1000 series CPU with an Ubuntu guest, and also on an IBM Cloud VPC guest running CentOS 8.x. Others have added non-Castle environments to #1830.

gbraad commented 3 years ago

AMD Ryzen 1000 series CPU

All these tests were performed on a Ryzen-based system, right? https://access.redhat.com/support/policy/amd

If you 'own' the machines, please make sure to fully update the BIOS and microcode for the Ryzen CPUs (#1420, #1035), as several mainboards haven't received the latest release needed for stable use.
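
A quick way to see which microcode revision the kernel reports on a Linux host, for what it's worth:

$ grep -m1 microcode /proc/cpuinfo       # microcode revision as seen by the kernel
$ journalctl -k | grep -i microcode      # early microcode update messages, if any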

Also, I do not believe the OP is having the same problem as you are experiencing. Please provide more info and follow the template. On that issue I requested an issue specific to the problem, but it never came. (Note: please do not hijack an issue; create a new one.)

The reporter of #1830 and I have briefly emailed and that issue is unrelated to your findings.

@mgohashi do you still have issues with your environment?

timroster commented 3 years ago

@gbraad done, see #1888

This is on an Intel CPU, albeit with nested virtualization from the cloud provider (IBM Cloud Gen2 VSI)