Clearly OpenShift does not start. Have you tried to delete and start again?
You assign an unusually large amount of disk space, so I would like to remind you that CRC is considered an ephemeral OpenShift development environment.
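(For reference, that sequence is simply the following; a sketch only:)
$ crc delete   # removes the existing cluster and VM
$ crc start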
I've seen this as well with releases of CRC that are based on 4.6, and only if the cluster is allowed to operate for an extended period (like 1-3 days). I've seen this with Ubuntu 18.04.5 LTS and CentOS 8.2 - so in both cases with the KVM hypervisor.
When restarting crc without using delete, the etcd pod fails at the environment-variable-checking init container with the following log message (after ssh into the crc VM):
# cat etcd-crc-fdm75-master-0_openshift-etcd_etcd-ensure-env-vars-d5249245463ac1eadc44c03d0bd8217efd6a24f4f598c8e93c31831e5394be45.log
2020-12-22T00:35:21.651476376+00:00 stderr F Expected node IP to be 192.168.126.11 got 192.168.130.11
If crc is restarted with stop then start, it will come up OK, as long as this is done before the API becomes unresponsive. This message only appears after a crc status command shows that OpenShift is "Not Reachable".
@gbraad I got your point. I try to automate everything I can, but sometimes it is important to have some state stored somewhere. I still have things that require a lot of manual steps to configure.
@timroster It seems that what you have described is exactly what is happening here. How do you ssh into the VM? Is there a known user and SSH key to use?
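(For reference, a minimal sketch of ssh access to the CRC VM, assuming the machine key that crc generates under ~/.crc/machines/crc/; the key file name varies between releases (id_rsa or id_ecdsa) and the VM user is core:)
$ ssh -i ~/.crc/machines/crc/id_rsa core@$(crc ip)   # crc ip prints the VM address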
Thank you all.
Etcd reporting the wrong IP is not OK. It seems the network stack might not have the dummy device?
@praveenkumar @guillaumerose
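(A quick way to confirm whether the dummy device is present, from inside the VM; a sketch using the eth10 name seen in this thread:)
(inside the VM) $ ip -d link show eth10    # detailed output should include "dummy"
(inside the VM) $ ip addr show eth10       # should list 192.168.126.11/24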
@gbraad - I believe that the eth10 dummy device was present on the VM I was investigating. I'm setting up another instance to reproduce; it may take some time to reach a state where it has failed.
Looking at the etcd operator code, I see that the IP used in the pod is the one you can see in kubectl describe nodes under InternalIP.
(inside the VM) $ KUBECONFIG=/opt/kubeconfig kubectl describe nodes
...
Addresses:
InternalIP: 192.168.126.11
Hostname: crc-x2wnj-master-0
...
(inside the VM) $ sudo cat /etc/systemd/system/kubelet.service
...
ExecStart=/usr/bin/hyperkube \
kubelet \
--node-ip=192.168.126.11 \
--config=/etc/kubernetes/kubelet.conf \
...
If I'm not mistaken, this IP is reported by kubelet. It would mean that something updated the systemd kubelet service with a different IP.
If it happens again, can you check this IP and the kubelet service?
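(A compact sketch of those two checks, run from inside the VM:)
(inside the VM) $ KUBECONFIG=/opt/kubeconfig kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}'
(inside the VM) $ sudo grep -- '--node-ip' /etc/systemd/system/kubelet.service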
@guillaumerose this time it took a week to get the VM to this state. Note that the API server is unresponsive, so one of the suggested checks fails.
Before stopping/restarting the VM, status:
dummy device (from /sbin/ip)
7: eth10: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether ea:de:48:29:c6:cd brd ff:ff:ff:ff:ff:ff
inet 192.168.126.11/24 brd 192.168.126.255 scope global noprefixroute eth10
valid_lft forever preferred_lft forever
inet6 fe80::dadd:a3bf:c2b9:813c/64 scope link noprefixroute
valid_lft forever preferred_lft forever
ping 192.168.126.11
PING 192.168.126.11 (192.168.126.11) 56(84) bytes of data.
64 bytes from 192.168.126.11: icmp_seq=1 ttl=64 time=0.301 ms
64 bytes from 192.168.126.11: icmp_seq=2 ttl=64 time=0.046 ms
hostname -I
192.168.130.11 192.168.126.11 10.88.0.1 10.116.0.1
KUBECONFIG=/opt/kubeconfig kubectl describe nodes
The connection to the server api.crc.testing:6443 was refused - did you specify the right host or port?
sudo cat /etc/systemd/system/kubelet.service
...
ExecStart=/usr/bin/hyperkube \
kubelet \
--node-ip=192.168.126.11 \
--config=/etc/kubernetes/kubelet.conf \
...
The kubelet process is still running in the VM. It looks like the etcd process has stopped; it's not restarting due to failing checks in the ensure-env-vars init container:
cat etcd-crc-fdm75-master-0_openshift-etcd_etcd-ensure-env-vars-e45f926772bf1f7fb39c230aa602968ebce4fe42a94eb84924b68e8c0abd7d92.log
2021-01-10T13:54:32.058994266+00:00 stderr F Expected node IP to be 192.168.126.11 got 192.168.130.11
I've not yet done a crc stop; crc start sequence, in case there are other things to check in the running VM.
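(For reference, the same init container log can also be pulled with crictl on the node; a sketch, container names and IDs will differ:)
(inside the VM) $ sudo crictl ps -a --name ensure-env-vars   # find the exited init container
(inside the VM) $ sudo crictl logs <container-id-from-above>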
I did more investigation and have reason to believe that this issue is related to the host's hypervisor. See the analysis comparing working and non-working environments, showing lscpu and dmidecode command outputs: https://github.com/code-ready/crc/issues/1830#issuecomment-757555803
I've observed this failure on Ubuntu 18.04.5 LTS running natively on a Ryzen 1700X processor, so in that case there is no host hypervisor. However, the last time the crc instance on that host went unresponsive was around the time this issue was opened.
Ryzen CPUs aren't officially supported (https://access.redhat.com/support/policy/amd), and the Castle (Fyre) environment uses Ubuntu as the host for nested virtualization: https://access.redhat.com/solutions/58204 (in reply to rscher)
This has been reported to the IBM Cloud team several times as a cause of issues. When the same tests were performed on a RHEL host, none of the issues appeared.
@gbraad I believe the OP is not using Castle, nor am I. As the OP and many others have experienced, for some reason etcd in CRC is failing and not successfully restarting. I have added lower-level symptoms to this issue, namely that the check in the etcd initContainer is getting information that conflicts with what is expected in the CRC host VM.
I have observed these failures with the KVM hypervisor on a non-nested configuration with an AMD Ryzen 1000 series CPU running Ubuntu, and also on an IBM Cloud VPC guest running CentOS 8.x. Others have added reports to #1830 for non-Castle environments.
AMD Ryzen 1000 series CPU
All these tests were performed on a Ryzen-based system, right? https://access.redhat.com/support/policy/amd
If you 'own' the machines, please make sure to fully update the BIOS and microcode for the Ryzen CPUs (#1420, #1035), as several mainboards haven't received the latest release needed for stable use.
Also, I do not believe the OP is having the same problem as you are experiencing. Please provide more info... follow the template. On that issue I requested an issue specific to the problem, but it never came. (Note: please do not hijack an issue; create a new one.)
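(If it helps, a sketch of checking the microcode revision currently loaded on the host:)
$ grep -m1 microcode /proc/cpuinfo
$ sudo journalctl -k | grep -i microcode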
The reporter of #1830 and I have briefly emailed and that issue is unrelated to your findings.
@mgohashi do you still have issues with your environment?
@gbraad done, see #1888
This is on an Intel CPU, albeit with nested virtualization from the cloud provider (IBM Cloud Gen2 VSI)
General information
Did you run crc setup before starting it (Yes/No)? Yes
CRC version
CRC status
CRC config
Host Operating System
Steps to reproduce
Expected
It is expected to start up normally, just like with the previous version of crc.
Actual
Logs
You can start crc with crc start --log-level debug to collect logs. Please consider posting this on http://gist.github.com/ and post the link in the issue. https://gist.github.com/mgohashi/34f26722ac2ca934832d4ace689b123f