crc-org / crc

CRC is a tool to help you run containers. It manages a local OpenShift 4.x cluster, Microshift or a Podman VM optimized for testing and development purposes
https://crc.dev
Apache License 2.0

[BUG] Error starting machine: Error in driver during machine start: Unable to determine VM's IP address, did it fail to boot? #2732

Closed rscher closed 3 years ago

rscher commented 3 years ago

General information

## CRC version

```bash
# Put `crc version` output here
$ crc version
CodeReady Containers version: 1.32.1+4d40571f
OpenShift version: 4.8.5 (embedded in executable)
```

## CRC status
```bash
# Put `crc status --log-level debug` output here
$ crc status
CRC VM:          Running
OpenShift:       Unreachable (v4.8.5)
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     15.96GB
Cache Directory: /home/crcuser/.crc/cache
```

## CRC config

```bash
# Put `crc config view` output here
$ crc config view
- consent-telemetry                     : no
- cpus                                  : 16
- memory                                : 32768
- pull-secret-file                      : /home/crcuser/pull-secret.txt
```

## Host Operating System
```bash
# Put the output of `cat /etc/os-release` in case of Linux
# put the output of `sw_vers` in case of Mac
# Put the output of `systeminfo` in case of Windows
$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"
```

## Steps to reproduce

  1. $ crc setup --log-level debug
  2. $ crc start --log-level debug

## Expected

## Actual

```
..
DEBU (crc) DBG | time="2021-09-16T04:37:17-04:00" level=debug msg="Getting current state..." 
DEBU (crc) DBG | time="2021-09-16T04:37:17-04:00" level=debug msg="Waiting for machine to come up 59/60" 
DEBU (crc) DBG | time="2021-09-16T04:37:20-04:00" level=warning msg="Unable to determine VM's IP address, did it fail to boot?" 
DEBU Making call to close driver server           
DEBU (crc) Calling .Close                         
DEBU Successfully made call to close driver server 
DEBU Making call to close connection to plugin binary 
DEBU Making call to close driver server           
DEBU (crc) Calling .Close                         
DEBU Successfully made call to close driver server 
DEBU (crc) DBG | time="2021-09-16T04:37:20-04:00" level=debug msg="Closing plugin on server side" 
DEBU Making call to close connection to plugin binary 
Error starting machine: Error in driver during machine start: Unable to determine VM's IP address, did it fail to boot?
```

## Logs

https://gist.github.com/rscher/e958f85d95ef7939cd8408ca41ebd7aa

Before gathering the logs, try the following to see if it fixes your issue:


```bash
$ crc delete -f
$ crc cleanup
$ crc setup
$ crc start --log-level debug
```

Tried this; it did not resolve the issue. It works on RHEL 7.9.

Please consider posting the output of `crc start --log-level debug`  on http://gist.github.com/ and post the link in the issue.
gbraad commented 3 years ago

What was the situation before the start?

Error starting machine: Error in driver during machine start: Unable to determine VM's IP address, did it fail to boot?

Did a previous VM exist?

gbraad commented 3 years ago

Analyzing the logs I only really notice:

```
time="2021-09-16T04:34:15-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:15-04:00\" level=debug msg=\"Starting VM crc\""
time="2021-09-16T04:34:15-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:15-04:00\" level=debug msg=\"Validating network\""
time="2021-09-16T04:34:15-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:15-04:00\" level=debug msg=\"Validating storage pool\""
time="2021-09-16T04:34:20-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:20-04:00\" level=debug msg=\"GetIP called for crc\""
time="2021-09-16T04:34:20-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:20-04:00\" level=debug msg=\"Getting current state...\""
time="2021-09-16T04:34:20-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:20-04:00\" level=debug msg=\"Waiting for machine to come up 0/60\""
```

It seems the crc network is set up beforehand, etc., but the driver then fails to obtain an IP address. Is the VM really starting? Can you verify this with the console?
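For reference, a hedged sketch of how that can be checked from the host (assuming the libvirt driver, with the default domain and network both named `crc` as used in this thread):

```bash
# Is the domain actually running?
sudo virsh -c qemu:///system list --all

# Did the guest pick up a DHCP lease on the crc libvirt network?
sudo virsh -c qemu:///system net-dhcp-leases crc

# Ask the ARP/lease tables for the domain's IP address.
sudo virsh -c qemu:///system domifaddr crc --source arp
```

If the lease list stays empty while the domain is listed as running, that matches the "Unable to determine VM's IP address" error above.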

rscher commented 3 years ago

looking at a previous post, https://github.com/code-ready/crc/issues/1742

I'm seeing this warning in /var/log/libvirt/qemu/crc.log:

2021-09-16T08:34:15.361571Z qemu-kvm: -device cirrus-vga,id=video0,bus=pcie.0,addr=0x1: **warning: 'cirrus-vga' is deprecated, please use a different VGA card instead**

Any ideas how to resolve this?

-Russ

praveenkumar commented 3 years ago

This looks like the same as https://github.com/code-ready/crc/issues/1028.

gbraad commented 3 years ago

warning: 'cirrus-vga' is deprecated, please use a different VGA card instead

This is a deprecation notice from the underlying virtualization (qemu-kvm), not from CRC. It does not affect usage at the moment.
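If one nevertheless wanted to silence the warning, a hedged sketch of swapping the video device (domain name `crc` taken from this thread; crc may regenerate the domain definition on its next start, so treat this purely as an experiment):

```bash
# Open the libvirt domain XML for editing...
sudo virsh edit crc
# ...then under <devices><video>, replace
#   <model type='cirrus' .../>
# with a non-deprecated card, for example:
#   <model type='vga'/>      (or type='virtio')
```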

cfergeau commented 3 years ago

Can you attach the full `/var/log/libvirt/qemu/crc.log`? Does the VM exist after a failed `crc start` (`virsh -c qemu:///system list --all`)?

gbraad commented 3 years ago

Logs https://gist.github.com/rscher/e958f85d95ef7939cd8408ca41ebd7aa

According to these logs it seems the VM is created and gets started.

and also

$ crc status
CRC VM:          Running

would suggest so. Can you verify on a VNC console what the 'crc' VM shows?
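A hedged sketch of reaching that console; the qemu command line pasted later in this thread shows `-vnc 127.0.0.1:2`, i.e. a local-only VNC display, so a tunnel plus any VNC viewer is assumed:

```bash
# Which VNC display is the crc domain using? (":2" maps to TCP port 5902)
sudo virsh vncdisplay crc

# From a workstation, tunnel the port and point a VNC viewer at it
# (<host> is a placeholder for the RHEL 8.4 machine running crc).
ssh -L 5902:127.0.0.1:5902 crcuser@<host>
vncviewer 127.0.0.1:5902
```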

gbraad commented 3 years ago

Running CRC on: VM

What is the underlying hypervisor here?

works on RHEL 7.9

Sounds awfully familiar to an earlier issue... hmmmm...

rscher commented 3 years ago
```
[crcuser@crcRH84 ~]$ virsh -c qemu:///system list --all
 Id   Name   State

 2    crc    running
[crcuser@crcRH84 ~]$ ll ~/.crc/machines/crc/
total 212
-rw------- 1 crcuser crcuser    618 Sep 16 05:59 config.json
-rw------- 1 crcuser crcuser    384 Sep 16 05:59 id_ecdsa
-rw------- 1 crcuser crcuser    253 Sep 16 05:59 id_ecdsa.pub
-rw------- 1 crcuser crcuser     23 Sep 16 05:59 kubeadmin-password
-rw-r--r-- 1 qemu    qemu    197104 Sep 16 05:59 crc.qcow2      <--- not correct size
[crcuser@crcRH84 ~]$ crc status --log-level debug
DEBU CodeReady Containers version: 1.32.1+4d40571f 
DEBU OpenShift version: 4.8.5 (embedded in executable) 
DEBU Running 'crc status'                         
DEBU Checking file: /home/crcuser/.crc/machines/crc/.crc-exist 
DEBU Checking file: /home/crcuser/.crc/machines/crc/.crc-exist 
DEBU Found binary path at /home/crcuser/.crc/bin/crc-driver-libvirt 
DEBU Launching plugin server for driver libvirt   
DEBU Plugin server listening at address 127.0.0.1:46839 
DEBU () Calling .GetVersion                       
DEBU Using API Version 1                          
DEBU () Calling .SetConfigRaw                     
DEBU () Calling .GetMachineName                   
DEBU (crc) Calling .GetState                      
DEBU (crc) DBG | time="2021-09-16T06:34:25-04:00" level=debug msg="Getting current state..." 
DEBU (crc) DBG | time="2021-09-16T06:34:25-04:00" level=debug msg="Fetching VM..." 
DEBU (crc) Calling .GetBundleName                 
DEBU (crc) Calling .GetIP                         
DEBU (crc) DBG | time="2021-09-16T06:34:25-04:00" level=debug msg="GetIP called for crc" 
DEBU (crc) DBG | time="2021-09-16T06:34:25-04:00" level=debug msg="Getting current state..." 
DEBU Running SSH command: df -B1 --output=size,used,target /sysroot | tail -1 
DEBU Using ssh private keys: [/home/crcuser/.crc/machines/crc/id_ecdsa /home/crcuser/.crc/cache/crc_libvirt_4.8.5/id_ecdsa_crc] 
DEBU SSH command results: err: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain, output:  
DEBU Cannot get root partition usage: ssh command error:
command : df -B1 --output=size,used,target /sysroot | tail -1
err     : ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain\n 
DEBU cannot get OpenShift status: stat /home/crcuser/.crc/machines/crc/kubeconfig: no such file or directory 
DEBU Making call to close driver server           
DEBU (crc) Calling .Close                         
DEBU (crc) DBG | time="2021-09-16T06:34:25-04:00" level=debug msg="Closing plugin on server side" 
DEBU Successfully made call to close driver server 
DEBU Making call to close connection to plugin binary 
CRC VM:          Running
OpenShift:       Unreachable (v4.8.5)
Disk Usage:      0B of 0B (Inside the CRC VM)       0B ?
Cache Usage:     15.96GB
Cache Directory: /home/crcuser/.crc/cache
```

It says `-machine pc-q35-rhel8.2.0`, but this is RHEL 8.4; does that matter? (The full qemu command line is pasted below.)
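Not an answer from the thread, but a hedged way to check whether that machine type is known to the host's qemu-kvm at all (binary path assumes RHEL's packaging):

```bash
# List the machine types this qemu-kvm build supports; older versioned q35
# types such as pc-q35-rhel8.2.0 are normally still present on RHEL 8.4
# for compatibility.
/usr/libexec/qemu-kvm -machine help | grep -i q35
```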

```
[crcuser@crcRH84 ~]$ psgrep  qemu
qemu        7076       1 99 06:16 ?        00:16:31 /usr/libexec/qemu-kvm -name guest=crc,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-crc/master-key.aes **-machine pc-q35-rhel8.2.0**,accel=kvm,usb=off,dump-guest-core=off -cpu host,rdrand=off -m 32768 -overcommit mem-lock=off -smp 16,sockets=16,cores=1,threads=1 -uuid 83b00b7a-ee7a-4bb0-849c-4cf73eeb87f5 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=38,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot menu=off,strict=on -device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 -device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 -device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 -device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 -device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 -device qemu-xhci,id=usb,bus=pci.2,addr=0x0 -blockdev {"driver":"file","filename":"/home/crcuser/.crc/cache/crc_libvirt_4.8.5/crc.qcow2","aio":"threads","node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-2-format","read-only":true,"driver":"qcow2","file":"libvirt-2-storage","backing":null} -blockdev {"driver":"file","filename":"/home/crcuser/.crc/machines/crc/crc.qcow2","aio":"threads","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":false,"driver":"qcow2","file":"libvirt-1-storage","backing":"libvirt-2-format"} -device virtio-blk-pci,scsi=off,bus=pci.3,addr=0x0,drive=libvirt-1-format,id=virtio-disk0,bootindex=1 -netdev tap,fd=40,id=hostnet0,vhost=on,vhostfd=41 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:fd:fc:07:21:82,bus=pci.1,addr=0x0 -chardev stdio,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:2 -device cirrus-vga,id=video0,bus=pcie.0,addr=0x1 -incoming defer -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.4,addr=0x0 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
[crcuser@crcRH84 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC-Rome Processor
Stepping:            0
CPU MHz:             2349.998
BogoMIPS:            4699.99
Virtualization:      AMD-V
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities

```

/var/log/libvirt/qemu/crc.log:

https://gist.githubusercontent.com/rscher/e26c022b3cbcb4c2095652ada26756aa/raw/8193f350017d65eff0251fad973ffbfb9802c642/gistfile1.txt

--

Yes, it's nested virt with AMD EPYCs in the Castle env... but it works flawlessly on RHEL 7.9; see my previous post:

https://github.com/code-ready/crc/issues/2704

gbraad commented 3 years ago

AMD EPYCS in castle

They are AFAIK still hosted on an older Ubuntu LTS release. 18.04 IIRC. This has been causing earlier issues with 8.x on these Epycs.

gbraad commented 3 years ago

Disk Usage: 0B of 0B (Inside the CRC VM) 0B ?

That is because it is unable to communicate with the VM. It does not come up as expected and does not get an IP address... any possibility to see what the VM does?

sudo virsh console crc

Anything as output?

gbraad commented 3 years ago

According to the qemu logs

[ 2776.010000] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]

and a whole slew of other issues around the virtualization stack. This is identical to what your colleague had at that time. It works in Fyre (Xeon CPUs), but not on Castle (Epyc CPUs).

Probably sluggish due to these AMD EPYC cpu's ... Ref

... which might explain your observation of this being 'sluggish'.


I notice this is a problem with AMD Epyc after IOMMU gets enabled. Perhaps you can change the watchdog threshold to start with?

$ echo 20 > /proc/sys/kernel/watchdog_thresh

But this might also indicate there is a heavy workload on these machines. Is the host shared with other VMs?
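A hedged sketch of applying that watchdog suggestion in a way that also survives a reboot (the `kernel.watchdog_thresh` sysctl maps to the same /proc knob used above):

```bash
# One-off, equivalent to the echo above:
sudo sysctl -w kernel.watchdog_thresh=20

# Persist across reboots (the drop-in file name is only an example):
echo 'kernel.watchdog_thresh = 20' | sudo tee /etc/sysctl.d/99-watchdog.conf
sudo sysctl --system
```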

rscher commented 3 years ago
```
[crcuser@crcRH84 ~]$ sudo virsh console crc
Connected to domain crc
Escape character is ^]
error: internal error: character device console0 is not using a PTY
[crcuser@crcRH84 ~]$ sudo dmidecode  -t processor  -q  | tail -20
        Socket Designation: CPU f
        Type: Central Processor
        Family: Other
        Manufacturer: QEMU
        ID: 12 0F 80 00 FF FB 8B 07
        Version: pc-i440fx-bionic
        Voltage: Unknown
        External Clock: Unknown
        Max Speed: 2000 MHz
        Current Speed: 2000 MHz
        Status: Populated, Enabled
        Upgrade: Other
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Core Count: 1
        Core Enabled: 1
        Thread Count: 1
        Characteristics: None
```

Well, it used to not work at all in Castle until recent versions, 1.26 and up; then it became stable with the new code that checks the operators...

gbraad commented 3 years ago

bionic

Ubuntu 18.04. We had a LOT of issues testing CRC there in combination with the AMD Epyc; the VM would lock up and stop responding.

gbraad commented 3 years ago

@cfergeau any suggestions?

rscher commented 3 years ago

Yep, and I've lost the battle with Castle/Fyre DevIT to upgrade Bionic (a crappy Canonical VPC used as a low-cost solution long before the RH acquisition) or to provide Xeon chips to a limited set of VMs for me... all requests denied. This is hugely important to IBM and my team; we need to push RH engineering to work some magic here.

Side note: I'm also overseeing this in our Blue Diamond restricted VPC, with VMware-based RHEL host VMs on Xeon chips. It works great, stellar performance, and that is also nested virtualization with a RHEL host, VT-x.

gbraad commented 3 years ago
```bash
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
```
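No explanation was given with that command, but it is presumably about the timing issues mentioned in the next comment. A hedged sketch of what to look at (the sysfs paths are standard; the values depend on the host and guest):

```bash
# Clocksource the guest kernel is currently using; inside a KVM guest this
# is normally kvm-clock, and anything else can hint at timing problems.
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

# Other clocksources the kernel knows about on this machine.
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
```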
gbraad commented 3 years ago

VMWare-based RHEL host VMs with xeon chips,

Interesting to hear. Most reports with VMware suffer from serious timing issues for CRC.

and ive lost the battle with castle/fyre

We have been informed that a migration will happen to replace these with RHEL as host. (but from second or even third account, so don't pin me on this).

rscher commented 3 years ago

VMWare-based RHEL host VMs with xeon chips,

Interesting to hear. Most reports with VMware suffer from serious timing issues for CRC.

Not compared to these AMD EPYC-based VMs in Castle; they are at least a factor of 2.5 slower, tested with the same crc/RHEL stack, application, and kube event load. IMHO, I have no complaints with VMware in general when compared to some in-house, open-source, low-cost DIY private cloud implementations. Maybe it's high time we try RH Virtualization/OpenStack as a solution to our VPC wars.

and ive lost the battle with castle/fyre

We have been informed that a migration will happen to replace these with RHEL as host. (but from second or even third account, so don't pin me on this).

Won't hold my breath, and if it does happen, when? I need this to be working... yesterday.

Having said that, I need to escalate this issue to RH/OpenShift Engineering (Germany) ASAP and get some more eyes on it; it's not holding up. We have been having this "nested virtualization on older Ubuntu" story for well over a year now. It worked again starting with 1.26, OCP 4.7.8 on RHEL 7.9, since you put in all those operator checks at the end to verify stability before reporting the cluster is ready, and it was quite stable: MTBF over 3 months, pods stayed up. My script always wgets the latest crc client and does a build, even patches the cluster to allow crc's local insecure registry, and deploys our standard app with 4 pods, so it keeps up with each of the clients; the results have been good until now, and it just broke using the latest crc on RHEL 8.4 but seems OK on RHEL 7.9. The same build works great in our other VPC, VMware-based Xeon, for both RHEL 7.9 and RHEL 8.4, which is also "nested virtualization" the last time I checked. I recommend that we work together with RH/OpenShift engineering in Germany to get this solved; it ain't going away, I can guarantee you that.

cfergeau commented 3 years ago

Have you also been working with Ubuntu support to make sure that there are no known nested virtualization issues on their side as well? Is the team managing the Ubuntu virtualization hosts also involved in solving this, in case testing of patched Ubuntu kernels/qemus is needed?

rscher commented 3 years ago

Have you also been working with Ubuntu support to make sure that there are no known nested virtualization issues on their side as well? Is the team managing the Ubuntu virtualization hosts also involved in solving this, in case testing of patched Ubuntu kernels/qemus is needed?

Hey @cfergeau, no one is working with Canonical on this issue, and our DevIT folks (Castle maintenance support) claim pc-i440fx-bionic has no issues, other than when I submit a ticket for this crc issue... Since the RH acquisition they keep a somewhat low profile about their setup being Canonical-based; it was designed and rolled out long before the acquisition by a consulting team from the UK... hmm, maybe even Canonical people. @gbraad also stated it will be replaced with RH eventually, and I've heard rumors as well, but my timeline is now, not a year from now. I appreciate all the help and support. I have RHOCP premium support as well; I can go harass them and they will escalate to the Germany team. Closing the issue. -Russ

crackertux commented 1 year ago

I solved it by starting the libvirtd service:

```
[user@fedora ~]$ sudo virsh list

 2    crc    ejecutando

[user@fedora ~]$ systemctl status libvirtd
○ libvirtd.service - Virtualization daemon
     Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; disabled; preset: disabled)
     Active: inactive (dead)
TriggeredBy: ○ libvirtd-admin.socket
             ○ libvirtd-tcp.socket
             ○ libvirtd.socket
             ○ libvirtd-tls.socket
             ○ libvirtd-ro.socket
       Docs: man:libvirtd(8)
             https://libvirt.org

[user@fedora ~]$ systemctl start libvirtd
[user@fedora ~]$ systemctl status libvirtd
● libvirtd.service - Virtualization daemon
     Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; disabled; preset: disabled)
     Active: active (running) since Wed 2023-02-22 18:53:50 CST; 2s ago
TriggeredBy: ● libvirtd-admin.socket
             ○ libvirtd-tcp.socket
             ● libvirtd.socket
             ○ libvirtd-tls.socket
             ● libvirtd-ro.socket
       Docs: man:libvirtd(8)
             https://libvirt.org
   Main PID: 5057 (libvirtd)
      Tasks: 20 (limit: 32768)
     Memory: 20.1M
        CPU: 362ms
     CGroup: /system.slice/libvirtd.service
             └─5057 /usr/sbin/libvirtd --timeout 120

feb 22 18:53:50 fedora systemd[1]: Starting libvirtd.service - Virtualization daemon...
feb 22 18:53:50 fedora systemd[1]: Started libvirtd.service - Virtualization daemon.
[user@fedora ~]$
```
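A hedged follow-up to the fix above: the unit is shown as `disabled`, so it will be dead again after a reboot unless the service (or its activation sockets, preferred on recent Fedora/RHEL) is enabled:

```bash
# Start libvirtd now and have it come back after a reboot.
sudo systemctl enable --now libvirtd

# Alternatively, rely on socket activation so the daemon starts on demand
# when crc or virsh connects (socket unit names taken from the output above).
sudo systemctl enable --now libvirtd.socket libvirtd-ro.socket libvirtd-admin.socket
```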