Closed rscher closed 3 years ago
What was the situation before the start?
Error starting machine: Error in driver during machine start: Unable to determine VM's IP address, did it fail to boot?
Did a previous VM exist?
Analyzing the logs I only really notice:
time="2021-09-16T04:34:15-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:15-04:00\" level=debug msg=\"Starting VM crc\""
time="2021-09-16T04:34:15-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:15-04:00\" level=debug msg=\"Validating network\""
time="2021-09-16T04:34:15-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:15-04:00\" level=debug msg=\"Validating storage pool\""
time="2021-09-16T04:34:20-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:20-04:00\" level=debug msg=\"GetIP called for crc\""
time="2021-09-16T04:34:20-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:20-04:00\" level=debug msg=\"Getting current state...\""
time="2021-09-16T04:34:20-04:00" level=debug msg="(crc) DBG | time=\"2021-09-16T04:34:20-04:00\" level=debug msg=\"Waiting for machine to come up 0/60\""
It seems to set up the crc network beforehand, etc., but fails to obtain an IP address from there. Is the VM really starting? Can you verify this with the console?
Looking at a previous post, https://github.com/code-ready/crc/issues/1742, I'm seeing this warning in /var/log/libvirt/qemu/crc.log:
2021-09-16T08:34:15.361571Z qemu-kvm: -device cirrus-vga,id=video0,bus=pcie.0,addr=0x1: **warning: 'cirrus-vga' is deprecated, please use a different VGA card instead**
Any ideas how to resolve this?
-Russ
This looks like the same issue as https://github.com/code-ready/crc/issues/1028.
warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
This is a deprecation notice from the underlying virtualization (qemu-kvm) and not from CRC. It does not affect usage at the moment.
Can you attach the full /var/log/libvirt/qemu/crc.log? Does the VM exist after a failed crc start (virsh -c qemu:///system list --all)?
Logs https://gist.github.com/rscher/e958f85d95ef7939cd8408ca41ebd7aa
According to these logs it seems it is created and gets started.
and also
$ crc status
CRC VM: Running
would suggest so. Can you verify on a vnc console what the 'CRC' VM shows?
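For reference, one way to locate that console (a sketch: `virsh vncdisplay` is the standard libvirt query; the viewer command assumes a client such as TigerVNC is installed):

```shell
# libvirt reports the VNC display as ":N"; the TCP port is 5900 + N.
# The real query would be: display=$(sudo virsh -c qemu:///system vncdisplay crc)
display=":2"                         # stand-in value for illustration
port=$((5900 + ${display#:}))        # strip the leading colon, add the VNC base port
echo "connect a VNC viewer to 127.0.0.1:${port}"
# e.g.: vncviewer 127.0.0.1:${port}
```

Whatever the console shows during boot (kernel panic, emergency shell, or nothing at all) narrows down why no IP address ever appears.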
Running CRC on: VM
What is the underlying hypervisor here?
works on RHEL 7.9
Sounds awfully familiar to an earlier issue... hmmmm...
[crcuser@crcRH84 ~]$ virsh -c qemu:///system list --all
Id Name State
2 crc running
[crcuser@crcRH84 ~]$ ll ~/.crc/machines/crc/
total 212
-rw------- 1 crcuser crcuser 618 Sep 16 05:59 config.json
-rw------- 1 crcuser crcuser 384 Sep 16 05:59 id_ecdsa
-rw------- 1 crcuser crcuser 253 Sep 16 05:59 id_ecdsa.pub
-rw------- 1 crcuser crcuser 23 Sep 16 05:59 kubeadmin-password
-rw-r--r-- 1 qemu qemu **197104** Sep 16 05:59 crc.qcow2 <--- not correct size
[crcuser@crcRH84 ~]$ crc status --log-level debug
DEBU CodeReady Containers version: 1.32.1+4d40571f
DEBU OpenShift version: 4.8.5 (embedded in executable)
DEBU Running 'crc status'
DEBU Checking file: /home/crcuser/.crc/machines/crc/.crc-exist
DEBU Checking file: /home/crcuser/.crc/machines/crc/.crc-exist
DEBU Found binary path at /home/crcuser/.crc/bin/crc-driver-libvirt
DEBU Launching plugin server for driver libvirt
DEBU Plugin server listening at address 127.0.0.1:46839
DEBU () Calling .GetVersion
DEBU Using API Version 1
DEBU () Calling .SetConfigRaw
DEBU () Calling .GetMachineName
DEBU (crc) Calling .GetState
DEBU (crc) DBG | time="2021-09-16T06:34:25-04:00" level=debug msg="Getting current state..."
DEBU (crc) DBG | time="2021-09-16T06:34:25-04:00" level=debug msg="Fetching VM..."
DEBU (crc) Calling .GetBundleName
DEBU (crc) Calling .GetIP
DEBU (crc) DBG | time="2021-09-16T06:34:25-04:00" level=debug msg="GetIP called for crc"
DEBU (crc) DBG | time="2021-09-16T06:34:25-04:00" level=debug msg="Getting current state..."
DEBU Running SSH command: df -B1 --output=size,used,target /sysroot | tail -1
DEBU Using ssh private keys: [/home/crcuser/.crc/machines/crc/id_ecdsa /home/crcuser/.crc/cache/crc_libvirt_4.8.5/id_ecdsa_crc]
DEBU SSH command results: err: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain, output:
DEBU Cannot get root partition usage: ssh command error:
command : df -B1 --output=size,used,target /sysroot | tail -1
err : ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain\n
DEBU cannot get OpenShift status: stat /home/crcuser/.crc/machines/crc/kubeconfig: no such file or directory
DEBU Making call to close driver server
DEBU (crc) Calling .Close
DEBU (crc) DBG | time="2021-09-16T06:34:25-04:00" level=debug msg="Closing plugin on server side"
DEBU Successfully made call to close driver server
DEBU Making call to close connection to plugin binary
CRC VM: Running
OpenShift: Unreachable (v4.8.5)
**Disk Usage: 0B of 0B (Inside the CRC VM)** 0B ?
Cache Usage: 15.96GB
Cache Directory: /home/crcuser/.crc/cache
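For context on those zeros: per the debug output above, `crc status` runs `df -B1 --output=size,used,target /sysroot | tail -1` over SSH and reads size/used from the result; when SSH fails there is nothing to parse, so it falls back to 0B. A minimal sketch of that parsing, with invented sample `df` output:

```shell
# Stand-in for what the VM would return over SSH (numbers are invented).
sample='        1B-blocks        Used Mounted on
      33158750208 16579375104 /sysroot'
line=$(printf '%s\n' "$sample" | tail -1)   # same `tail -1` as the probe
size=$(echo "$line" | awk '{print $1}')     # total bytes
used=$(echo "$line" | awk '{print $2}')     # used bytes
echo "Disk Usage: ${used}B of ${size}B (Inside the CRC VM)"
```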
It says -machine pc-q35-rhel8.2.0, but it's RHEL 8.4; does that matter?
[crcuser@crcRH84 ~]$ psgrep qemu
qemu 7076 1 99 06:16 ? 00:16:31 /usr/libexec/qemu-kvm -name guest=crc,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-crc/master-key.aes **-machine pc-q35-rhel8.2.0**,accel=kvm,usb=off,dump-guest-core=off -cpu host,rdrand=off -m 32768 -overcommit mem-lock=off -smp 16,sockets=16,cores=1,threads=1 -uuid 83b00b7a-ee7a-4bb0-849c-4cf73eeb87f5 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=38,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot menu=off,strict=on -device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 -device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 -device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 -device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 -device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 -device qemu-xhci,id=usb,bus=pci.2,addr=0x0 -blockdev {"driver":"file","filename":"/home/crcuser/.crc/cache/crc_libvirt_4.8.5/crc.qcow2","aio":"threads","node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-2-format","read-only":true,"driver":"qcow2","file":"libvirt-2-storage","backing":null} -blockdev {"driver":"file","filename":"/home/crcuser/.crc/machines/crc/crc.qcow2","aio":"threads","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":false,"driver":"qcow2","file":"libvirt-1-storage","backing":"libvirt-2-format"} -device virtio-blk-pci,scsi=off,bus=pci.3,addr=0x0,drive=libvirt-1-format,id=virtio-disk0,bootindex=1 -netdev tap,fd=40,id=hostnet0,vhost=on,vhostfd=41 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:fd:fc:07:21:82,bus=pci.1,addr=0x0 -chardev stdio,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:2 -device cirrus-vga,id=video0,bus=pcie.0,addr=0x1 
-incoming defer -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.4,addr=0x0 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
[crcuser@crcRH84 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC-Rome Processor
Stepping: 0
CPU MHz: 2349.998
BogoMIPS: 4699.99
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities
/var/log/libvirt/qemu/crc.log
Yeah, it's nested virt with AMD EPYCs in the castle env ... but it works flawlessly on RHEL 7.9; see my previous post.
AMD EPYCS in castle
They are AFAIK still hosted on an older Ubuntu LTS release, 18.04 IIRC. This has caused issues with 8.x on these Epycs before.
Disk Usage: 0B of 0B (Inside the CRC VM) 0B ?
That is because it is unable to communicate with the VM. It does not come up as expected and does not get an IP address... any possibility to see what the VM does?
sudo virsh console crc
Anything as output?
According to the qemu logs
[ 2776.010000] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
and a whole slew of other issues around the virtualization stack. This is identical to what your colleague had at that time. It works in Fyre (Xeon CPUs), but not on castle (Epyc CPUs).
Probably sluggish due to these AMD EPYC CPUs ... Ref
... which might explain your observation of this being 'sluggish'.
I notice this is a known problem with AMD Epyc after IOMMU gets enabled. Perhaps you can raise the watchdog threshold to start with (writing to /proc/sys requires root)?
$ sudo sh -c 'echo 20 > /proc/sys/kernel/watchdog_thresh'
But this might also indicate there is a heavy workload on these machines. Is the host shared with other VMs?
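If raising the threshold helps, it can be applied through the regular sysctl machinery instead of writing to /proc directly (a sketch; `kernel.watchdog_thresh` is the same kernel knob, and the drop-in filename is an arbitrary choice):

```shell
# One-shot change, equivalent to writing /proc/sys/kernel/watchdog_thresh
sudo sysctl -w kernel.watchdog_thresh=20
# Persist the setting across reboots via a sysctl drop-in
echo 'kernel.watchdog_thresh = 20' | sudo tee /etc/sysctl.d/99-watchdog.conf
```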
[crcuser@crcRH84 ~]$ sudo virsh console crc
Connected to domain crc
Escape character is ^]
error: internal error: character device console0 is not using a PTY
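For what it's worth, that PTY error is consistent with the qemu command line captured earlier: the serial console is wired to stdio (`-chardev stdio,id=charserial0`), so there is no PTY for `virsh console` to attach to. A small sketch of the distinction (the command-line fragment is copied from the `psgrep qemu` output):

```shell
# virsh console needs the serial chardev to be a PTY ("-chardev pty,...");
# this domain uses stdio, so serial output lands in the libvirt domain log.
cmdline='-chardev stdio,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0'
case "$cmdline" in
  *'-chardev pty'*)   echo "virsh console can attach" ;;
  *'-chardev stdio'*) echo "read /var/log/libvirt/qemu/crc.log instead" ;;
esac
```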
[crcuser@crcRH84 ~]$ sudo dmidecode -t processor -q | tail -20
Socket Designation: CPU f
Type: Central Processor
Family: Other
Manufacturer: QEMU
ID: 12 0F 80 00 FF FB 8B 07
**Version: pc-i440fx-bionic**
Voltage: Unknown
External Clock: Unknown
Max Speed: 2000 MHz
Current Speed: 2000 MHz
Status: Populated, Enabled
Upgrade: Other
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Core Count: 1
Core Enabled: 1
Thread Count: 1
Characteristics: None
Well, it used to not work at all in castle until recent versions, 1.26 and up; then it became stable with the new code checking the operators ...
bionic
Ubuntu 18.04. We had a LOT of issues testing CRC there in combination with the AMD Epyc. The VM would lock up and stop responding.
@cfergeau any suggestions?
Yep, and I've lost the battle with castle/fyre DevIT to upgrade bionic (a crappy Canonical VPC used as a low-cost solution long before the RH acquisition) or to provide Xeon chips to a limited set of VMs for me ... all requests denied. This is hugely important to IBM and my team; we need to push RH engineering to work some magic here.
Side note: I'm also overseeing this in our Blue Diamond restricted VPC, VMware-based RHEL host VMs with Xeon chips. It works great, stellar performance; also nested virtualization with a RHEL host, VT-x.
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
VMWare-based RHEL host VMs with xeon chips,
Interesting to hear. Most reports with VMware suffer from serious timing issues for CRC.
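For anyone running the clocksource check above: inside a healthy KVM guest the expected answer is `kvm-clock`; other values can be legitimate, but they are worth a second look when timing issues are suspected. A tiny sketch of the check (the hard-coded value stands in for the real `cat`):

```shell
# Stand-in for: cat /sys/devices/system/clocksource/clocksource0/current_clocksource
clock='kvm-clock'
if [ "$clock" = "kvm-clock" ]; then
  echo "paravirtual clock in use; guest timekeeping should be stable"
else
  echo "clocksource is $clock; check for clock drift in the guest"
fi
```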
and I've lost the battle with castle/fyre
We have been informed that a migration will happen to replace these with RHEL as the host (but from a second- or even third-hand account, so don't pin me on this).
VMWare-based RHEL host VMs with xeon chips,
Interesting to hear. Most reports with VMware suffer from serious timing issues for CRC.
Not compared to these AMD EPYC-based VMs in castle; those are at least a factor of 2.5 slower, tested with the same crc/RHEL stack, application, and kube event load. IMHO I have no complaints with VMware in general when compared to some in-house, open-source, low-cost DIY private cloud implementations. Maybe it's high time we try RH Virtualization/OpenStack as a solution to our VPC wars.
and I've lost the battle with castle/fyre
We have been informed that a migration will happen to replace these with RHEL as the host (but from a second- or even third-hand account, so don't pin me on this).
Won't hold my breath, and if it does happen, when? I need this to be working ... yesterday.
Having said that, I need to escalate this issue to RH/OpenShift Engineering (Germany) ASAP and get some more eyes on it; it's not holding up. We have been having this "nested virtualization on older Ubuntu" story for well over a year now. It worked again starting with 1.26, OCP 4.7.8 on RHEL 7.9, since you put in all those operator checks at the end to verify stability before reporting the cluster is ready, and it was quite stable: MTBF over 3 months, pods stayed up. My script always wgets the latest crc client and does a build, even patches the cluster to allow crc's local insecure registry, and deploys our standard app with 4 pods, so it keeps up with each of the clients. Results have been good until now; it just broke using the latest crc on RHEL 8.4 but seems OK on RHEL 7.9. The same build works great in our other VPC, VMware-based Xeon, for both RHEL 7.9 and RHEL 8.4, which is also "nested virtualization" the last time I checked. I recommend that we work together alongside RH/OpenShift engineering in Germany to get this solved; it ain't going away, I can guarantee you that.
Have you also been working with ubuntu support to make sure that there are no known nested virtualization known issues on their side as well? Is the team managing the ubuntu virtualization hosts also involved in solving this in case testing of patched ubuntu kernels/qemus is needed?
Have you also been working with ubuntu support to make sure that there are no known nested virtualization known issues on their side as well? Is the team managing the ubuntu virtualization hosts also involved in solving this in case testing of patched ubuntu kernels/qemus is needed?
Hey @cfergeau, no one is working with Canonical on this issue, and our DevIT folks (castle maintenance support) claim pc-i440fx-bionic has no issues, except when I submit a ticket for this crc issue ... Since the RH acquisition they keep a somewhat low profile about their setup being Canonical-based. It was designed/rolled out long before the RH acquisition by a consulting team from the UK ... hmm, maybe even Canonical people. @gbraad also stated it will be replaced with RH eventually; I've heard rumors as well, but my timeline is now, not 1 year from now. I appreciate all the help and support. I have RHOCP premium support as well, so I can go harass them and they will escalate to the Germany team. Closing issue. -Russ
I solved it by starting the libvirtd service:
2 crc running
[user@fedora ~]$ systemctl status libvirtd
○ libvirtd.service - Virtualization daemon
     Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; disabled; preset: disabled)
     Active: inactive (dead)
TriggeredBy: ○ libvirtd-admin.socket
             ○ libvirtd-tcp.socket
             ○ libvirtd.socket
             ○ libvirtd-tls.socket
             ○ libvirtd-ro.socket
       Docs: man:libvirtd(8)
             https://libvirt.org
[user@fedora ~]$ systemctl start libvirtd
[user@fedora ~]$ systemctl status libvirtd
● libvirtd.service - Virtualization daemon
     Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; disabled; preset: disabled)
     Active: active (running) since Wed 2023-02-22 18:53:50 CST; 2s ago
TriggeredBy: ● libvirtd-admin.socket
             ○ libvirtd-tcp.socket
             ● libvirtd.socket
             ○ libvirtd-tls.socket
             ● libvirtd-ro.socket
       Docs: man:libvirtd(8)
             https://libvirt.org
   Main PID: 5057 (libvirtd)
      Tasks: 20 (limit: 32768)
     Memory: 20.1M
        CPU: 362ms
     CGroup: /system.slice/libvirtd.service
             └─5057 /usr/sbin/libvirtd --timeout 120
Feb 22 18:53:50 fedora systemd[1]: Starting libvirtd.service - Virtualization daemon...
Feb 22 18:53:50 fedora systemd[1]: Started libvirtd.service - Virtualization daemon.
[user@fedora ~]$
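Note that the status output shows the unit as `disabled`, so the manual start would not survive a reboot. A hedged follow-up using standard systemd commands:

```shell
# Start libvirtd now and have it start automatically at boot
sudo systemctl enable --now libvirtd
# On socket-activated setups, enabling the sockets is the usual alternative:
# sudo systemctl enable --now libvirtd.socket libvirtd-ro.socket
```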