coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

libvirt VM fails to boot #1636

Open ianb-mp opened 9 months ago

ianb-mp commented 9 months ago

Describe the bug

CoreOS fails to boot under some conditions, depending on the libvirt configuration. When it fails to boot, it drops to the emergency shell. I see this error in rdsosreport.txt (full report attached):

[    3.062339] localhost systemd[1]: Starting dbus-broker.service - D-Bus System Message Bus...
[    3.069511] localhost dbus-broker-launch[540]: Missing configuration file in /usr/share/dbus-1/system.conf +1: /usr/share/dbus-1/system.conf
[    3.070515] localhost dbus-broker-launch[540]: ERROR run @ ../src/launch/main.c +152: Return code 1
[    3.070515] localhost dbus-broker-launch[540]:       main @ ../src/launch/main.c +178
[    3.070515] localhost dbus-broker-launch[540]: Exiting due to fatal error: -131
[    3.075819] localhost systemd[1]: dbus-broker.service: Main process exited, code=exited, status=1/FAILURE
[    3.076716] localhost systemd[1]: dbus-broker.service: Failed with result 'exit-code'.
[    3.082023] localhost systemd[1]: Failed to start dbus-broker.service - D-Bus System Message Bus.
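
A quick way to pull the failing units out of a long rdsosreport.txt, as a rough sketch. It assumes systemd's usual "<unit>: Failed with result ..." message format seen in the excerpt above; the function name `failed_units` is made up for illustration.

```shell
# failed_units [FILE...]
# Print the name of every unit that systemd (PID 1) reported as
# "Failed with result ...", one per line, without duplicates.
# Reads standard input when no FILE is given.
failed_units() {
    sed -n 's/.*systemd\[1\]: \(.*\): Failed with result .*/\1/p' "$@" | sort -u
}
```

For example, `failed_units rdsosreport.txt` would list each failed unit once, which narrows down where to start reading.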

The network interface is down, but I am able to configure it manually from the emergency shell and send/receive traffic. I can also mount the disk and read/write to it, so the VM hardware seems fine.

Reproduction steps

Launch a new VM with these parameters:

virt-install \
--virt-type=kvm \
--name ianb-okd-bs \
--ram 4096 \
--disk path=/data/var/libvirt/images/ianb-okd-bs.qcow2,bus=virtio,size=100 \
--vcpus 2 \
--os-type linux \
--os-variant fedora-coreos-stable \
--network bridge=br0 \
--graphics vnc,password=xxxxxx \
--noautoconsole \
--pxe \
--boot hd,network \
--console pty,target_type=serial

Expected behavior

VM should boot normally without error.

Actual behavior

VM fails to boot and drops to emergency shell as described earlier.

I am able to boot successfully if I replace --os-variant fedora-coreos-stable with --os-variant rhl9. I've attached xmldumps for both versions below in 'additional information'.

I've tested on two separate VM hosts with different OS, hardware etc and the same issue occurred on both.

System details

Bare metal host, kvm/qemu guest VM

The VM is booted via PXE.

VM Guest kernel cmdline:


BOOT_IMAGE=http://10.2.3.2/openshift/fedora-coreos-38.20231002.3.1-live-kernel-x86_64 initrd=http://10.2.3.2/openshift/fedora-coreos-38.20231002.3.1-live-initramfs.x86_64.img coreos.live.rootfs_url=http://10.2.3.2/openshift/fedora-coreos-38.20231002.3.1-live-rootfs.x86_64.img coreos.inst.install_dev=/dev/vda coreos.inst.ignition_url=http://10.2.3.2/openshift/bootstrap.ign rd.neednet=1 ip=dhcp BOOTIF=01-52-54-00-88-df-2f

VM Host environment:

$ virsh version
Compiled against library: libvirt 9.5.0
Using library: libvirt 9.5.0
Using API: QEMU 9.5.0
Running hypervisor: QEMU 8.0.0

$ uname -a
Linux bne-lab-vr-2 5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Nov 8 17:36:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

OS: Rocky Linux 9.3

Butane or Ignition config

No response

Additional information

xmldump - fedora-coreos-stable: ianb-okd-bs 3c3087d7-037b-4554-8111-7323a6f08186 4194304 4194304 2 /machine hvm destroy destroy destroy /usr/bin/qemu-system-x86_64

xmldump - rhl9: ianb-okd-bs e16f329e-c783-4302-8e40-2da05fc1e934 4194304 4194304 2 /machine hvm destroy destroy destroy /usr/bin/qemu-system-x86_64

dustymabe commented 9 months ago

Can you reproduce this with official FCOS images? If so can you provide the full filename of the qcow2 you are using where you are seeing this problem?

dustymabe commented 9 months ago

Oh, I see you are using PXE. In that case, maybe the filenames and sha256sums of the kernel, initrd, and rootfs files.

ianb-mp commented 9 months ago

> Can you reproduce this with official FCOS images? If so can you provide the full filename of the qcow2 you are using where you are seeing this problem?

AFAIK the images I'm using are official FCOS images, downloaded from here. The qcow2 image is on the VM host, located at: /data/var/libvirt/images/ianb-okd-bs.qcow2

sha256sums:

e602806b53e09a898079eaf5e216cc610dde19e8a3f8fb6598aae02286203cd7  fedora-coreos-38.20231002.3.1-live-initramfs.x86_64.img
af72f61a3571dca6515182320b37afedf881d5c51e1c2b07be39f227849d8bd8  fedora-coreos-38.20231002.3.1-live-kernel-x86_64
b80c63e56bdf19a42d0a6aa8062577ac5dd6cad7cff7e47aa2cc5d1d589589ee  fedora-coreos-38.20231002.3.1-live-rootfs.x86_64.img

I've also tried using the latest v39 images, but saw the same issue:

e50a45a3da2face2a808cfccc3318f7d0a9ad8d04c3845f798dc19646bcf4138  fedora-coreos-39.20231119.3.0-live-initramfs.x86_64.img
3efd416a625571345cbb330b434272b99813a1146cc9b4eaf7bfc8e29317c986  fedora-coreos-39.20231119.3.0-live-kernel-x86_64
a7bbd175100d567cd7e27130834990a837b49bf3181d41106b9dd967d1c1058d  fedora-coreos-39.20231119.3.0-live-rootfs.x86_64.img
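
For anyone wanting to check their own PXE artifacts against a list of checksums like the ones above, here is a minimal sketch using `sha256sum -c`. The function name `verify_artifacts` and the checksum file name in the usage note are placeholders, not part of any FCOS tooling.

```shell
# verify_artifacts DIR SUMFILE
# Check every file listed in SUMFILE ("hash  filename" lines, as printed
# by sha256sum) against the copies found in DIR.
# Returns 0 only if all listed files exist and match.
verify_artifacts() {
    dir=$1
    sumfile=$2
    # The redirection is opened before the cd, so SUMFILE may be a
    # relative path; the subshell keeps the caller's directory unchanged.
    ( cd "$dir" && sha256sum -c --quiet - ) < "$sumfile"
}
```

Usage would look like `verify_artifacts /path/to/pxe/files fcos.sha256`, where `fcos.sha256` holds the three lines for one release. A non-zero exit status means at least one file is missing or differs.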

ianb-mp commented 9 months ago

I've been looking at the differences between the fedora-coreos-stable and rhl9 OS variants, and noticed that rhl9 uses the e1000 driver by default. I've tested using the fedora-coreos-stable variant but setting e1000 for the NIC, and it works, i.e.:

--network bridge=br0,model=e1000

Furthermore, after using e1000 for the installation, I can shut down the VM, swap back to virtio, and CoreOS boots with the network working. So the question is: why can't CoreOS make use of the virtio NIC model during the installation phase?

As I said earlier, the virtio NIC does work when I manually configure it from the emergency shell, e.g.:

$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp1s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:54:00:88:df:2f brd ff:ff:ff:ff:ff:ff
$ ip link set enp1s0 up
$ ip addr add 10.8.55.101/24 dev enp1s0
$ ip route add default via 10.8.55.1
$ curl -I http://10.2.3.2
HTTP/1.1 200 OK

dustymabe commented 9 months ago

It's possible that there is some device initialization that is taking longer for one versus the other. Getting more logs of early boot would help. I would add rd.debug to the kernel command line, which will put NetworkManager into debug mode in the initramfs, and then compare the output between the e1000 and the virtio cases to see if there's anything to glean. Unfortunately, rd.debug will produce a ton of output, so getting down to the NetworkManager logs might be tricky. Good luck!
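
One possible way to cut the rd.debug output down to the NetworkManager lines and compare the two cases, as a sketch only. The function names are invented for illustration, and it assumes NetworkManager's log lines contain the literal string "NetworkManager". Timestamps will differ between boots, so the diff will be noisy unless they are stripped first.

```shell
# nm_lines FILE
# Print only the lines of FILE that mention NetworkManager.
nm_lines() {
    grep -F 'NetworkManager' "$1"
}

# compare_nm GOOD_LOG BAD_LOG
# Show a unified diff of the NetworkManager lines of two boot logs.
# Exit status follows diff: 0 if identical, 1 if they differ.
compare_nm() {
    good=$(mktemp)
    bad=$(mktemp)
    nm_lines "$1" > "$good"
    nm_lines "$2" > "$bad"
    diff -u "$good" "$bad"
    status=$?
    rm -f "$good" "$bad"
    return $status
}
```

With a log saved from each boot, `compare_nm e1000.log virtio.log` (hypothetical file names) prints only the NetworkManager differences.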

ianb-mp commented 9 months ago

Here are the two boot logs with rd.debug enabled:

journalctl_e1000.txt - successful boot
rdsosreport_virtio.txt - failed boot

I've cropped the timestamp prefix from each file to make them easier to compare with diff. I had a look but couldn't see anything conclusive. Hopefully someone with more experience in these things will be able to spot the problem!
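
For reference, one way this kind of timestamp cropping could be done is sketched below. It only handles the bracketed "[    3.062339] " prefix seen in the rdsosreport excerpt earlier in this issue; the exact commands used for the attached files are not stated, and `strip_ts` is a made-up name.

```shell
# strip_ts
# Copy standard input to standard output, removing a leading
# "[   12.345678] " style timestamp from each line.
# Lines without such a prefix pass through unchanged.
strip_ts() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] //'
}
```

For example, `strip_ts < rdsosreport.txt > stripped.txt` produces a copy that can be fed to `diff -u` against a log from a successful boot that was stripped the same way.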