coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

`rdcore bind-boot` fails with "Error: Expected one vendor dir" #1116

Closed iluminae closed 2 years ago

iluminae commented 2 years ago

Describe the bug

After installing via PXE, some host models hit this error after the reboot: Error: Expected one vendor dir on /dev/sda2, got 2. I have four hosts: two PowerEdge R415 and two PowerEdge R210 II. The error occurs on both R210s, while the same process works on the R415s.

Reproduction steps

Steps to reproduce the behavior:

  1. PXE boot the server with the kernel, the initrd, and this kernel command line: console=ttyS1,115200n8 coreos.live.rootfs_url=%s coreos.inst.install_dev=%s coreos.inst.ignition_url=%s
  2. Tried coreos.inst.install_dev with both a /dev/disk/by-id path and a plain /dev/sdX path; both behave the same.

Expected behavior

After reboot, I expect it to boot.

Actual behavior

Install appears fine, triggers reboot, boots with no output and reboots again, then boots into an error: coreos-boot-edit[900]: Error: Expected one vendor dir on /dev/sda2, got 2.

System details

Ignition config (Butane source)

variant: fcos
version: 1.4.0
kernel_arguments:
  should_exist:
  - mitigations=off
  - selinux=0
  - console=ttyS1,115200n8
  should_not_exist:
  - console=ttyS0,115200n8
  - mitigations=auto,nosmt
passwd:
  users:
  - name: REDACTED
    password_hash: REDACTED
    ssh_authorized_keys:
    - REDACTED
    groups:
    - "sudo"
    - "wheel"
storage:
  files:
  - path: /etc/hostname
    mode: 420
    overwrite: true
    contents:
      inline: "dosd-00"
  - path: /etc/ssh/sshd_config.d/20-enable-passwords.conf
    mode: 0644
    overwrite: true
    contents:
      inline: |
        PasswordAuthentication yes
  - path: /etc/NetworkManager/system-connections/eno2.nmconnection
    mode: 0600
    overwrite: true
    contents:
      inline: |
        [connection]
        id=eno2
        type=ethernet
        interface-name=eno2
        [ethernet]
        mtu=9000
        [ipv4]
        address1=192.168.6.0/24
        dns-search=
        may-fail=false
        method=manual
        never-default=true
        route1=192.168.6.0/24,192.168.6.0

jlebon commented 2 years ago

Can you provide the full logs of the failing boot?

> Install appears fine, triggers reboot, boots with no output and reboots again, then boots into an error

The double reboot here is expected. During the first reboot, no logs are available because the console kargs haven't been added yet. Very early in that boot, Ignition applies them and immediately reboots again. You can avoid the double reboot by having coreos-installer itself perform the karg modifications (using `coreos-installer pxe customize`, for example). (It'd make sense to have it automatically apply kargs from the target Ignition config as an optimization. I think this was discussed $somewhere but I can't find it right now; will look for it and otherwise file an RFE. Edit: https://github.com/coreos/coreos-installer/issues/797)

> Expected one vendor dir

This error comes from the first boot trying to "bind" the bootloader to the bootfs. The error message here is not super helpful (improvement in https://github.com/coreos/coreos-installer/pull/796), but it's saying that there are multiple vendor directories under EFI/ on the EFI system partition. That is not the case on the EFI partition FCOS ships.
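
As a rough illustration of the kind of check that produces this message, here is a minimal Python sketch. It is not the coreos-installer code (which is written in Rust). The names `find_vendor_dirs`, `strict_vendor_dir`, and `make_esp` are hypothetical, the `OtherVendor` directory is made up, and skipping the `BOOT` fallback directory is an assumption.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def find_vendor_dirs(efi_root: Path) -> list[str]:
    """Return the names of all subdirectories of EFI/ except the BOOT fallback dir."""
    return sorted(
        p.name
        for p in efi_root.iterdir()
        if p.is_dir() and p.name.upper() != "BOOT"  # assumption: BOOT is skipped
    )


def strict_vendor_dir(efi_root: Path, device: str) -> str:
    """Insist on exactly one vendor dir; any extra directory is an error."""
    dirs = find_vendor_dirs(efi_root)
    if len(dirs) != 1:
        raise RuntimeError(f"Expected one vendor dir on {device}, got {len(dirs)}")
    return dirs[0]


def make_esp(root: Path, files: list[str]) -> Path:
    """Demo helper: create an EFI/ tree containing the given empty files."""
    efi = root / "EFI"
    for rel in files:
        path = efi / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    return efi


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # One bootloader dir plus one unrelated dir trips the strict check.
        efi = make_esp(Path(tmp), [
            "BOOT/BOOTX64.EFI",
            "fedora/grub.cfg",
            "OtherVendor/cache.dat",
        ])
        try:
            strict_vendor_dir(efi, "/dev/sda2")
        except RuntimeError as err:
            print(f"Error: {err}")
```

With a check this strict, any directory that something other than the installer creates under EFI/ is enough to fail the boot, even if it has nothing to do with the bootloader.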

Are there any other disks connected to the systems with a boot filesystem label? E.g. a previous OS installation. Usually, the error message in that condition should've been clearer, so it's possible there's something more subtle going on here.

iluminae commented 2 years ago

Yea I figured the double boot was not the issue. There are 3 disks I have attached to the system, but I have also tried completely disconnecting the other 2 before boot, same result.

I will try to get the logs. I am on an iDRAC with serial redirection and they look pretty rough.

From the emergency shell in the initrd, I see only the labels made by the installer:

:/root# ls /dev/disk/by-label/ -l
total 0
lrwxrwxrwx 1 root root  10 Mar  8 02:43 EFI-SYSTEM -> ../../sda2
lrwxrwxrwx 1 root root  10 Mar  8 02:43 boot -> ../../sda3
lrwxrwxrwx 1 root root  10 Mar  8 02:43 root -> ../../sda4

:/root# ls /dev/disk/by-partlabel/ -l
total 0
lrwxrwxrwx 1 root root  10 Mar  8 02:43 BIOS-BOOT -> ../../sda1
lrwxrwxrwx 1 root root  10 Mar  8 02:43 EFI-SYSTEM -> ../../sda2
lrwxrwxrwx 1 root root  10 Mar  8 02:43 boot -> ../../sda3
lrwxrwxrwx 1 root root  10 Mar  8 02:43 root -> ../../sda4

We have 3 partitions there related to booting, and the error is referring only to /dev/sda2. I have tried booting the system both with EFI and with BIOS booting, same outcome.

jlebon commented 2 years ago

> Yea I figured the double boot was not the issue. There are 3 disks I have attached to the system, but I have also tried completely disconnecting the other 2 before boot, same result.

Interesting, thanks for testing that.

From the emergency shell, can you mount /dev/sda2 somewhere and show the output of ls $mnt/EFI?

iluminae commented 2 years ago

Here you go:

:/root# find mnt
mnt
mnt/EFI
mnt/EFI/BOOT
mnt/EFI/BOOT/BOOTX64.EFI
mnt/EFI/BOOT/fbx64.efi
mnt/EFI/fedora
mnt/EFI/fedora/BOOTX64.CSV
mnt/EFI/fedora/shim.efi
mnt/EFI/fedora/shimx64.efi
mnt/EFI/fedora/grubx64.efi
mnt/EFI/fedora/mmx64.efi
mnt/EFI/fedora/grub.cfg
mnt/EFI/Dell
mnt/EFI/Dell/BootOptionCache
mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat

Sorry I've had a silly time getting the boot log off the box, I just need to find a USB stick or something.

EDIT: I checked the same on the 2 servers that did work and yes, they are missing the Dell directory.

# on _working_ boxes
# find mnt/
mnt/
mnt/EFI
mnt/EFI/BOOT
mnt/EFI/BOOT/BOOTX64.EFI
mnt/EFI/BOOT/fbx64.efi
mnt/EFI/fedora
mnt/EFI/fedora/BOOTX64.CSV
mnt/EFI/fedora/shim.efi
mnt/EFI/fedora/shimx64.efi
mnt/EFI/fedora/grubx64.efi
mnt/EFI/fedora/mmx64.efi
mnt/EFI/fedora/grub.cfg
mnt/EFI/fedora/bootuuid.cfg

So - what adds that?

iluminae commented 2 years ago

I have tried to just delete that EFI/Dell directory from /dev/sda2 but it comes back every time it reboots. This is not some magic dell thing is it?

jlebon commented 2 years ago

> mnt/EFI/Dell
> mnt/EFI/Dell/BootOptionCache
> mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat

Ahh nice. Yup, this is the issue. We currently don't expect that.

> This is not some magic dell thing is it?

Information on this is really scarce, so I can't say for sure, but yes it does smell a lot like a magic Dell thing.

Anyway, we should be more lax on our side here given this information, so I'll look at tweaking our heuristics.
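
One way a more lax heuristic could look, sketched in Python for illustration only. This is an assumption about the general idea, not a description of what the eventual coreos-installer change does: instead of counting every directory under EFI/, it only counts directories that contain a `grub.cfg`. The names `bootloader_vendor_dirs` and `make_esp` are hypothetical.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def bootloader_vendor_dirs(efi_root: Path) -> list[str]:
    """Return EFI/ subdirectories that look like a GRUB vendor dir.

    Hypothetical heuristic: a directory only counts if it contains grub.cfg,
    so firmware-created directories such as EFI/Dell are ignored.
    """
    return sorted(
        p.name
        for p in efi_root.iterdir()
        if p.is_dir()
        and p.name.upper() != "BOOT"  # assumption: BOOT fallback dir is skipped
        and (p / "grub.cfg").is_file()
    )


def make_esp(root: Path, files: list[str]) -> Path:
    """Demo helper: create an EFI/ tree containing the given empty files."""
    efi = root / "EFI"
    for rel in files:
        path = efi / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    return efi


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A subset of the layout reported on the failing hosts in this thread.
        efi = make_esp(Path(tmp), [
            "BOOT/BOOTX64.EFI",
            "fedora/grub.cfg",
            "fedora/shimx64.efi",
            "Dell/BootOptionCache/BootOptionCache.dat",
        ])
        print(bootloader_vendor_dirs(efi))  # prints ['fedora']
```

Under this sketch the Dell directory no longer counts, so a check for exactly one vendor directory would pass on the failing hosts.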

jlebon commented 2 years ago

> Anyway, we should be more lax on our side here given this information, so I'll look at tweaking our heuristics.

https://github.com/coreos/coreos-installer/pull/802

wkruse commented 2 years ago

This issue is also happening on Dell PowerEdge R630 and R640.

iluminae commented 2 years ago

Hey @jlebon, any word on when coreos/coreos-installer#802 is getting in? I had to fall back to a different OS for this class of host and I would like to get them all homogeneous.

jlebon commented 2 years ago

Hi @iluminae, apologies for the delay. We had some CI issues but they should be fixed now. Once the patch is in, I'll make sure it ends up in next week's releases.

dustymabe commented 2 years ago

The fix for this went into testing stream release 35.20220327.2.0. Please try out the new release and report issues.

log1cb0mb commented 2 years ago

The issue also occurs on the FC640 and MX740c. I tested the testing release mentioned above on the FC640 and it seems to be fixed.

dustymabe commented 2 years ago

Thanks for reporting the info and success @log1cb0mb!

dustymabe commented 2 years ago

The fix for this went into stable stream release 35.20220327.3.0.