flatcar / Flatcar

Flatcar project repository for issue tracking, project documentation, etc.
https://www.flatcar.org/
Apache License 2.0

flatcar boot.mount fails after restart with 3815.2.0 #1417

Open aqilbeig opened 7 months ago

aqilbeig commented 7 months ago

Description

We are migrating our k8s workers to Flatcar 3815.2.0; however, we found that the boot.mount unit fails whenever the VM is rebooted:

× boot.mount - Boot partition
     Loaded: loaded (/usr/lib/systemd/system/boot.mount; static)
     Active: failed (Result: exit-code) since Thu 2024-04-04 16:41:42 UTC; 9min ago
TriggeredBy: ● boot.automount
      Where: /boot
       What: /dev/disk/by-label/EFI-SYSTEM
        CPU: 3ms

Apr 04 16:41:42 ip-10-71-12-10.ec2.internal systemd[1]: Mounting boot.mount - Boot partition...
Apr 04 16:41:42 ip-10-71-12-10.ec2.internal mount[1892]: mount: /boot: unknown filesystem type 'vfat'.
Apr 04 16:41:42 ip-10-71-12-10.ec2.internal mount[1892]:        dmesg(1) may have more information after failed mount system call.
Apr 04 16:41:42 ip-10-71-12-10.ec2.internal systemd[1]: boot.mount: Mount process exited, code=exited, status=32/n/a
Apr 04 16:41:42 ip-10-71-12-10.ec2.internal systemd[1]: boot.mount: Failed with result 'exit-code'.
Apr 04 16:41:42 ip-10-71-12-10.ec2.internal systemd[1]: Failed to mount boot.mount - Boot partition.

Impact

This also causes other services, such as systemd-boot-update and systemd-sysext, to fail, which in turn leaves the node NotReady after reboot.

Failed Units: 6
  boot.mount
  bpf-insights.service
  crio.service
  systemd-boot-update.service
  systemd-sysext.service

× systemd-sysext.service - Merge System Extension Images into /usr/ and /opt/
     Loaded: loaded (/usr/lib/systemd/system/systemd-sysext.service; disabled; preset: disabled)
     Active: failed (Result: exit-code) since Thu 2024-04-04 16:41:42 UTC; 15min ago
       Docs: man:systemd-sysext.service(8)
    Process: 1873 ExecStart=systemd-sysext merge (code=exited, status=1/FAILURE)
   Main PID: 1873 (code=exited, status=1/FAILURE)
        CPU: 9ms

Apr 04 16:41:42 ip-10-71-12-10.ec2.internal systemd[1]: Starting systemd-sysext.service - Merge System Extension Images into /usr/ and /opt/...
Apr 04 16:41:42 ip-10-71-12-10.ec2.internal systemd-sysext[1873]: Failed to read metadata for image docker-flatcar: No such device
Apr 04 16:41:42 ip-10-71-12-10.ec2.internal systemd[1]: systemd-sysext.service: Main process exited, code=exited, status=1/FAILURE
Apr 04 16:41:42 ip-10-71-12-10.ec2.internal systemd[1]: systemd-sysext.service: Failed with result 'exit-code'.
Apr 04 16:41:42 ip-10-71-12-10.ec2.internal systemd[1]: Failed to start systemd-sysext.service - Merge System Extension Images into /usr/ and /opt/.

Flatcar version information:

ip-10-71-12-10 ~ # uname -a
Linux ip-10-71-12-10.ec2.internal 6.1.77-flatcar #1 SMP PREEMPT Mon Feb 12 21:16:07 -00 2024 aarch64 GNU/Linux
ip-10-71-12-10 ~ # cat /etc/os-release
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3815.2.0
VERSION_ID=3815.2.0
BUILD_ID=2024-02-12-2202
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3815.2.0 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="arm64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3815.2.0:*:*:*:*:*:*:*"

Environment and steps to reproduce

Expected behavior

boot.mount should be running after restart

aqilbeig commented 7 months ago

dmesg.txt

jepio commented 7 months ago

Can you upload the full journal contents (sudo journalctl -b0)?

Can you share your ignition file as well?

jepio commented 7 months ago

Are you blocking modprobe somehow? This line from dmesg suggests something is wrong with module loading in general.

[    4.447521] request_module fs-squashfs succeeded, but still no fs?
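
One way to check the symptom described above is to look at /proc/filesystems, which lists the filesystem types currently registered with the running kernel. The following is a minimal sketch; the function name fs_registered and its optional file argument are illustrative and not part of any Flatcar tooling.

```shell
#!/bin/sh
# Succeeds if the filesystem type is registered with the running kernel.
# Usage: fs_registered <fstype> [file]   (file defaults to /proc/filesystems)
fs_registered() {
    # The type name is the last field on each line; pseudo filesystems
    # carry a leading "nodev" column.
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}
```

Assuming vfat is built as a module, as the mount error in this report suggests, running `fs_registered vfat || echo "vfat is not registered"` on an affected node would be expected to print the message, because a module-backed filesystem only shows up in the list once its module has been loaded.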

aqilbeig commented 7 months ago

Output of cat /etc/modprobe.d/blacklist.conf:

blacklist cramfs  # CIS v2.0.0 1.1.1.1
blacklist freevxfs  # CIS v2.0.0 1.1.1.2
blacklist jffs2  # CIS v2.0.0 1.1.1.3
blacklist hfs  # CIS v2.0.0 1.1.1.4
blacklist hfsplus  # CIS v2.0.0 1.1.1.5
# Docker and Containerd are now sysext images built with squashfs
# blacklist squashfs  # CIS v2.0.0 1.1.1.6
blacklist udf  # CIS v2.0.0 1.1.1.7
blacklist vfat  # CIS v2.0.0 1.1.1.8
blacklist usb-storage  # CIS v2.0.0 1.1.23
blacklist dccp  # CIS v2.0.0 3.4.1
blacklist sctp  # CIS v2.0.0 3.4.2
blacklist rds  # CIS v2.0.0 3.4.3
blacklist tipc  # CIS v2.0.0 3.4.4

jepio commented 7 months ago

Please remove these lines:

blacklist squashfs  # CIS v2.0.0 1.1.1.6
blacklist vfat  # CIS v2.0.0 1.1.1.8

jepio commented 7 months ago

And check that you don't also have an entry like this:

install squashfs /bin/true

aqilbeig commented 7 months ago

install squashfs /bin/true

cpt-master-ethos11thrashor1-890 ~ # cat /etc/modprobe.d/squashfs.conf
install squashfs /bin/true

Do we have to remove it from here as well?

aqilbeig commented 7 months ago

@jepio thanks a lot for the quick replies.

jepio commented 7 months ago

> install squashfs /bin/true
>
> cpt-master-ethos11thrashor1-890 ~ # cat /etc/modprobe.d/squashfs.conf
> install squashfs /bin/true
>
> Do we have to remove it from here as well ^^

Yes, definitely. These modifications are directly responsible for the errors you are seeing. Also remove anything that says this:

install vfat /bin/true

May I ask why you have these config files?
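
The entries discussed in this thread can also be located mechanically. The sketch below scans the *.conf files in a modprobe.d-style directory for lines that either blacklist a module or replace its install command with /bin/true or /bin/false. The function name find_blocking_entries is hypothetical, and the pattern only covers these two forms (the /bin/false variant is an assumption, since only /bin/true appears in this report), so it may miss other ways of disabling a module.

```shell
#!/bin/sh
# Print modprobe.d entries that block the given module, as "file:line:text".
# Usage: find_blocking_entries <config-dir> <module>
find_blocking_entries() {
    dir=$1
    module=$2
    # Matches "blacklist <module>" and "install <module> /bin/true|false".
    # Commented-out lines are skipped because the keyword must come first.
    pattern="^[[:space:]]*(blacklist[[:space:]]+${module}|install[[:space:]]+${module}[[:space:]]+/bin/(true|false))([[:space:]]|\$)"
    for f in "$dir"/*.conf; do
        [ -e "$f" ] || continue          # directory has no *.conf files
        grep -nHE "$pattern" "$f"
    done
    return 0
}
```

For the configuration shown in this issue, `find_blocking_entries /etc/modprobe.d vfat` and `find_blocking_entries /etc/modprobe.d squashfs` would be the two calls of interest. Any line they print is a candidate for removal.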

aqilbeig commented 7 months ago

This is because of the CIS standards we are following: CIS-1.1.1.6 (Ensure mounting of squashfs filesystems is disabled).

jepio commented 7 months ago

Can you share more? How can I verify for myself what change this CIS standard requests? And are all of these changes applied manually by you, or is some tool generating the configs?

Please be careful with this kind of hardening approach; there may be more things here that subtly break your system.