flatcar / Flatcar

Flatcar project repository for issue tracking, project documentation, etc.
https://www.flatcar.org/
Apache License 2.0

Unable to update Flatcar server - /boot full #1461

Open bignay2000 opened 1 month ago

bignay2000 commented 1 month ago

Description

update_engine_client -update incorrectly reports no update available when trying to update from stable 3815.2.1 to stable 3815.2.3.

flatcar-update --to-version 3815.2.3 correctly fails but with a generic "Error: update failed" error.

df -h shows /boot is 100% used. Compared to other servers, this one has an extra 53 MB file, /boot/FSCK0000.REC. /boot has only 127 MB of total space.

jenkinsdockerslave-n1 ~ # update_engine_client -update
I0530 03:50:19.817471  2270 update_engine_client.cc:251] Initiating update check and install.
I0530 03:50:19.818826  2270 update_engine_client.cc:256] Waiting for update to complete.
LAST_CHECKED_TIME=1717041020
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_IDLE
NEW_VERSION=0.0.0
NEW_SIZE=0
I0530 03:50:24.913209  2270 update_engine_client.cc:194] No update available
jenkinsdockerslave-n1 ~ # flatcar-update --to-version 3815.2.3
Warning: found hardcoded GROUP=stable in /etc/flatcar/update.conf - make sure it fits the release channel you want to follow
Downloading update payloads...
When restarting after an error you may reuse them with '--to-payload /var/tmp/flatcar-update/flatcar_production_update.gz --extension ' (add --extension before each extension)
Forcing update...
Error: update failed
jenkinsdockerslave-n1 ~ # 
hiveadmin@jenkinsdockerslave-n1 ~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           1.6G  532K  1.6G   1% /run
/dev/sda9        13G  898M   12G   8% /
sysext          3.9G   12K  3.9G   1% /usr
/dev/sda6       128M  936K  123M   1% /oem
overlay          13G  898M   12G   8% /etc
tmpfs           3.9G     0  3.9G   0% /media
tmpfs           3.9G   32K  3.9G   1% /tmp
/dev/sdf         32G   24K   30G   1% /srv/dockerlogs
/dev/sde        9.8G   24K  9.3G   1% /srv/dockervolumes
/dev/sdb         32G  9.7G   20G  33% /var/lib/docker
/dev/sdd         32G   24K   30G   1% /srv/dockerbackups
/dev/sdc        974M  208K  907M   1% /srv/dockercompose
/dev/sda1       127M  127M  1.0K 100% /boot

Impact

Server is configured to check for updates on Thursdays and, if one is found, apply it and reboot. This server missed 2 updates: it should have automatically applied stable 3815.2.2 and then 3815.2.3.

Unable to manually update to stable 3815.2.3.

Environment and steps to reproduce

Server was built on Sun Sep 3 16:57:33 2023.
Flatcar Container Linux by Kinvolk stable 3815.2.1.
Flatcar runs in a VM on Proxmox 8.2.2.

  1. Create a Flatcar VM
  2. Create a file that fills the /boot partition to 100%
  3. Reboot
  4. Try to update Flatcar via update_engine_client -update command
  5. Try to update Flatcar via flatcar-update --to-version 3815.2.3 command
  6. Update /etc/flatcar/update.conf with a LOCKSMITHD schedule and have this schedule trigger an update.

Note that FSCK can be configured to run only every X reboots, so it may be necessary to check whether Flatcar runs FSCK on every boot, or what else triggers it.
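
Step 2 of the reproduction could be sketched roughly as below. This is only an illustration for a disposable test VM: `fill_dir` and the `filler.bin` file name are hypothetical, and the size has to be chosen to match the free space actually left on /boot.

```shell
# Hypothetical helper (not a Flatcar tool): write a zero-filled file of
# SIZE_MIB mebibytes into DIR. If the filesystem fills up first, dd stops
# with an error and leaves a partial file behind, which is the point here.
fill_dir() {
    dd if=/dev/zero of="$1/filler.bin" bs=1048576 count="$2" 2>/dev/null
}
```

On a throwaway VM this might be run as `fill_dir /boot 60`, with `df -h /boot` before and after to confirm the result.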

Expected behavior

  1. If /boot is full then update_engine_client -update should return an out of disk space error
  2. update_engine_client -update, flatcar-update --to-version 3815.2.3, and the scheduled LOCKSMITHD update should all have the same error handling and console output. Ideally they should share the same function(s).
  3. FSCK .REC files should write to a different drive and path as /boot is only 127 MB.
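
Until the update tools report this themselves, a wrapper script could run its own free-space check first. The following is a minimal sketch, not how update_engine works: `free_kib` and `require_free_kib` are hypothetical names, and the threshold passed in is an assumption the caller has to pick.

```shell
# Print the free space of the filesystem holding DIR, in KiB.
# Relies on the POSIX output format of `df -P` (column 4 is "Available").
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical pre-flight check: return non-zero with a descriptive message
# when DIR has less than NEED_KIB free.
require_free_kib() {
    avail=$(free_kib "$1")
    if [ "$avail" -lt "$2" ]; then
        echo "error: $1 has only ${avail} KiB free, need at least $2 KiB" >&2
        return 1
    fi
}
```

For example, `require_free_kib /boot 40960 && update_engine_client -update` would start the update only if roughly 40 MiB are free; that number is illustrative only.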

Additional information

jenkinsdockerslave-n1 ~ # cat /etc/flatcar/update.conf
GROUP=stable
REBOOT_STRATEGY="reboot"
LOCKSMITHD_REBOOT_WINDOW_START="Thu 04:00"
LOCKSMITHD_REBOOT_WINDOW_LENGTH="1h"

ader1990 commented 1 month ago

Hello,

There are two issues here that need to be checked.

Firstly, as a baseline, I verified the upgrade path from 3815.2.1 to stable 3815.2.3: /boot was about 47% full before the update and 89% full after it. This means that either your /boot partition was already about 50% full before the update, or something during the update process consumed more space than it should have. Can you confirm that /boot usage before the update was around 47%? Also, please provide the output of the following tree command so we can investigate what is consuming that space (more info at https://github.com/flatcar/scripts/pull/1731):

tree -sifF  /boot | sort -k2 -rn
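
If tree is not available on the machine, a similar size-sorted listing can be produced with standard tools. A minimal sketch, with `list_by_size` as a hypothetical helper name; it prints the size in bytes followed by the path, largest file first.

```shell
# List every regular file under DIR as "<bytes> <path>", largest first.
# -xdev keeps find on the same filesystem as DIR.
list_by_size() {
    find "$1" -xdev -type f -exec sh -c '
        for f in "$@"; do
            printf "%s %s\n" "$(wc -c < "$f" | tr -d " ")" "$f"
        done' sh {} + | sort -rn
}
```

Running `list_by_size /boot` should show at a glance which file is taking up the space.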

Secondly, the update client should be fixed to error out and present a descriptive error message when this situation is happening, and then revert the update.

Can you please provide the url to the image you first used to see if the issue can be properly reproduced? I assume you have Proxmox with KVM as a hypervisor.

Thank you.

ader1990 commented 1 month ago

@bignay2000 you can install the tree binary on an Ubuntu 22.04 machine (apt install tree) and just copy /usr/bin/tree over to Flatcar via SCP; it will work out of the box, as it is a small static binary.

bignay2000 commented 1 month ago

scp fails to be able to copy a file from my Mac to this server, so not able to get tree on the server.

I was able to do a "tar -cvpzf boot.tar.gz boot" to gzip up the boot folder and transfer the file via http. Below is a screenshot.

image

bignay2000 commented 1 month ago

Any idea what this FSCK0000.REC file does and any idea why it's dated 1/1/1980?

image

bignay2000 commented 1 month ago

File system consistency check: fsck appears to have run and left this file behind. https://en.wikipedia.org/wiki/Fsck - per this article, Btrfs can trigger an automatic consistency check and then leave this behind.

So it looks like there needs to be a third fix here, to write *.REC files somewhere else on the drive?

bignay2000 commented 1 month ago

I am trying to leave this server broken so we can figure out the root cause, so I don't want to fix the scp issue just to get tree onto the file system. Let me know if I can provide more information.

ader1990 commented 1 month ago

> I am trying to leave this server broken so we can figure out root cause.. so don't want to fix the scp issue trying to get tree on the file system... Let me know if I can provide more information

It is great that you found the file that gobbled up the space; tree was just a suggestion to reach the same outcome.

The only cause I can think of is data corruption, triggered by an unknown hardware problem or perhaps by transient network issues.

Maybe @jepio has more insight into this change, and whether it could have played a role if this is a software issue (https://github.com/flatcar/update_engine/commit/53262f6ac6e641777526815d35eff0154f688500#diff-3f98f459f8b031989459690e30f38f48446d57c3355aaec9badbb98621c4832dR53)?

journalctl might hold some information on the matter, although that is unlikely if the instance has been rebooted. Still, it is worth a try. Can you please share the full journalctl log? If it is not possible to share the full log, maybe you can grep it for btrfs, fsck, or REC.
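
The suggested grep could look like the sketch below. `filter_fs_events` is a hypothetical helper name; it reads log lines on standard input and keeps the ones mentioning btrfs, fsck, or a .REC file, case-insensitively.

```shell
# Keep only log lines that mention btrfs, fsck, or a ".rec" file name.
# -i makes the match case-insensitive, so FSCK0000.REC matches as well.
filter_fs_events() {
    grep -iE 'btrfs|fsck|\.rec'
}
```

It would typically be fed from the journal, for example `journalctl --no-pager | filter_fs_events`. Assuming the journal is persistent on this machine, `journalctl --list-boots` shows which earlier boots are still available, and `journalctl -b -1` selects the previous one.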

Thanks.

ader1990 commented 1 month ago

> scp fails to be able to copy a file from my Mac to this server, so not able to get tree on the server.
>
> I was able to do a "tar -cvpzf boot.tar.gz boot" and gunzip up the boot folder and transfer the file via http. Below is a screenshot.
>
> image

Also, note the size of vmlinuz-a, which I think was vmlinuz-b during the upgrade: it has a trimmed-down size of only 16MB, which means that the FSCK0000.REC file was already there before the upgrade was attempted.

jepio commented 1 month ago

@ader1990 the btrfs change is for mounting the /usr partitions; I don't think it can make a difference here.

@bignay2000 can you share the outputs of df -h and mount?

/boot/ is a FAT filesystem (has to be due to EFI requirements), the REC file can be created by fsck.vfat (https://man7.org/linux/man-pages/man8/fsck.vfat.8.html):

fsck0000.rec, fsck0001.rec, ...
           When recovering from a corrupted filesystem, fsck.fat dumps
           recovered data into files named fsckNNNN.rec in the top level
           directory of the filesystem.

Did your machine crash during an update? Check journalctl -g fsck.

Otherwise I don't think there is any more forensics to do here: remove the REC file (we may want to handle this case better by automatically cleaning them up) and then retry the update.
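
A cleanup along these lines could be sketched as follows. This is an illustration only, not an existing Flatcar mechanism: `list_fsck_rec` and `remove_fsck_rec` are hypothetical names, and the pattern follows the fsckNNNN.rec naming from the man page excerpt above, matched case-insensitively because the file on this server is upper-case.

```shell
# List fsck.fat recovery files (fsckNNNN.rec, any case) in the top level
# of DIR only; subdirectories are not searched.
list_fsck_rec() {
    find "$1" -maxdepth 1 -type f -iname 'fsck[0-9][0-9][0-9][0-9].rec'
}

# Remove the same set of files. Review the listing first, since the
# recovered data cannot be restored afterwards.
remove_fsck_rec() {
    find "$1" -maxdepth 1 -type f -iname 'fsck[0-9][0-9][0-9][0-9].rec' \
        -exec rm -f -- {} +
}
```

A cautious sequence would be `list_fsck_rec /boot`, a look at the result, and only then `remove_fsck_rec /boot`.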

bignay2000 commented 4 weeks ago

hiveadmin@jenkinsdockerslave-n1 ~ $ mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=4096k,nr_inodes=1010827,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,size=1627260k,nr_inodes=819200,mode=755)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime,seclabel)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
/dev/sda9 on / type ext4 (rw,relatime,seclabel)
/dev/mapper/usr on /usr type btrfs (ro,relatime,seclabel,rescue=nologreplay,space_cache=v2,subvolid=5,subvol=/)
/dev/sda6 on /oem type btrfs (rw,nodev,relatime,seclabel,space_cache=v2,subvolid=5,subvol=/)
overlay on /etc type overlay (rw,noatime,seclabel,lowerdir=/sysroot/usr/share/flatcar/etc,upperdir=/sysroot/etc,workdir=/sysroot/.etc-work,metacopy=off)
selinuxfs on /sys/fs/selinux type selinuxfs (rw,nosuid,noexec,relatime)
systemd-1 on /boot type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=14618)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=12276)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime,seclabel)
tmpfs on /media type tmpfs (rw,nosuid,nodev,noexec,relatime,seclabel)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime,seclabel)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime,seclabel)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,seclabel,size=4068152k,nr_inodes=1048576)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
ramfs on /run/credentials/systemd-sysctl.service type ramfs (ro,nosuid,nodev,noexec,relatime,seclabel,mode=700)
ramfs on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,seclabel,mode=700)
ramfs on /run/credentials/systemd-tmpfiles-setup-dev.service type ramfs (ro,nosuid,nodev,noexec,relatime,seclabel,mode=700)
/dev/sdd on /srv/dockerbackups type ext4 (rw,relatime,seclabel)
/dev/sde on /srv/dockerlogs type ext4 (rw,relatime,seclabel)
/dev/sdb on /var/lib/docker type ext4 (rw,relatime,seclabel)
/dev/sdc on /srv/dockercompose type ext4 (rw,relatime,seclabel)
/dev/sdf on /srv/dockervolumes type ext4 (rw,relatime,seclabel)
sysext on /usr type overlay (ro,relatime,seclabel,lowerdir=/run/systemd/sysext/meta/usr:/run/systemd/sysext/extensions/docker-flatcar/usr:/run/systemd/sysext/extensions/containerd-flatcar/usr:/usr)
ramfs on /run/credentials/systemd-tmpfiles-setup.service type ramfs (ro,nosuid,nodev,noexec,relatime,seclabel,mode=700)
/dev/sda1 on /boot type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro)
hiveadmin@jenkinsdockerslave-n1 ~ $
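
/boot appears twice in the mount output above: first as a systemd autofs trigger, later as the vfat filesystem on /dev/sda1. A small filter can pick out the filesystem type that is actually on top. This is a sketch only; `fstype_of` is a hypothetical helper that parses the "DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)" lines printed by mount.

```shell
# Read `mount` output on stdin and print the filesystem type of the last
# entry mounted at MOUNTPOINT (later entries sit on top of earlier ones).
fstype_of() {
    awk -v mp="$1" '
        $2 == "on" && $3 == mp && $4 == "type" { t = $5 }
        END { if (t != "") print t }'
}
```

With the output above, `mount | fstype_of /boot` would print vfat, the FAT filesystem that fsck.fat operates on.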

bignay2000 commented 4 weeks ago

hiveadmin@jenkinsdockerslave-n1 ~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           1.6G  528K  1.6G   1% /run
/dev/sda9        13G  1.4G   11G  12% /
sysext          3.9G   12K  3.9G   1% /usr
/dev/sda6       128M  936K  123M   1% /oem
overlay          13G  1.4G   11G  12% /etc
tmpfs           3.9G     0  3.9G   0% /media
tmpfs           3.9G   32K  3.9G   1% /tmp
/dev/sdd         32G   24K   30G   1% /srv/dockerbackups
/dev/sde         32G  124M   30G   1% /srv/dockerlogs
/dev/sdb         32G  9.7G   20G  33% /var/lib/docker
/dev/sdc        974M  208K  907M   1% /srv/dockercompose
/dev/sdf        9.8G   24K  9.3G   1% /srv/dockervolumes
/dev/sda1       127M  127M  1.0K 100% /boot
hiveadmin@jenkinsdockerslave-n1 ~ $

bignay2000 commented 4 weeks ago

Deleted the 53 MB /boot/FSCK0000.REC file.

jenkinsdockerslave-n1 /boot # df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           1.6G  544K  1.6G   1% /run
/dev/sda9        13G  1.4G   11G  12% /
sysext          3.9G   12K  3.9G   1% /usr
/dev/sda6       128M  936K  123M   1% /oem
overlay          13G  1.4G   11G  12% /etc
tmpfs           3.9G     0  3.9G   0% /media
tmpfs           3.9G   32K  3.9G   1% /tmp
/dev/sdd         32G   24K   30G   1% /srv/dockerbackups
/dev/sde         32G  124M   30G   1% /srv/dockerlogs
/dev/sdb         32G  9.7G   20G  33% /var/lib/docker
/dev/sdc        974M  208K  907M   1% /srv/dockercompose
/dev/sdf        9.8G   24K  9.3G   1% /srv/dockervolumes
/dev/sda1       127M   74M   53M  59% /boot
tmpfs           795M     0  795M   0% /run/user/0
jenkinsdockerslave-n1 /boot # 

Update found

jenkinsdockerslave-n1 /boot # update_engine_client -update
I0603 02:06:59.785533  3637 update_engine_client.cc:251] Initiating update check and install.
I0603 02:06:59.787294  3637 update_engine_client.cc:256] Waiting for update to complete.
LAST_CHECKED_TIME=1717380420
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_UPDATE_AVAILABLE
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.130196
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.270368
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.400529
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.540701
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.680874
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.821046
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.931182
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_FINALIZING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
I0603 02:07:49.896826  3637 update_engine_client.cc:198] Update succeeded -- reboot needed.
jenkinsdockerslave-n1 /boot # 
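
For scripting around output like the above, the KEY=VALUE status lines can be reduced to the final value of a single field. A minimal sketch; `last_status_field` is a hypothetical helper, not part of update_engine.

```shell
# Read update_engine_client output on stdin and print the last value seen
# for KEY. Log lines without a matching "KEY=" prefix are ignored.
last_status_field() {
    awk -F= -v key="$1" '
        $1 == key { v = $2 }
        END { if (v != "") print v }'
}
```

For instance, `update_engine_client -status | last_status_field CURRENT_OP` would give UPDATE_STATUS_UPDATED_NEED_REBOOT after a run like this one, whereas the failed attempt at the top of the issue ended in UPDATE_STATUS_IDLE with NEW_VERSION=0.0.0.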

Disk Free after update

jenkinsdockerslave-n1 /boot # df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           1.6G  548K  1.6G   1% /run
/dev/sda9        13G  1.4G   11G  12% /
sysext          3.9G   12K  3.9G   1% /usr
/dev/sda6       128M  936K  123M   1% /oem
overlay          13G  1.4G   11G  12% /etc
tmpfs           3.9G     0  3.9G   0% /media
tmpfs           3.9G   32K  3.9G   1% /tmp
/dev/sdd         32G   24K   30G   1% /srv/dockerbackups
/dev/sde         32G  124M   30G   1% /srv/dockerlogs
/dev/sdb         32G  9.7G   20G  33% /var/lib/docker
/dev/sdc        974M  208K  907M   1% /srv/dockercompose
/dev/sdf        9.8G   24K  9.3G   1% /srv/dockervolumes
/dev/sda1       127M  111M   16M  88% /boot
tmpfs           795M     0  795M   0% /run/user/0
jenkinsdockerslave-n1 /boot # 
reboot
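
The df -h figures in this comment can be put side by side to see how much room the update needed on /boot. This is back-of-the-envelope arithmetic on the rounded values df printed, so the results are approximate.

```shell
# Values in MB, copied from the /boot lines of the df -h outputs above.
total_mb=127
used_with_rec_mb=127       # before deleting FSCK0000.REC
used_without_rec_mb=74     # after deleting it
used_after_update_mb=111   # after the update to 3815.2.3

rec_mb=$((used_with_rec_mb - used_without_rec_mb))
update_mb=$((used_after_update_mb - used_without_rec_mb))
free_with_rec_mb=$((total_mb - used_with_rec_mb))

echo "REC file: ${rec_mb} MB, update needed: ${update_mb} MB, free with REC present: ${free_with_rec_mb} MB"
```

By these numbers the update wrote roughly 37 MB to /boot, while essentially nothing was free as long as the 53 MB recovery file was present.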

Updated successfully:

Flatcar Container Linux by Kinvolk stable 3815.2.3
hiveadmin@jenkinsdockerslave-n1 ~ $ cat /etc/os-release 
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3815.2.3
VERSION_ID=3815.2.3
BUILD_ID=2024-05-21-1124
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3815.2.3 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3815.2.3:*:*:*:*:*:*:*"
hiveadmin@jenkinsdockerslave-n1 ~ $ 

bignay2000 commented 4 weeks ago

I found that /etc/ssh/sshd_config had an invalid line and no longer matched the Ignition settings. Once I updated it to match, scp from the MacBook started working. So the scp failure is out of scope for this ticket.

bignay2000 commented 4 weeks ago

Zip of all journal logs from Mar 03 23:13:28 to the current date: all_journalctl_logs_jenkinsdockerslave-n1.txt.zip