Open bignay2000 opened 1 month ago
Hello,
There are two issues here that need to be checked.
Firstly, as a baseline, I verified the upgrade path from 3815.2.1 to stable 3815.2.3: /boot was about 47% full before the update and 89% full after it. This means that either your /boot partition was already more than about 50% full before the update, or something happened during the update process that consumed more space than it should have. Can you confirm that the used space of /boot before the update was around 47%? Also, please provide the output of this tree command so we can investigate what is consuming that space (more info at https://github.com/flatcar/scripts/pull/1731):
tree -sifF /boot | sort -k2 -rn
Secondly, the update client should be fixed to error out and present a descriptive error message when this situation is happening, and then revert the update.
Can you please provide the url to the image you first used to see if the issue can be properly reproduced? I assume you have Proxmox with KVM as a hypervisor.
Thank you.
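Until the update client reports this condition itself, a manual pre-flight check along these lines may help. This is only a sketch: the function name `has_free_kb` and the threshold in the example are hypothetical, and nothing here is part of update_engine. For scale, the `df -h` outputs later in this thread show /boot usage growing from 74M to 111M during the update.

```shell
# Succeeds if the filesystem holding DIR has at least NEED_KB KiB available.
# NEED_KB is a caller-chosen threshold, not a value defined by Flatcar.
has_free_kb() {
    dir="$1"
    need_kb="$2"
    # df -Pk prints one portable-format line per filesystem;
    # column 4 is the available space in KiB.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge "$need_kb" ]
}

# Example (hypothetical 40 MiB threshold):
#   has_free_kb /boot 40960 || echo "/boot is low on space" >&2
```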
@bignay2000 you can install the tree unix binary from an Ubuntu 22.04 (apt install tree) and just copy it over to Flatcar via SCP (/usr/bin/tree), it will work out of the box as it is a small static binary.
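If copying tree over is not practical, a similar size-sorted listing can probably be produced with tools already on the machine. The sketch below assumes GNU find (for `-printf`) is available; the function name `list_by_size` is made up for illustration, and reading /boot may require root.

```shell
# Print regular files under DIR as "<bytes> <path>", largest first.
# Assumes GNU find; -xdev keeps the walk on a single filesystem.
list_by_size() {
    dir="${1:-/boot}"
    find "$dir" -xdev -type f -printf '%s %p\n' | sort -k1,1 -rn
}

# Example:
#   list_by_size /boot | head -n 20
```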
scp fails to be able to copy a file from my Mac to this server, so not able to get tree on the server.
I was able to run "tar -cvpzf boot.tar.gz boot" to archive and gzip the boot folder, and then transfer the file via HTTP. Below is a screenshot.
Any idea what this FSCK0000.REC file does and any idea why it's dated 1/1/1980?
File system consistency check: fsck appears to have run and left this file behind. https://en.wikipedia.org/wiki/Fsck - per this article, Btrfs can trigger an automatic consistency check and then leave this behind.
So it looks like there needs to be a third fix here, to write *.REC files somewhere else on the drive?
I am trying to leave this server broken so we can figure out root cause.. so don't want to fix the scp issue trying to get tree on the file system... Let me know if I can provide more information
It is great that you found the file that gobbled up the space; tree was just a suggestion to reach the same outcome.
The only thing that I can think of is a data corruption issue caused by an unknown hardware problem or possibly transient network issues.
Maybe @jepio has more insight on this change and if it could have changed anything if it is a software issue (https://github.com/flatcar/update_engine/commit/53262f6ac6e641777526815d35eff0154f688500#diff-3f98f459f8b031989459690e30f38f48446d57c3355aaec9badbb98621c4832dR53)?
journalctl might hold some information on the matter, although that is unlikely if the instance has been rebooted. Still, it is worth a try. Can you please share the full journalctl log? If it is not possible to share the full log, maybe you can grep for the words btrfs, fsck or REC.
Thanks.
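One possible way to do that filtering is sketched below. The function name `filter_fs_events` is hypothetical, and the pattern matches `.rec` rather than the bare word REC so that unrelated words such as "record" do not match; widen it if that hides something relevant.

```shell
# Keep only lines mentioning btrfs, fsck, or a ".rec" file name
# (case-insensitive). Reads stdin, writes matching lines to stdout.
filter_fs_events() {
    grep -iE 'btrfs|fsck|\.rec'
}

# Example:
#   journalctl --no-pager | filter_fs_events
```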
Also, note the size of vmlinuz-a, which I think was vmlinuz-b during the upgrade: it is trimmed down to only 16MB, which suggests that the FSCK0000.REC file was already there before the upgrade was attempted.
@ader1990 the btrfs change is for mounting the /usr partitions, I don't think it can make a difference here.
@bignay2000 can you share the outputs of df -h and mount.
/boot/ is a FAT filesystem (has to be due to EFI requirements), the REC file can be created by fsck.vfat (https://man7.org/linux/man-pages/man8/fsck.vfat.8.html):
fsck0000.rec, fsck0001.rec, ...
When recovering from a corrupted filesystem, fsck.fat dumps
recovered data into files named fsckNNNN.rec in the top level
directory of the filesystem.
Did your machine crash during an update? Check journalctl -g fsck.
Otherwise I don't think there is any more forensics to do here: remove the REC file (we may want to handle this case better by automatically cleaning them up) and then retry the update.
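To spot such leftovers before retrying, something like the following could be used. It is a sketch, not an existing Flatcar tool: the function name `find_rec_files` is hypothetical, and the pattern follows the fsckNNNN.rec naming from the man page excerpt above, matched case-insensitively because the file in this thread appeared as FSCK0000.REC.

```shell
# Print fsck.fat recovery files (fsckNNNN.rec, any letter case) found in
# the top level of DIR. Nothing is deleted; review the list first.
find_rec_files() {
    dir="${1:-/boot}"
    find "$dir" -maxdepth 1 -type f -iname 'fsck[0-9][0-9][0-9][0-9].rec'
}

# Example:
#   find_rec_files /boot
```

Listing only, and leaving removal as a separate manual step, keeps the check safe to run on a machine that is still being investigated.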
hiveadmin@jenkinsdockerslave-n1 ~ $ mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=4096k,nr_inodes=1010827,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,size=1627260k,nr_inodes=819200,mode=755)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime,seclabel)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
/dev/sda9 on / type ext4 (rw,relatime,seclabel)
/dev/mapper/usr on /usr type btrfs (ro,relatime,seclabel,rescue=nologreplay,space_cache=v2,subvolid=5,subvol=/)
/dev/sda6 on /oem type btrfs (rw,nodev,relatime,seclabel,space_cache=v2,subvolid=5,subvol=/)
overlay on /etc type overlay (rw,noatime,seclabel,lowerdir=/sysroot/usr/share/flatcar/etc,upperdir=/sysroot/etc,workdir=/sysroot/.etc-work,metacopy=off)
selinuxfs on /sys/fs/selinux type selinuxfs (rw,nosuid,noexec,relatime)
systemd-1 on /boot type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=14618)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=12276)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime,seclabel)
tmpfs on /media type tmpfs (rw,nosuid,nodev,noexec,relatime,seclabel)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime,seclabel)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime,seclabel)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,seclabel,size=4068152k,nr_inodes=1048576)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
ramfs on /run/credentials/systemd-sysctl.service type ramfs (ro,nosuid,nodev,noexec,relatime,seclabel,mode=700)
ramfs on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,seclabel,mode=700)
ramfs on /run/credentials/systemd-tmpfiles-setup-dev.service type ramfs (ro,nosuid,nodev,noexec,relatime,seclabel,mode=700)
/dev/sdd on /srv/dockerbackups type ext4 (rw,relatime,seclabel)
/dev/sde on /srv/dockerlogs type ext4 (rw,relatime,seclabel)
/dev/sdb on /var/lib/docker type ext4 (rw,relatime,seclabel)
/dev/sdc on /srv/dockercompose type ext4 (rw,relatime,seclabel)
/dev/sdf on /srv/dockervolumes type ext4 (rw,relatime,seclabel)
sysext on /usr type overlay (ro,relatime,seclabel,lowerdir=/run/systemd/sysext/meta/usr:/run/systemd/sysext/extensions/docker-flatcar/usr:/run/systemd/sysext/extensions/containerd-flatcar/usr:/usr)
ramfs on /run/credentials/systemd-tmpfiles-setup.service type ramfs (ro,nosuid,nodev,noexec,relatime,seclabel,mode=700)
/dev/sda1 on /boot type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro)
hiveadmin@jenkinsdockerslave-n1 ~ $
hiveadmin@jenkinsdockerslave-n1 ~ $ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 4.0M 0 4.0M 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 1.6G 528K 1.6G 1% /run
/dev/sda9 13G 1.4G 11G 12% /
sysext 3.9G 12K 3.9G 1% /usr
/dev/sda6 128M 936K 123M 1% /oem
overlay 13G 1.4G 11G 12% /etc
tmpfs 3.9G 0 3.9G 0% /media
tmpfs 3.9G 32K 3.9G 1% /tmp
/dev/sdd 32G 24K 30G 1% /srv/dockerbackups
/dev/sde 32G 124M 30G 1% /srv/dockerlogs
/dev/sdb 32G 9.7G 20G 33% /var/lib/docker
/dev/sdc 974M 208K 907M 1% /srv/dockercompose
/dev/sdf 9.8G 24K 9.3G 1% /srv/dockervolumes
/dev/sda1 127M 127M 1.0K 100% /boot
hiveadmin@jenkinsdockerslave-n1 ~ $
Deleted the /boot/FSCK0000.REC 53 MB file
jenkinsdockerslave-n1 /boot # df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 4.0M 0 4.0M 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 1.6G 544K 1.6G 1% /run
/dev/sda9 13G 1.4G 11G 12% /
sysext 3.9G 12K 3.9G 1% /usr
/dev/sda6 128M 936K 123M 1% /oem
overlay 13G 1.4G 11G 12% /etc
tmpfs 3.9G 0 3.9G 0% /media
tmpfs 3.9G 32K 3.9G 1% /tmp
/dev/sdd 32G 24K 30G 1% /srv/dockerbackups
/dev/sde 32G 124M 30G 1% /srv/dockerlogs
/dev/sdb 32G 9.7G 20G 33% /var/lib/docker
/dev/sdc 974M 208K 907M 1% /srv/dockercompose
/dev/sdf 9.8G 24K 9.3G 1% /srv/dockervolumes
**/dev/sda1 127M 74M 53M 59% /boot**
tmpfs 795M 0 795M 0% /run/user/0
jenkinsdockerslave-n1 /boot #
Update found
jenkinsdockerslave-n1 /boot # update_engine_client -update
I0603 02:06:59.785533 3637 update_engine_client.cc:251] Initiating update check and install.
I0603 02:06:59.787294 3637 update_engine_client.cc:256] Waiting for update to complete.
LAST_CHECKED_TIME=1717380420
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_UPDATE_AVAILABLE
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.130196
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.270368
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.400529
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.540701
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.680874
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.821046
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.931182
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_FINALIZING
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
LAST_CHECKED_TIME=1717380420
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT
NEW_VERSION=3815.2.3
NEW_SIZE=458187266
I0603 02:07:49.896826 3637 update_engine_client.cc:198] Update succeeded -- reboot needed.
jenkinsdockerslave-n1 /boot #
Disk Free after update
jenkinsdockerslave-n1 /boot # df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 4.0M 0 4.0M 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 1.6G 548K 1.6G 1% /run
/dev/sda9 13G 1.4G 11G 12% /
sysext 3.9G 12K 3.9G 1% /usr
/dev/sda6 128M 936K 123M 1% /oem
overlay 13G 1.4G 11G 12% /etc
tmpfs 3.9G 0 3.9G 0% /media
tmpfs 3.9G 32K 3.9G 1% /tmp
/dev/sdd 32G 24K 30G 1% /srv/dockerbackups
/dev/sde 32G 124M 30G 1% /srv/dockerlogs
/dev/sdb 32G 9.7G 20G 33% /var/lib/docker
/dev/sdc 974M 208K 907M 1% /srv/dockercompose
/dev/sdf 9.8G 24K 9.3G 1% /srv/dockervolumes
/dev/sda1 127M 111M 16M 88% /boot
tmpfs 795M 0 795M 0% /run/user/0
jenkinsdockerslave-n1 /boot #
reboot
Updated successfully:
Flatcar Container Linux by Kinvolk stable 3815.2.3
hiveadmin@jenkinsdockerslave-n1 ~ $ cat /etc/os-release
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3815.2.3
VERSION_ID=3815.2.3
BUILD_ID=2024-05-21-1124
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3815.2.3 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3815.2.3:*:*:*:*:*:*:*"
hiveadmin@jenkinsdockerslave-n1 ~ $
I found that /etc/ssh/sshd_config had an invalid line and no longer matched the Ignition settings. Once I updated it to match, scp from the MacBook started working, so the scp failure is out of scope for this ticket.
zip of all journal logs starting Mar 03 23:13:28 to current date. all_journalctl_logs_jenkinsdockerslave-n1.txt.zip
Description
update_engine_client -update incorrectly reports no update available when trying to update from stable 3815.2.1 to stable 3815.2.3.
flatcar-update --to-version 3815.2.3 correctly fails but with a generic "Error: update failed" error.
df -h shows /boot is 100% used. Found /boot/FSCK0000.REC 53 MB extra file as compared to other servers. /boot is only 127 MB total space.
Impact
Server is configured to check for updates on Thursdays and, if one is found, apply it and reboot. This server missed 2 updates: it should have automatically applied stable 3815.2.2 and then 3815.2.3.
Unable to manually update to stable 3815.2.3.
Environment and steps to reproduce
Server was built on Sun Sep 3 16:57:33 2023. Flatcar Container Linux by Kinvolk stable 3815.2.1. Flatcar runs in a VM on Proxmox 8.2.2.
Note that FSCK can be configured to only run every X number of reboots, so may need to see if Flatcar runs FSCK on every boot or what triggers it.
Expected behavior
Additional information