coreos / rpm-ostree

⚛📦 Hybrid image/package system with atomic upgrades and package layering
https://coreos.github.io/rpm-ostree

Silverblue regression: EFI+Mac: bootloader write error #1380

Open twhiston opened 6 years ago

twhiston commented 6 years ago

Host system details

rpm-ostree status                                                                                                
State: idle; auto updates disabled
Deployments:
● ostree://pirate-28:fedora/28/x86_64/workstation
                   Version: 28.20180527.0 (2018-05-27 19:21:56)
                BaseCommit: 0bd9f83f4b0b849f7ee91ccf88a2abb10bee8ce0ecb7938bb71a5b327944ab2a
              GPGSignature: Valid signature by 128CF232A9371991C8A65695E08E7E629DB62FB1
           LayeredPackages: compat-ffmpeg28 docker-compose ffmpeg-libs flatpak-builder libselinux-python util-linux-user zsh
             LocalPackages: rpmfusion-free-release-28-1.noarch

Expected vs actual behavior

Currently I can't perform any actions due to an error with the Bootloader write config. Any ideas how I can fix that? Unfortunately I'm not at all sure how it happened.

rpm-ostree upgrade/install/uninstall
...
Writing rpmdb... done
Writing OSTree commit... done
Copying /etc changes: 25 modified, 0 removed, 81 added
error: Bootloader write config: unlink(/boot/efi/EFI/fedora/grub.cfg.new): Read-only file system

cgwalters commented 6 years ago

Does mount -o remount,rw /boot/efi help? Any obvious error messages in dmesg|grep -i efi or journalctl -b -r | grep -i efi?

twhiston commented 6 years ago

Hi, thanks for the reply. sudo mount -o remount,rw /boot/efi always ends up remounting it as ro:

grep /boot/efi /proc/mounts                                      
/dev/sda1 /boot/efi hfsplus ro,relatime,umask=22,uid=0,gid=0,nls=utf8 0 0
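
A minimal Python sketch of the same check, for anyone who wants to script it: it reads mounts-format text and reports whether a mount point is ro or rw. The names mount_mode and SAMPLE are made up for illustration, and the sketch assumes mount points without escaped characters.

```python
from typing import Optional

# Sample line in /proc/mounts format, copied from the output above.
SAMPLE = "/dev/sda1 /boot/efi hfsplus ro,relatime,umask=22,uid=0,gid=0,nls=utf8 0 0\n"


def mount_mode(mounts_text: str, mount_point: str) -> Optional[str]:
    """Return "ro" or "rw" for mount_point, or None if it is not listed."""
    mode = None
    for line in mounts_text.splitlines():
        # Fields: device, mount point, filesystem type, options, dump, pass.
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        options = fields[3].split(",")
        # Keep the last matching entry: a later mount shadows an earlier one.
        if "ro" in options:
            mode = "ro"
        elif "rw" in options:
            mode = "rw"
    return mode


if __name__ == "__main__":
    print(mount_mode(SAMPLE, "/boot/efi"))
```

On a live system, the text would come from reading /proc/mounts instead of SAMPLE.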

dmesg|grep -i efi doesn't show anything super obvious, to me at least, but it did remind me to mention that this is running on a Mac laptop where I also have a small OS X partition (that I never use!)

dmesg|grep -i efi                                                
[    0.000000] efi: EFI v1.10 by Apple
[    0.000000] efi:  ACPI=0x78d8e000  ACPI 2.0=0x78d8e014  SMBIOS=0x78f8c000 
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.079966] pci 0000:00:02.0: BAR 2: assigned to efifb
[    0.100107] Registered efivars operations
[    0.807201] efifb: probing for efifb
[    0.807214] efifb: framebuffer at 0x90000000, using 20700k, total 20700k
[    0.807215] efifb: mode is 2880x1800x32, linelength=11776, pages=1
[    0.807216] efifb: scrolling: redraw
[    0.807217] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[    0.814320] fb0: EFI VGA frame buffer device
[    0.938418] Loaded UEFI:MokListRT cert 'Fedora Secure Boot CA: fde32599c2d61db1bf5807335d7b20e4cd963b42' linked to secondary sys keyring
[    1.695026] tsc: Refined TSC clocksource calibration: 2194.917 MHz
[    1.873456] fb: switching to inteldrmfb from EFI VGA

journalctl similarly offers no hints:

Jun 02 13:34:28 localhost.localdomain sudo[4708]: twhiston : TTY=pts/0 ; PWD=/var/home/twhiston ; USER=root ; COMMAND=/bin/mount -o remount,rw /boot/efi
Jun 02 13:32:00 localhost.localdomain sudo[4154]: twhiston : TTY=pts/0 ; PWD=/var/home/twhiston ; USER=root ; COMMAND=/sbin/fsck /boot/efi
Jun 02 13:30:35 localhost kernel: fb: switching to inteldrmfb from EFI VGA
Jun 02 13:30:35 localhost kernel: tsc: Refined TSC clocksource calibration: 2194.917 MHz
Jun 02 13:30:35 localhost kernel: Loaded UEFI:MokListRT cert 'Fedora Secure Boot CA: fde32599c2d61db1bf5807335d7b20e4cd963b42' linked to secondary sys keyring
Jun 02 13:30:35 localhost kernel: fb0: EFI VGA frame buffer device
Jun 02 13:30:35 localhost kernel: efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
Jun 02 13:30:35 localhost kernel: efifb: scrolling: redraw
Jun 02 13:30:35 localhost kernel: efifb: mode is 2880x1800x32, linelength=11776, pages=1
Jun 02 13:30:35 localhost kernel: efifb: framebuffer at 0x90000000, using 20700k, total 20700k
Jun 02 13:30:35 localhost kernel: efifb: probing for efifb
Jun 02 13:30:35 localhost kernel: Registered efivars operations
Jun 02 13:30:35 localhost kernel: pci 0000:00:02.0: BAR 2: assigned to efifb
Jun 02 13:30:35 localhost kernel: clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
Jun 02 13:30:35 localhost kernel: efi:  ACPI=0x78d8e000  ACPI 2.0=0x78d8e014  SMBIOS=0x78f8c000 
Jun 02 13:30:35 localhost kernel: efi: EFI v1.10 by Apple

cgwalters commented 6 years ago

First, to clarify: is this a regression from an existing installation, or did you do a new install that didn't work? Since you had layered packages I assumed regression, but it's not clear.

/dev/sda1 /boot/efi hfsplus

I'm not a macOS expert, but that looks wrong; AFAIU the ESP should still be FAT. I'm not even sure about Linux support for writing to hfsplus; a quick glance at the kernel docs shows:

  force
    Used to force write access to volumes that are marked as journalled
    or locked.  Use at your own risk.

Do you have a /boot/efi/efi mount too? In this case it's probably a variant of https://bugzilla.redhat.com/show_bug.cgi?id=1575957 ?

twhiston commented 6 years ago

Hi, yes, definitely a regression. I've been running Atomic for a while, even rebasing between Fedora 27 and 28, which continued to work perfectly until the last time I tried to update.

My /boot/efi looks as follows:

drwxr-xr-x. 1 root   4 Jan  1  1970  EFI/
drwx------. 1 root   5 Apr 19 18:48  .fseventsd/
dr-xr-xr-t. 1 root   2 Mar 16 16:54 '.HFS+ Private Directory Data'$'\r'/
-rw-r--r--. 1 root  34 Jan  1  1970  mach_kernel
drwx------. 1 root   5 Mar 31 11:03  .Spotlight-V100/
drwxr-xr-x. 1 root   3 Jan  1  1970  System/
-rw-r--r--. 1 root 27K Mar 16 16:57  .VolumeIcon.icns

which looks very OS X. Inside the EFI folder I see:

cd /boot/efi/EFI                                                 
total 0
drwxr-xr-x. 1 root  5 Jan  1  1970 BOOT/
drwxr-xr-x. 1 root 18 May 28 11:42 fedora/

cgwalters commented 6 years ago

What's your /etc/fstab?

I can't offhand think of something that changed in either ostree or rpm-ostree that could cause this. Other possible culprits are systemd and grub2 (as well as the kernel).

twhiston commented 6 years ago

fstab is as follows

#
# /etc/fstab
# Created by anaconda on Fri Mar 16 16:57:27 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/fedora-root /                       ext4    defaults        1 1
UUID=03c1f089-ecb7-4802-ad25-d90fd1098284 /boot                   ext4    defaults        1 2
UUID=bd8cd7b1-9b91-39fe-95c4-f3832cdad017 /boot/efi               hfsplus defaults        0 2
/dev/mapper/fedora-home /home                   ext4    defaults        1 2
/dev/mapper/fedora-var  /var                    ext4    defaults        1 2
/dev/mapper/fedora-swap swap                    swap    defaults        0 0

martinezjavier commented 6 years ago

@twhiston is it also mounted ro when you boot your previous OSTree deployment (where it must have been mounted rw, since you could upgrade)?

I'm not familiar with macOS, but have you seen this? AFAIU, hfsplus can only be mounted rw if journaling is disabled. Could you boot into your macOS partition and use the utility mentioned there to disable journaling if it is activated?

Another option is to use the force option when doing a remount, as @cgwalters suggested:

sudo mount -o remount,rw,force /boot/efi

But the doc says "Use at your own risk", so I would attempt the former first.

twhiston commented 6 years ago

@martinezjavier journaling was already disabled, and sudo mount -o remount,rw,force /boot/efi was remounting as ro anyway. However, I went back to the previous deployment. The first time it wouldn't let me upgrade; I rebooted it again and it sent me to emergency mode, and after I continued from that I was allowed to upgrade it! So now my problem is solved, which is super cool :D, but I still have no idea why it happened in the first place.

cgwalters commented 6 years ago

I still suspect (but am not sure) that /boot/efi should really be vfat, not hfsplus. Now that the system is working again, is the mount table different?

twhiston commented 6 years ago

mount table still shows hfsplus, but now mounted with rw

☿ cat /etc/fstab       
#
# /etc/fstab
# Created by anaconda on Fri Mar 16 16:57:27 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/fedora-root /                       ext4    defaults        1 1
UUID=03c1f089-ecb7-4802-ad25-d90fd1098284 /boot                   ext4    defaults        1 2
UUID=bd8cd7b1-9b91-39fe-95c4-f3832cdad017 /boot/efi               hfsplus defaults        0 2
/dev/mapper/fedora-home /home                   ext4    defaults        1 2
/dev/mapper/fedora-var  /var                    ext4    defaults        1 2
/dev/mapper/fedora-swap swap                    swap    defaults        0 0

☿ grep /boot/efi /proc/mounts                                                                                                                                                            
/dev/sda1 /boot/efi hfsplus rw,relatime,umask=22,uid=0,gid=0,nls=utf8 0 0

twhiston commented 6 years ago

So this just happened again, and rolling back did not solve the issue this time. As above, the mounts are back to showing /dev/sda1 /boot/efi hfsplus ro,relatime,umask=22,uid=0,gid=0,nls=utf8 0 0

cgwalters commented 6 years ago

The linux source I have here shows:

static int hfsplus_remount(struct super_block *sb, int *flags, char *data)
{
    sync_filesystem(sb);
    if ((bool)(*flags & SB_RDONLY) == sb_rdonly(sb))
        return 0;
    if (!(*flags & SB_RDONLY)) {
        struct hfsplus_vh *vhdr = HFSPLUS_SB(sb)->s_vhdr;
        int force = 0;

        if (!hfsplus_parse_options_remount(data, &force))
            return -EINVAL;

        if (!(vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_UNMNT))) {
            pr_warn("filesystem was not cleanly unmounted, running fsck.hfsplus is recommended.  leaving read-only.\n");
            sb->s_flags |= SB_RDONLY;
            *flags |= SB_RDONLY;
        } else if (force) {
            /* nothing */
        } else if (vhdr->attributes &
                cpu_to_be32(HFSPLUS_VOL_SOFTLOCK)) {
            pr_warn("filesystem is marked locked, leaving read-only.\n");
            sb->s_flags |= SB_RDONLY;
            *flags |= SB_RDONLY;
        } else if (vhdr->attributes &
                cpu_to_be32(HFSPLUS_VOL_JOURNALED)) {
            pr_warn("filesystem is marked journaled, leaving read-only.\n");
            sb->s_flags |= SB_RDONLY;
            *flags |= SB_RDONLY;
        }
    }
    return 0;
}

Based on that, it looks to me like one cause of this can be the filesystem not being cleanly unmounted. Are you dual booting OS X? Are unclean shutdowns involved here?
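
The branch order in the quoted hfsplus_remount matters for the force attempts earlier in this thread: the clean-unmount check comes before the force check. Below is a minimal Python model of that decision order. The enum and function names are made up for illustration, and the kernel source quoted above remains the authority.

```python
from enum import Enum


class Outcome(Enum):
    READ_WRITE = "remounted read-write"
    NOT_CLEANLY_UNMOUNTED = "filesystem was not cleanly unmounted, leaving read-only"
    SOFT_LOCKED = "filesystem is marked locked, leaving read-only"
    JOURNALED = "filesystem is marked journaled, leaving read-only"


def remount_rw_outcome(cleanly_unmounted: bool, soft_locked: bool,
                       journaled: bool, force: bool) -> Outcome:
    """Model the if/else chain of the quoted function for an ro -> rw remount."""
    # HFSPLUS_VOL_UNMNT not set: checked first, so force cannot override it.
    if not cleanly_unmounted:
        return Outcome.NOT_CLEANLY_UNMOUNTED
    # force skips the lock and journal checks that follow.
    if force:
        return Outcome.READ_WRITE
    if soft_locked:
        return Outcome.SOFT_LOCKED
    if journaled:
        return Outcome.JOURNALED
    return Outcome.READ_WRITE


if __name__ == "__main__":
    # A non-journaled volume that was not cleanly unmounted, remounted with force.
    print(remount_rw_outcome(False, False, False, True).value)
```

Under this reading, mount -o remount,rw,force on a volume that lacks the clean-unmount flag still ends up read-only, which would be consistent with the behavior reported above.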

cgwalters commented 6 years ago

And do any of the strings shown by pr_warn show up in your dmesg? For example, dmesg | grep read-only
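
A rough Python sketch of that search, matching the three pr_warn strings from the function quoted above rather than just "read-only". The names HFSPLUS_WARNINGS and find_hfsplus_warnings are hypothetical, and the log text is assumed to be captured dmesg output.

```python
# Message prefixes taken from the pr_warn calls in the quoted hfsplus_remount.
HFSPLUS_WARNINGS = (
    "filesystem was not cleanly unmounted",
    "filesystem is marked locked",
    "filesystem is marked journaled",
)


def find_hfsplus_warnings(log_text: str) -> list:
    """Return the log lines that contain one of the hfsplus remount warnings."""
    return [
        line
        for line in log_text.splitlines()
        if any(warning in line for warning in HFSPLUS_WARNINGS)
    ]
```

The input could come from something like subprocess.run(["dmesg"], capture_output=True, text=True).stdout, which typically needs root on systems that restrict dmesg.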

miabbott commented 6 years ago

I believe I've hit this as well using Fedora 28 Atomic Host on a Mac Mini.

Like @twhiston I've been able to upgrade this host before (I have two deployments listed), but the latest attempt to rpm-ostree upgrade failed writing out the bootloader.

Overall, my experience and details match what has been reported already. Same error message, same mount of /boot/efi using hfsplus. The only difference is that I am not dual-booting into OS X; I only have F28AH installed.

I will note that there have probably been one or two unclean shutdowns, as I've come to the office to find that little system mysteriously powered off. Perhaps due to a power blip or maybe even faulty hardware (this Mac Mini is circa 2010).

However, I was able to make some progress by unmounting /boot/efi, using fsck.hfsplus on the partition, and remounting /boot/efi as rw. This required me to use ostree admin unlock to temporarily install hfsplus-tools.

Of course, I decided to do this remotely and now the host isn't responding, so it might not have been a silver bullet solution...

I'll revisit this issue once I am back in the office and have physical access to the host.

# rpm-ostree status                                                                                                   
State: idle; auto updates enabled (check; last run 19h ago)
Deployments:                                                                                                                          
● ostree://fedora-atomic:fedora/28/x86_64/atomic-host          
                   Version: 28.20180527.0 (2018-05-27 19:05:29)
                    Commit: 291ea90da29bc5abe757b5a50813b3de1396b08412939a89b3b671aba9856093
              GPGSignature: Valid signature by 128CF232A9371991C8A65695E08E7E629DB62FB1

  ostree://fedora-atomic:fedora/28/x86_64/atomic-host              
                   Version: 28.20180515.1 (2018-05-15 16:32:35)     
                    Commit: a29367c58417c28e2bd8306c1f438b934df79eba13706e078fe8564d9e0eb32b
              GPGSignature: Valid signature by 128CF232A9371991C8A65695E08E7E629DB62FB1

Available update:                                                                              
        Version: 28.20180625.0 (2018-06-25 08:58:55)                                
         Commit: fbed0e26736fc189129f80e9547bfd71497377ca49f0cfd421f173667f5ea825
   GPGSignature: Valid signature by 128CF232A9371991C8A65695E08E7E629DB62FB1                        
           Diff: 104 upgraded, 2 removed, 1 added                                                  

# rpm-ostree upgrade                                           
1 metadata, 0 content objects fetched; 569 B transferred in 2 seconds                
Copying /etc changes: 15 modified, 0 removed, 40 added                
error: Bootloader write config: unlink(/boot/efi/EFI/fedora/grub.cfg.new): Read-only file system     

# sudo mount -o remount,rw,force /boot/efi                                 
# grep /boot/efi /proc/mounts                                                
/dev/sda2 /boot/efi hfsplus ro,relatime,umask=22,uid=0,gid=0,nls=utf8 0 0                

# dmesg | tail                                                                  
[678343.944765] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready                        
[678343.944804] IPv6: ADDRCONF(NETDEV_CHANGE): veth6f17f6f: link becomes ready                 
[678343.944836] docker0: port 1(veth6f17f6f) entered blocking state
[678343.944841] docker0: port 1(veth6f17f6f) entered forwarding state
[678346.244060] veth1bfeb7e: renamed from eth0
[678346.256425] docker0: port 1(veth6f17f6f) entered disabled state
[678346.377423] docker0: port 1(veth6f17f6f) entered disabled state
[678346.378412] device veth6f17f6f left promiscuous mode
[678346.378427] docker0: port 1(veth6f17f6f) entered disabled state
[678478.234184] hfsplus: filesystem was not cleanly unmounted, running fsck.hfsplus is recommended.  leaving read-only.

# ostree admin unlock                                                             
Development mode enabled.  A writable overlayfs is now mounted on /usr.                                
All changes there will be discarded on reboot.                  
[root@meatwad ~]# curl -LO https://kojipkgs.fedoraproject.org//packages/hfsplus-tools/540.1.linux3/15.fc28/x86_64/hfsplus-tools-540.1.linux3-15.fc28.x86_64.rpm
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  162k  100  162k    0     0   208k      0 --:--:-- --:--:-- --:--:--  208k
# rpm -i hfsplus-tools-540.1.linux3-15.fc28.x86_64.rpm

# umount /boot/efi                                           
# fsck.hfsplus /dev/sda2            
** /dev/sda2                                             
   Executing fsck_hfs (version 540.1-Linux).                   
** Checking non-journaled HFS Plus Volume.      
   The volume name is Linux HFS+ ESP                                    
** Checking extents overflow file.                                            
** Checking catalog file.                              
** Checking multi-linked files.                   
** Checking catalog hierarchy.                                      
** Checking extended attributes file.         
** Checking volume bitmap.                                         
** Checking volume information.                                    
** The volume Linux HFS+ ESP appears to be OK.     

# sudo mount -o rw /boot/efi
# grep /boot/efi /proc/mounts
/dev/sda2 /boot/efi hfsplus rw,relatime,umask=22,uid=0,gid=0,nls=utf8 0 0

# rpm-ostree upgrade
1 metadata, 0 content objects fetched; 569 B transferred in 2 seconds
Copying /etc changes: 15 modified, 0 removed, 40 added
Bootloader updated; bootconfig swap: yes; deployment count change: 0
        Freed: 255.5 MB (pkgcache branches: 0)
Upgraded:
  GeoIP-GeoLite-data 2018.04-1.fc28 -> 2018.06-1.fc28
  NetworkManager 1:1.10.8-1.fc28 -> 1:1.10.10-1.fc28
  NetworkManager-libnm 1:1.10.8-1.fc28 -> 1:1.10.10-1.fc28
  NetworkManager-team 1:1.10.8-1.fc28 -> 1:1.10.10-1.fc28
...
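
The recovery steps in the transcript above can be sketched as a small Python helper that only builds and prints the commands, without running them. The function name recovery_commands is made up for illustration, and the device is an assumption that differs per machine (/dev/sda2 here, /dev/sda1 elsewhere in this thread). It also assumes fsck.hfsplus (from hfsplus-tools) is already installed.

```python
import shlex


def recovery_commands(device: str, mount_point: str = "/boot/efi") -> list:
    """Return the unmount, check, remount sequence as argv lists (not executed)."""
    return [
        ["umount", mount_point],
        ["fsck.hfsplus", device],
        ["mount", "-o", "rw", mount_point],
    ]


if __name__ == "__main__":
    # Print the commands for review; run them by hand as root.
    for argv in recovery_commands("/dev/sda2"):
        print(shlex.join(argv))
```

Printing rather than executing is deliberate: the steps touch the boot partition, so checking the device name first seems the safer design.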

miabbott commented 6 years ago

Of course, I decided to do this remotely and now the host isn't responding, so it might not have been a silver bullet solution...

I'll revisit this issue once I am back in the office and have physical access to the host.

I got physical access to the host again, but wasn't getting any output from the HDMI out, so I hard rebooted the little guy. (I probably should have tried to connect a keyboard first, but in my morning haze just went with the hammer approach.) Thankfully, it came back up in the new, upgraded deployment and showed that /boot/efi was still mounted rw.

I suspect that the next time it suffers a hard power loss, the /boot/efi partition may end up corrupted/bad again and this process will need to be repeated.

twhiston commented 6 years ago

Sorry for the slow reply on this (been sick for a while). I tried to solve this in the same way that @miabbott did, but I can't install hfsplus-tools (when unlocked) because my base packages are not up to date. I always get the following message:

rpm-ostree install hfsplus-tools                                
Checking out tree 11a2c57... done
Enabled rpm-md repositories: rpmfusion-free rpmfusion-free-updates fedora updates
rpm-md repo 'rpmfusion-free' (cached); generated: 2018-04-27 09:40:17
rpm-md repo 'rpmfusion-free-updates' (cached); generated: 2018-07-05 10:06:23
rpm-md repo 'fedora' (cached); generated: 2018-04-25 04:27:32
rpm-md repo 'updates' (cached); generated: 2018-07-04 18:09:06
Importing metadata [=============] 100%
Resolving dependencies... Forbidden base package replacements:
  glibc-all-langpacks 2.27-15.fc28 -> 2.27-19.fc28 (updates)
  ostree-libs 2018.5-1.fc28 -> 2018.6-3.fc28 (updates)
  glibc-common 2.27-15.fc28 -> 2.27-19.fc28 (updates)
  elfutils-libelf 0.171-1.fc28 -> 0.173-1.fc28 (updates)
  glibc 2.27-15.fc28 -> 2.27-19.fc28 (updates)
  elfutils-libs 0.171-1.fc28 -> 0.173-1.fc28 (updates)
  flatpak 0.11.7-1.fc28 -> 0.99.2-1.fc28 (updates)

Does anyone have any idea how I can move forward with this?

jlebon commented 6 years ago

See the fourth item in https://lists.projectatomic.io/projectatomic-archives/atomic-devel/2018-February/msg00029.html. Try rpm-ostree upgrade --install hfsplus-tools instead.

twhiston commented 6 years ago

@jlebon actually, trying to install with rpm-ostree was a dead end here due to the /boot/efi ro mount. I had to get the package manually, as @miabbott did above, and then I was able to continue.

twhiston commented 6 years ago

However, once I had rebooted after the fix and upgrade, every single app I had installed from Flathub was gone. Very strange, but at least not too much work to fix!

jlebon commented 6 years ago

Ahh, I see you have a /var mountpoint in your fstab. You're likely hitting https://github.com/ostreedev/ostree/issues/1667. See workarounds there!

dustymabe commented 6 years ago

[root@localhost Downloads]# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● ostree://home:fedora/28/x86_64/workstation
                   Version: 28.20180907.0 (2018-09-07 15:00:58)
                    Commit: 15c1d3ca7f58f6874e982d6f254fd3abb1204b7976477101695ad8787007debd

I just hit this same problem, and I needed to get hfsplus-tools installed in order to be able to mount rw:

[root@localhost Downloads]# grep /boot/efi /proc/mounts 
/dev/sda1 /boot/efi hfsplus ro,relatime,umask=22,uid=0,gid=0,nls=utf8 0 0
[root@localhost Downloads]# 
[root@localhost Downloads]# umount /boot/efi 
[root@localhost Downloads]# 
[root@localhost Downloads]# fsck.hfsplus /dev/sda1 
** /dev/sda1
   Executing fsck_hfs (version 540.1-Linux).
** Checking non-journaled HFS Plus Volume.
   The volume name is Linux HFS+ ESP
** Checking extents overflow file.
** Checking catalog file.
** Checking multi-linked files.
** Checking catalog hierarchy.
** Checking extended attributes file.
** Checking volume bitmap.
** Checking volume information.
** The volume Linux HFS+ ESP appears to be OK.
[root@localhost Downloads]# 
[root@localhost Downloads]# mount /boot/efi/
[root@localhost Downloads]# grep /boot/efi /proc/mounts 
/dev/sda1 /boot/efi hfsplus rw,relatime,umask=22,uid=0,gid=0,nls=utf8 0 0

Of course, it was a pain to get hfsplus-tools installed. I'm going to open up a PR to Silverblue to include the rpm in the base.

dustymabe commented 6 years ago

Opened PR here: https://pagure.io/workstation-ostree-config/pull-request/111

@sanjabonic - can we add a section to a SB troubleshooting guide somewhere on this issue? I'm sure more people are going to hit it.

Once the above PR is merged and we have something on the troubleshooting guide, I'd say we can close this. @miabbott WDYT?

miabbott commented 6 years ago

@dustymabe Yeah, that sounds like a plan. I opened a ticket on fedora-docs/silverblue to get things tracked, and I'll try to work on it in the near future: https://pagure.io/fedora-docs/silverblue/issue/20