M1cha closed this issue 2 years ago.
Can you show the output of findmnt from inside the container?
Of course:
~ # findmnt
TARGET SOURCE FSTYPE OPTIONS
/ /var/lib/lxd/containers/nice-chamois/rootfs
shiftfs rw,relatime,passthrough=3
├─/run tmpfs tmpfs rw,nosuid,nodev,size=791024k,nr_i
├─/dev none tmpfs rw,relatime,size=492k,mode=755,ui
│ ├─/dev/fuse devtmpfs[/fuse] devtmpfs rw,nosuid,noexec,relatime,size=10
│ ├─/dev/net/tun devtmpfs[/net/tun] devtmpfs rw,nosuid,noexec,relatime,size=10
│ ├─/dev/mqueue mqueue mqueue rw,nosuid,nodev,noexec,relatime
│ ├─/dev/lxd tmpfs tmpfs rw,relatime,size=100k,mode=755,in
│ ├─/dev/.lxd-mounts tmpfs[/nice-chamois] tmpfs rw,relatime,size=100k,mode=711,in
│ ├─/dev/full devtmpfs[/full] devtmpfs rw,nosuid,noexec,relatime,size=10
│ ├─/dev/null devtmpfs[/null] devtmpfs rw,nosuid,noexec,relatime,size=10
│ ├─/dev/random devtmpfs[/random] devtmpfs rw,nosuid,noexec,relatime,size=10
│ ├─/dev/tty devtmpfs[/tty] devtmpfs rw,nosuid,noexec,relatime,size=10
│ ├─/dev/urandom devtmpfs[/urandom] devtmpfs rw,nosuid,noexec,relatime,size=10
│ ├─/dev/zero devtmpfs[/zero] devtmpfs rw,nosuid,noexec,relatime,size=10
│ ├─/dev/pts devpts devpts rw,nosuid,noexec,relatime,gid=720
│ ├─/dev/ptmx devpts[/ptmx] devpts rw,nosuid,noexec,relatime,gid=720
│ └─/dev/console devpts[/0] devpts rw,nosuid,noexec,relatime,gid=720
├─/proc proc proc rw,nosuid,nodev,noexec,relatime
│ ├─/proc/sys/kernel/random/boot_id
│ │ none[/.lxc-boot-id] tmpfs ro,nosuid,nodev,noexec,relatime,s
│ ├─/proc/sys/fs/binfmt_misc proc[/sys/fs/binfmt_misc] proc rw,nosuid,nodev,noexec,relatime
│ ├─/proc/cpuinfo lxcfs[/proc/cpuinfo] fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ ├─/proc/diskstats lxcfs[/proc/diskstats] fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ ├─/proc/loadavg lxcfs[/proc/loadavg] fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ ├─/proc/meminfo lxcfs[/proc/meminfo] fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ ├─/proc/stat lxcfs[/proc/stat] fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ ├─/proc/swaps lxcfs[/proc/swaps] fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ └─/proc/uptime lxcfs[/proc/uptime] fuse.lxcf rw,nosuid,nodev,relatime,user_id=
└─/sys sysfs sysfs rw,relatime
├─/sys/fs/fuse/connections sysfs[/fs/fuse/connections] sysfs rw,nosuid,nodev,noexec,relatime
├─/sys/fs/pstore pstore pstore rw,nosuid,nodev,noexec,relatime
├─/sys/kernel/debug debugfs debugfs rw,nosuid,nodev,noexec,relatime
│ └─/sys/kernel/debug/tracing tracefs tracefs rw,nosuid,nodev,noexec,relatime
├─/sys/kernel/security securityfs securityf rw,nosuid,nodev,noexec,relatime
├─/sys/kernel/tracing sysfs[/kernel/tracing] sysfs rw,nosuid,nodev,noexec,relatime
├─/sys/fs/cgroup none cgroup2 rw,nosuid,nodev,noexec,relatime
└─/sys/devices/system/cpu/online
lxcfs[/sys/devices/system/cpu/online]
fuse.lxcf rw,nosuid,nodev,relatime,user_id=
Ah yes, it is really shiftfs and not idmapped mounts. That is very odd, because I would think that the 5.15 kernel on Alpine does support them.
That's because idmapped mounts are not supported by ZFS yet. Ubuntu seems to be able to use shiftfs on top of ZFS, though.
So you're using an Alpine VM, and the Alpine VM uses zfs as the root filesystem?
No, I'm using a physical aarch64 device with a squashfs (+overlaytmpfs) rootfs and a zfs storage pool for LXD.
So you're running an Alpine container on top of zfs, right?
Oh, you meant the guest. Yes, that's an images:alpine/3.16 container (not a VM).
Can you show me findmnt on the physical aarch64 machine? Unless that's something you'd rather not share.
Here you go (I removed unrelated containers):
# findmnt
TARGET SOURCE FSTYPE OPTIONS
/ overlayfs overlay rw,rela
├─/sys sysfs sysfs rw,nosu
│ ├─/sys/kernel/security securityfs securit rw,nosu
│ ├─/sys/kernel/debug debugfs debugfs rw,nosu
│ │ └─/sys/kernel/debug/tracing tracefs tracefs rw,nosu
│ ├─/sys/fs/pstore pstore pstore rw,nosu
│ └─/sys/fs/cgroup none cgroup2 rw,nosu
├─/dev devtmpfs devtmpf rw,nosu
│ ├─/dev/pts devpts devpts rw,nosu
│ ├─/dev/shm shm tmpfs rw,nosu
│ └─/dev/mqueue mqueue mqueue rw,nosu
├─/proc proc proc rw,nosu
├─/media/root-ro /dev/mmcblk1p7 squashf ro,rela
├─/media/root-rw root-tmpfs tmpfs rw,rela
├─/run tmpfs tmpfs rw,nosu
├─/var /dev/sda1 ext4 rw,rela
│ ├─/var/lib/lxcfs lxcfs fuse.lx rw,nosu
│ ├─/var/lib/lxd/shmounts tmpfs tmpfs rw,rela
│ ├─/var/lib/lxd/devlxd tmpfs tmpfs rw,rela
│ ├─/var/lib/lxd/storage-pools/btrfs /dev/sda3 btrfs rw,rela
│ └─/var/lib/lxd/storage-pools/default/containers/nice-chamois
│ default/containers/nice-chamois
│ zfs rw,rela
├─/boot /dev/mmcblk1p1 vfat rw,rela
└─/media/config /dev/mmcblk1p5 ext4 rw,rela
I managed to create a cloud-init based x64 LXD VM where the issue can be reproduced. To be clear: The issue happens inside the LXD container inside the LXD VM.
config:
  cloud-init.user-data: |
    #cloud-config
    write_files:
    - path: /etc/lxc/default.conf
      permissions: '0644'
      content: |
        lxc.net.0.type = empty
        lxc.idmap = u 0 100000 1000000000
        lxc.idmap = g 0 100000 1000000000
    - path: /etc/subuid
      permissions: '0644'
      content: |
        root:100000:1000000000
    - path: /etc/subgid
      permissions: '0644'
      content: |
        root:100000:1000000000
    - path: /etc/local.d/cgroup-initscope.start
      permissions: '0755'
      content: |
        #!/bin/sh
        mkdir -m 0755 -p /sys/fs/cgroup/init.scope
    - path: /etc/modules-load.d/shiftfs.conf
      permissions: '0644'
      content: |
        shiftfs
    runcmd:
    - echo "https://dl-cdn.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories
    - apk update
    - apk upgrade
    - apk add
      apparmor
      apparmor-profiles
      apparmor-utils
      chrony
      cloud-utils-growpart
      e2fsprogs
      e2fsprogs-extra
      eudev
      eudev-netifnames
      git
      linux-virt-dev
      lxcfs
      lxd-feature
      make
      nftables
      zfs
      zfs-udev
    - |
      cat >> /etc/rc.conf <<EOF
      rc_cgroup_mode="unified"
      rc_logger="YES"
      rc_parallel="YES"
      EOF
    - growpart /dev/sda 2
    - resize2fs /dev/sda2
    - rc-update del mdev sysinit
    - rc-update add localmount sysinit
    - rc-update add zfs-import sysinit
    - rc-update add zfs-mount sysinit
    - rc-update add cgroups boot
    - rc-update add local boot
    - rc-update add chronyd default
    - rc-update add lxcfs default
    - rc-update add lxd default
    - git clone https://github.com/toby63/shiftfs-dkms.git -b k5.16
    - make -C shiftfs-dkms
    - ln -s /shiftfs-dkms/shiftfs.ko /lib/modules/$(uname -r)/
    - depmod -a
To reproduce:
lxc launch --vm images:alpine/edge/cloud -c security.secureboot=false alpine < alpine-shiftfs-bug.yaml
lxc exec alpine -- sh
lxd init --auto --storage-backend zfs
lxc launch images:alpine/3.16 a1
lxc exec a1 -- sh
touch /testfile
I'm closing this issue, not because we don't care about it, but because it's not a LXD bug.
I'm sure @brauner will still keep the chat going on here. Given we're not seeing this on Ubuntu, I wonder if it may be an incorrect port to 5.16 (Ubuntu is on 5.15)?
I don't know how easy it would be for you to transplant an Ubuntu 5.15 kernel onto your Alpine VM, but if doable, that'd be an easy way to see if it's a kernel problem or something odd with the mount layout in userspace.
FTR: Alpine is on kernel 5.15 as well, and shiftfs.c has the same checksum as in Ubuntu. The shiftfs repo just uses the same branch for 5.15 and 5.16.
I just booted the Alpine VM with Ubuntu's 5.15.0-43-generic kernel and it works :thinking:
Alpine uses an (almost) unpatched 5.15.59 kernel. Do you already know which of Ubuntu's patches might be necessary for this to work? If not, I guess I'm going to read the whole git history to try and find something relevant.
OK: Alpine kernel 5.15.39 works, and 5.15.59 doesn't. That means that Ubuntu will probably have the same issue as soon as the latest patch version gets merged. I'm going to start bisecting now.
It starts to break with 5.15.52. The commit that causes it is 38753e9173a5903e902c856b41fb325762bf5945.
I'm not yet sure why exactly that causes it, or where exactly the EOVERFLOW comes from. It's entirely possible that the problem is in ZFS and not in shiftfs: they have plenty of EOVERFLOWs in their code that make far more sense than shift_acl_ids inside shiftfs.c. I'm currently testing with ZFS 2.1.5.
Is there any way to force using shiftfs even when idmapped mounts are available so I can test this with a filesystem like ext4?
@stgraber IMO you should reopen this issue, since at this point I'm pretty sure that Ubuntu will have this issue after the next kernel update. Never mind, it still wouldn't be a LXD bug.
The EOVERFLOW comes from here. That totally makes sense, since the breaking commit changed the implementation of the function fsuidgid_has_mapping to check against fs_userns instead of init_user_ns.
I don't yet know why that's an issue, since I don't yet understand that code, but it sounds like shiftfs and idmapping conflict here: shiftfs basically assumes that idmappings don't exist.
Okay, I got it working with the latest shiftfs.c from the Ubuntu kinetic kernel. e1b92741ef11bccde558ac7b16d72981a1e020b7 fixed it, and the commit description matches everything I've seen so far.
To me that's good news, since it means I don't have to maintain my own Alpine kernel fork and can just update the shiftfs module instead.
@stgraber renames are still broken on the ubuntu kernel and this change is required:
diff --git a/shiftfs.c b/shiftfs.c
index a5338dc..46a7d05 100644
--- a/shiftfs.c
+++ b/shiftfs.c
@@ -632,10 +632,10 @@ static int shiftfs_rename(struct user_namespace *ns,
struct inode *loweri_dir_old = lowerd_dir_old->d_inode,
*loweri_dir_new = lowerd_dir_new->d_inode;
struct renamedata rd = {
- .old_mnt_userns = ns,
+ .old_mnt_userns = &init_user_ns,
.old_dir = loweri_dir_old,
.old_dentry = lowerd_old,
- .new_mnt_userns = ns,
+ .new_mnt_userns = &init_user_ns,
.new_dir = loweri_dir_new,
I'm not going to submit that to Ubuntu, since I think their contribution barrier is way too high due to their complicated processes, documentation, and software.
The fix referenced above in e1b92741ef11bccde558ac7b16d72981a1e020b7 leaves me slightly concerned. The analysis isn't correct. A stacking filesystem like shiftfs or overlayfs calls vfs_* helpers for the lower filesystem, and when it does so it needs to account for the properties of the lower filesystem, not of shiftfs. IOW, passing down information from the shiftfs layer is almost always a bug. Frankly, they wouldn't have noticed this, but on newer kernels an idmapped mount is identified by having either init_user_ns or fs_userns != mnt_userns attached to it. So if shiftfs is mounted in a userns, then fs_userns == mnt_userns (!= init_user_ns), meaning that they passed down shiftfs-specific information to the lower filesystem.
The fix they outlined means that you're still allowing shiftfs to be mounted on top of idmapped mounts, which means things are broken there as well, since the mount's idmapping isn't taken into account. So you either want to do what I did for overlayfs upstream to allow idmapped lower layers, or for now you at least want something like (untested):
diff --git a/shiftfs.c b/fixes.next
index a5338dc..71355d3 100644
--- a/shiftfs.c
+++ b/fixes.next
@@ -2083,6 +2083,17 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
cap_lower(cred_tmp->cap_effective, CAP_SYS_RESOURCE);
sbinfo->creator_cred = cred_tmp;
}
+
+ /*
+ * Supporting idmapped lower layers requires a decent amount of
+ * rework that involves passing down the mnt_userns from the
+ * lower layer into vfs_*() helpers.
+ */
+ if (is_idmapped_mnt(sbinfo->mnt)) {
+ err = -EINVAL;
+ printk(KERN_ERR "shiftfs: idmapped lower layers not supported\n");
+ goto out_put_path;
+	}
 } else {
 	/*
 	 * This leg executes if we're admin capable in the namespace,

However, the shiftfs logic is still relying on the fact that these functions need to use the main filesystem namespace.
Thanks for the suggested check. I think it is sufficient, because time should be invested into making every filesystem idmapping-compatible instead of doing the same for an obsolete filesystem (shiftfs). ZFS has a PR pending already, though the latest comment brings up valid concerns which could delay things further.
For those coming here from Google after having their containers blown up by the Ubuntu 5.15.0-52 update: the workaround is to stop lxd, rmmod shiftfs, add shiftfs to the modules blacklist, and restart lxd.
(For reasons I can't really understand, two containers were working fine, but one with seemingly identical config was getting EOVERFLOW.)
Required information
Issue description
All containers using shiftfs are unable to create any files in the rootfs because that always fails with EOVERFLOW:
If I nsenter -t INITPID -m sh, I can see that at least the UIDs are actually correct:

Steps to reproduce

I'm not sure why Alpine has that issue and an x86 Ubuntu VM using snap doesn't. Hints welcome. I'll try to reproduce the issue in an x86 Alpine VM if nobody has any immediate ideas.