M1cha commented 2 years ago

Required information

Distribution: Alpine
Distribution version: edge

The output of "lxc info":

config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses: []
  architectures:
  - aarch64
  - armv7l
  certificate: REDACTED
  certificate_fingerprint: 77144852aa5a4705ad12c28520e193e826bf6323654d0c726cee763a9cd95813
  driver: lxc | qemu
  driver_version: 4.0.12 | 7.0.0
  firewall: nftables
  kernel: Linux
  kernel_architecture: aarch64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "true"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.15.59-0-lts
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Alpine Linux
  os_version: 3.17_alpha20220715
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: lxd
  server_pid: 3207
  server_version: "5.2"
  storage: btrfs | zfs
  storage_version: 5.18.1 | 2.1.5-1
  storage_supported_drivers:
  - name: btrfs
    version: 5.18.1
    remote: false
  - name: dir
    version: "1"
    remote: false
  - name: zfs
    version: 2.1.5-1
    remote: false

Issue description

All containers using shiftfs are unable to create any files in the rootfs because that always fails with EOVERFLOW:

# mkdir /tmp/test
mkdir: can't create directory '/tmp/test': Value too large for data type

If I run nsenter -t INITPID -m sh, I can see that at least the UIDs are actually correct:

# ls -lah
total 72K
drwxr-xr-x   19 655360   655360        19 Aug  1 13:04 .
drwxr-xr-x   19 655360   655360        19 Aug  1 13:04 ..
drwxr-xr-x    2 655360   655360        88 Aug  1 13:01 bin
drwxr-xr-x    7 655360   655360       440 Aug 10 06:31 dev
drwxr-xr-x   21 655360   655360        48 Aug  1 13:04 etc
drwxr-xr-x    2 655360   655360         2 May 23 16:53 home
drwxr-xr-x    8 655360   655360        19 Aug  1 13:01 lib
drwxr-xr-x    5 655360   655360         5 May 23 16:53 media
drwxr-xr-x    2 655360   655360         2 May 23 16:53 mnt
drwxr-xr-x    2 655360   655360         2 May 23 16:53 opt
dr-xr-xr-x  330 root     root           0 Aug 10 06:31 proc
drwx------    2 655360   655360         2 May 23 16:53 root
drwxr-xr-x    4 655360   655360       240 Aug 10 06:31 run
drwxr-xr-x    2 655360   655360       106 Aug  1 13:01 sbin
drwxr-xr-x    2 655360   655360         2 May 23 16:53 srv
dr-xr-xr-x   13 root     root           0 Aug 10 06:31 sys
drwxrwxrwt    2 655360   655360         2 May 23 16:53 tmp
drwxr-xr-x    8 655360   655360         8 Aug  1 13:01 usr
drwxr-xr-x   11 655360   655360        12 Aug 10 06:31 var

Steps to reproduce

I'm not sure why Alpine has that issue and an x86 Ubuntu VM using snap doesn't. Hints welcome. I'll try to reproduce the issue in an x86 Alpine VM if nobody has any immediate ideas.

brauner commented 2 years ago

Can you show the output of findmnt from inside the container?

M1cha commented 2 years ago

> Can you show the output of findmnt from inside the container?

Of course:

~ # findmnt
TARGET                          SOURCE                      FSTYPE    OPTIONS
/                               /var/lib/lxd/containers/nice-chamois/rootfs
                                                            shiftfs   rw,relatime,passthrough=3
├─/run                          tmpfs                       tmpfs     rw,nosuid,nodev,size=791024k,nr_i
├─/dev                          none                        tmpfs     rw,relatime,size=492k,mode=755,ui
│ ├─/dev/fuse                   devtmpfs[/fuse]             devtmpfs  rw,nosuid,noexec,relatime,size=10
│ ├─/dev/net/tun                devtmpfs[/net/tun]          devtmpfs  rw,nosuid,noexec,relatime,size=10
│ ├─/dev/mqueue                 mqueue                      mqueue    rw,nosuid,nodev,noexec,relatime
│ ├─/dev/lxd                    tmpfs                       tmpfs     rw,relatime,size=100k,mode=755,in
│ ├─/dev/.lxd-mounts            tmpfs[/nice-chamois]        tmpfs     rw,relatime,size=100k,mode=711,in
│ ├─/dev/full                   devtmpfs[/full]             devtmpfs  rw,nosuid,noexec,relatime,size=10
│ ├─/dev/null                   devtmpfs[/null]             devtmpfs  rw,nosuid,noexec,relatime,size=10
│ ├─/dev/random                 devtmpfs[/random]           devtmpfs  rw,nosuid,noexec,relatime,size=10
│ ├─/dev/tty                    devtmpfs[/tty]              devtmpfs  rw,nosuid,noexec,relatime,size=10
│ ├─/dev/urandom                devtmpfs[/urandom]          devtmpfs  rw,nosuid,noexec,relatime,size=10
│ ├─/dev/zero                   devtmpfs[/zero]             devtmpfs  rw,nosuid,noexec,relatime,size=10
│ ├─/dev/pts                    devpts                      devpts    rw,nosuid,noexec,relatime,gid=720
│ ├─/dev/ptmx                   devpts[/ptmx]               devpts    rw,nosuid,noexec,relatime,gid=720
│ └─/dev/console                devpts[/0]                  devpts    rw,nosuid,noexec,relatime,gid=720
├─/proc                         proc                        proc      rw,nosuid,nodev,noexec,relatime
│ ├─/proc/sys/kernel/random/boot_id
│ │                             none[/.lxc-boot-id]         tmpfs     ro,nosuid,nodev,noexec,relatime,s
│ ├─/proc/sys/fs/binfmt_misc    proc[/sys/fs/binfmt_misc]   proc      rw,nosuid,nodev,noexec,relatime
│ ├─/proc/cpuinfo               lxcfs[/proc/cpuinfo]        fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ ├─/proc/diskstats             lxcfs[/proc/diskstats]      fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ ├─/proc/loadavg               lxcfs[/proc/loadavg]        fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ ├─/proc/meminfo               lxcfs[/proc/meminfo]        fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ ├─/proc/stat                  lxcfs[/proc/stat]           fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ ├─/proc/swaps                 lxcfs[/proc/swaps]          fuse.lxcf rw,nosuid,nodev,relatime,user_id=
│ └─/proc/uptime                lxcfs[/proc/uptime]         fuse.lxcf rw,nosuid,nodev,relatime,user_id=
└─/sys                          sysfs                       sysfs     rw,relatime
  ├─/sys/fs/fuse/connections    sysfs[/fs/fuse/connections] sysfs     rw,nosuid,nodev,noexec,relatime
  ├─/sys/fs/pstore              pstore                      pstore    rw,nosuid,nodev,noexec,relatime
  ├─/sys/kernel/debug           debugfs                     debugfs   rw,nosuid,nodev,noexec,relatime
  │ └─/sys/kernel/debug/tracing tracefs                     tracefs   rw,nosuid,nodev,noexec,relatime
  ├─/sys/kernel/security        securityfs                  securityf rw,nosuid,nodev,noexec,relatime
  ├─/sys/kernel/tracing         sysfs[/kernel/tracing]      sysfs     rw,nosuid,nodev,noexec,relatime
  ├─/sys/fs/cgroup              none                        cgroup2   rw,nosuid,nodev,noexec,relatime
  └─/sys/devices/system/cpu/online
                                lxcfs[/sys/devices/system/cpu/online]
                                                            fuse.lxcf rw,nosuid,nodev,relatime,user_id=

brauner commented 2 years ago

Ah yes, it is really shiftfs and not idmapped mounts. That is very odd because I would think that the 5.15 kernel on Alpine does support them.
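
A minimal sketch of how the same check could be scripted rather than read off findmnt: it scans text in the `/proc/self/mounts` format and reports every mount point with a given filesystem type. The function name `mounts_of_type` and the sample input are assumptions for illustration; the sample is modelled on the findmnt output above and is not captured from the affected system.

```python
from typing import List


def mounts_of_type(mounts_text: str, fstype: str) -> List[str]:
    """Return the mount points whose filesystem type equals fstype."""
    targets = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/self/mounts fields: source, target, fstype, options, dump, pass
        if len(fields) >= 3 and fields[2] == fstype:
            targets.append(fields[1])
    return targets


# Illustrative input only. On a live system one would instead read
# open("/proc/self/mounts").read() inside the container.
SAMPLE = """\
/var/lib/lxd/containers/nice-chamois/rootfs / shiftfs rw,relatime,passthrough=3 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
"""

print(mounts_of_type(SAMPLE, "shiftfs"))
```

A root filesystem of type shiftfs, as in the sample, indicates that the container is not using an idmapped mount for its rootfs.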

M1cha commented 2 years ago

That's because idmapped mounts are not supported by ZFS yet. Ubuntu seems to be able to use shiftfs on top of ZFS though.

brauner commented 2 years ago

So you're using an Alpine VM and the Alpine VM uses ZFS as the root filesystem?

M1cha commented 2 years ago

No, I'm using a physical aarch64 device with a squashfs (+overlay tmpfs) rootfs and a ZFS storage pool for LXD.

brauner commented 2 years ago

> No, I'm using a physical aarch64 device with a squashfs (+overlay tmpfs) rootfs and a ZFS storage pool for LXD.

So you're running an Alpine container on top of zfs, right?

M1cha commented 2 years ago

Oh, you meant the guest. Yes, that's an images:alpine/3.16 container (not a VM).

brauner commented 2 years ago

Can you show me findmnt on the physical aarch64 device, unless that's something you'd rather not share?

M1cha commented 2 years ago

Here you go (I removed unrelated containers):

# findmnt
TARGET                                                     SOURCE                       FSTYPE  OPTIONS
/                                                          overlayfs                    overlay rw,rela
├─/sys                                                     sysfs                        sysfs   rw,nosu
│ ├─/sys/kernel/security                                   securityfs                   securit rw,nosu
│ ├─/sys/kernel/debug                                      debugfs                      debugfs rw,nosu
│ │ └─/sys/kernel/debug/tracing                            tracefs                      tracefs rw,nosu
│ ├─/sys/fs/pstore                                         pstore                       pstore  rw,nosu
│ └─/sys/fs/cgroup                                         none                         cgroup2 rw,nosu
├─/dev                                                     devtmpfs                     devtmpf rw,nosu
│ ├─/dev/pts                                               devpts                       devpts  rw,nosu
│ ├─/dev/shm                                               shm                          tmpfs   rw,nosu
│ └─/dev/mqueue                                            mqueue                       mqueue  rw,nosu
├─/proc                                                    proc                         proc    rw,nosu
├─/media/root-ro                                           /dev/mmcblk1p7               squashf ro,rela
├─/media/root-rw                                           root-tmpfs                   tmpfs   rw,rela
├─/run                                                     tmpfs                        tmpfs   rw,nosu
├─/var                                                     /dev/sda1                    ext4    rw,rela
│ ├─/var/lib/lxcfs                                         lxcfs                        fuse.lx rw,nosu
│ ├─/var/lib/lxd/shmounts                                  tmpfs                        tmpfs   rw,rela
│ ├─/var/lib/lxd/devlxd                                    tmpfs                        tmpfs   rw,rela
│ ├─/var/lib/lxd/storage-pools/btrfs                       /dev/sda3                    btrfs   rw,rela
│ └─/var/lib/lxd/storage-pools/default/containers/nice-chamois
│                                                          default/containers/nice-chamois
│                                                                                       zfs     rw,rela
├─/boot                                                    /dev/mmcblk1p1               vfat    rw,rela
└─/media/config                                            /dev/mmcblk1p5               ext4    rw,rela

M1cha commented 2 years ago

I managed to create a cloud-init based x64 LXD VM where the issue can be reproduced. To be clear: The issue happens inside the LXD container inside the LXD VM.

config:
  cloud-init.user-data: |
    #cloud-config
    write_files:
      - path: /etc/lxc/default.conf
        permissions: '0644'
        content: |
          lxc.net.0.type = empty
          lxc.idmap = u 0 100000 1000000000
          lxc.idmap = g 0 100000 1000000000

      - path: /etc/subuid
        permissions: '0644'
        content: |
          root:100000:1000000000

      - path: /etc/subgid
        permissions: '0644'
        content: |
          root:100000:1000000000

      - path: /etc/local.d/cgroup-initscope.start
        permissions: '0755'
        content: |
          #!/bin/sh
          mkdir -m 0755 -p /sys/fs/cgroup/init.scope

      - path: /etc/modules-load.d/shiftfs.conf
        permissions: '0644'
        content: |
          shiftfs

    runcmd:
      - echo "https://dl-cdn.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories
      - apk update
      - apk upgrade
      - apk add
        apparmor
        apparmor-profiles
        apparmor-utils
        chrony
        cloud-utils-growpart
        e2fsprogs
        e2fsprogs-extra
        eudev
        eudev-netifnames
        git
        linux-virt-dev
        lxcfs
        lxd-feature
        make
        nftables
        zfs
        zfs-udev

      - |
        cat >> /etc/rc.conf <<EOF
        rc_cgroup_mode="unified"
        rc_logger="YES"
        rc_parallel="YES"
        EOF

      - growpart /dev/sda 2
      - resize2fs /dev/sda2

      - rc-update del mdev sysinit
      - rc-update add localmount sysinit
      - rc-update add zfs-import sysinit
      - rc-update add zfs-mount sysinit

      - rc-update add cgroups boot
      - rc-update add local boot

      - rc-update add chronyd default
      - rc-update add lxcfs default
      - rc-update add lxd default

      - git clone https://github.com/toby63/shiftfs-dkms.git -b k5.16
      - make -C shiftfs-dkms
      - ln -s /shiftfs-dkms/shiftfs.ko /lib/modules/$(uname -r)/
      - depmod -a

To reproduce:

lxc launch --vm images:alpine/edge/cloud -c security.secureboot=false alpine < alpine-shiftfs-bug.yaml
lxc exec alpine -- sh
lxd init --auto --storage-backend zfs
lxc launch images:alpine/3.16 a1
lxc exec a1 -- sh
touch /testfile

stgraber commented 2 years ago

I'm closing this issue, not because we don't care about it, but because it's not a LXD bug.

I'm sure @brauner will still keep the chat going on here. Given we're not seeing this on Ubuntu, I wonder if it may be an incorrect port to 5.16 (Ubuntu is on 5.15)?

I don't know how easy it would be for you to transplant an Ubuntu 5.15 kernel onto your Alpine VM, but if doable, that'd be an easy way to see if it's a kernel problem or something odd with the mount layout in userspace.

M1cha commented 2 years ago

FTR: Alpine is on kernel 5.15 as well, and shiftfs.c has the same checksum as in Ubuntu. The shiftfs repo just uses the same branch for 5.15 and 5.16.

M1cha commented 2 years ago

I just booted the Alpine VM with Ubuntu's 5.15.0-43-generic and it works :thinking: Alpine uses an (almost) unpatched 5.15.59 kernel. Do you already know which of Ubuntu's patches might be necessary for this to work? If not, I guess I'm gonna read the whole git history to try and find something relevant.

M1cha commented 2 years ago

OK, Alpine kernel 5.15.39 works and 5.15.59 doesn't. That means that Ubuntu will probably have the same issue as soon as the latest patch version gets merged.

I'm gonna start bisecting now.

M1cha commented 2 years ago

It starts to break with 5.15.52. The commit that causes that is 38753e9173a5903e902c856b41fb325762bf5945.

I'm not yet sure why exactly that causes it or where exactly the EOVERFLOW comes from. It's entirely possible that the problem is in ZFS and not in shiftfs. They have plenty of EOVERFLOWs in their code that make way more sense than shift_acl_ids inside shiftfs.c. I'm currently testing with ZFS 2.1.5.

Is there any way to force using shiftfs even when idmapped mounts are available so I can test this with a filesystem like ext4?

@stgraber ~~IMO you should reopen this issue since at this point I'm pretty sure that Ubuntu will have this issue after the next kernel update.~~ nvm. It still wouldn't be a LXD bug.

M1cha commented 2 years ago

The EOVERFLOW comes from here.

That totally makes sense since the breaking commit changed the implementation of the function fsuidgid_has_mapping to check against fs_userns instead of init_user_ns.

I don't yet know why that's an issue, since I don't yet understand that code, but it sounds like shiftfs and idmapping have some sort of conflict here and that shiftfs basically assumes that idmappings don't exist.
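
The effect of checking an ID against a restricted namespace instead of init_user_ns can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel's logic: the `UserNs` type and `check_create` function are hypothetical, each namespace has a single map extent, and which ID the kernel actually tests in the shiftfs case is not established in this thread. The container range is the one from the lxc.idmap lines in the reproducer.

```python
import errno
from dataclasses import dataclass


@dataclass(frozen=True)
class UserNs:
    """Toy user namespace with a single ID-map extent (hypothetical type)."""
    ns_first: int    # first ID as seen inside the namespace
    host_first: int  # first ID as seen on the host
    count: int       # number of consecutive IDs in the extent

    def has_mapping(self, host_id: int) -> bool:
        """True if host_id falls inside the mapped host range."""
        return self.host_first <= host_id < self.host_first + self.count


# Identity map over the whole ID range, standing in for init_user_ns.
INIT_USER_NS = UserNs(ns_first=0, host_first=0, count=2**32 - 1)
# Range taken from the lxc.idmap lines in the reproducer.
CONTAINER_NS = UserNs(ns_first=0, host_first=100000, count=1000000000)


def check_create(ns: UserNs, host_id: int) -> int:
    """Return 0 if host_id is representable in ns, else -EOVERFLOW."""
    return 0 if ns.has_mapping(host_id) else -errno.EOVERFLOW


for host_id in (0, 100000):
    print(host_id, check_create(INIT_USER_NS, host_id), check_create(CONTAINER_NS, host_id))
```

In this model every ID passes against the identity map, while an ID outside the container's host range is rejected. A change in which namespace is consulted can therefore turn a previously successful create into EOVERFLOW without any change to the IDs involved.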

M1cha commented 2 years ago

Okay, I got it working with the latest shiftfs.c from the Ubuntu kinetic kernel. e1b92741ef11bccde558ac7b16d72981a1e020b7 fixed it, and the commit description matches everything I've seen so far.

To me that's good news since it means I don't have to maintain my own alpine kernel fork and can just update the shiftfs module instead.

M1cha commented 2 years ago

@stgraber renames are still broken on the Ubuntu kernel and this change is required:

diff --git a/shiftfs.c b/shiftfs.c
index a5338dc..46a7d05 100644
--- a/shiftfs.c
+++ b/shiftfs.c
@@ -632,10 +632,10 @@ static int shiftfs_rename(struct user_namespace *ns,
        struct inode *loweri_dir_old = lowerd_dir_old->d_inode,
                     *loweri_dir_new = lowerd_dir_new->d_inode;
        struct renamedata rd = {
-               .old_mnt_userns = ns,
+               .old_mnt_userns = &init_user_ns,
                .old_dir        = loweri_dir_old,
                .old_dentry     = lowerd_old,
-               .new_mnt_userns = ns,
+               .new_mnt_userns = &init_user_ns,
                .new_dir        = loweri_dir_new,

I'm not going to submit that to Ubuntu since I think their contribution barrier is way too high due to their complicated processes, documentation and software.

brauner commented 2 years ago

The fix referenced above in e1b92741ef11bccde558ac7b16d72981a1e020b7 leaves me slightly concerned. The analysis isn't correct. A stacking filesystem like shiftfs or overlayfs calls vfs_* helpers for the lower filesystem. When it does so, it needs to account for the properties of the lower filesystem, not of shiftfs. IOW, passing down information from the shiftfs layer is almost always a bug. They frankly wouldn't have noticed this, but on newer kernels an idmapped mount is either identified by having the init_user_ns or fs_userns != mnt_userns attached to it. So if shiftfs is mounted in a userns then fs_userns == mnt_userns (!= init_user_ns), meaning that they passed down shiftfs-specific information to the lower filesystem.

The fix that they outlined means that you're still allowing shiftfs to be mounted on top of idmapped mounts, which means things are broken there as well, as the mount's idmapping isn't taken into account. So you either want to do what I did for overlayfs upstream to allow idmapped lower layers, or for now at least you want something like (untested):

diff --git a/shiftfs.c b/fixes.next
index a5338dc..71355d3 100644
--- a/shiftfs.c
+++ b/fixes.next
@@ -2083,6 +2083,17 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
                        cap_lower(cred_tmp->cap_effective, CAP_SYS_RESOURCE);
                        sbinfo->creator_cred = cred_tmp;
                }
+
+               /*
+                * Supporting idmapped lower layers requires a decent amount of
+                * rework that involves passing down the mnt_userns from the
+                * lower layer into vfs_*() helpers.
+                */
+               if (is_idmapped_mnt(sbinfo->mnt)) {
+                       err = -EINVAL;
+                       printk(KERN_ERR "shiftfs: idmapped lower layers not supported\n");
+                       goto out_put_path;
+               }

        } else {
                /*
                 * This leg executes if we're admin capable in the namespace,

M1cha commented 2 years ago

Thanks for the suggested check. I think it is sufficient, because time should be invested in making every filesystem compatible with idmapped mounts instead of doing the same for an obsolete filesystem (shiftfs). ZFS already has a PR pending, though the latest comment brings up valid concerns which could delay things further.

vuori commented 1 year ago

For those coming here from Google after having their containers blown up by the Ubuntu 5.15.0-52 update: the workaround is to stop lxd, rmmod shiftfs, add shiftfs to the module blacklist, and restart lxd.

(For reasons I can't really understand, two containers were working fine, but one with seemingly identical config was getting EOVERFLOW.)
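
A minimal sketch of two pieces of that workaround, assuming a standard Linux layout: checking whether a module is loaded by scanning text in the `/proc/modules` format, and producing the `blacklist` directive used in modprobe.d configuration files. The helper names, the sample input and the file name in the comment are assumptions for illustration. Stopping and restarting lxd is left out because the service manager differs between installations.

```python
def is_module_loaded(modules_text: str, name: str) -> bool:
    """Return True if a /proc/modules-style listing contains the module."""
    for line in modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0] == name:
            return True
    return False


def blacklist_line(name: str) -> str:
    """Return a modprobe.d 'blacklist' directive for the module."""
    return f"blacklist {name}\n"


# Illustrative input only. On a live system one would instead read
# open("/proc/modules").read() on the host.
SAMPLE = """\
shiftfs 28672 1 - Live 0x0000000000000000
zfs 3686400 6 - Live 0x0000000000000000
"""

if is_module_loaded(SAMPLE, "shiftfs"):
    # The directive would go into a file such as
    # /etc/modprobe.d/blacklist-shiftfs.conf (hypothetical file name).
    print(blacklist_line("shiftfs"), end="")
```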

mihalicyn commented 1 year ago

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/fs/shiftfs.c?h=master-next&id=6d1703cb41f2a7720ce227c5f86b9d2af989c7c4

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/fs/shiftfs.c?h=master-next&id=4246c2678b2271f769b124e5a0c410e8a0fd5c3c

canonical / lxd

shiftfs EOVERFLOW #10764
