canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Disk device does not show nested mounts/datasets #9918

Closed: BobVul closed this issue 2 years ago

BobVul commented 2 years ago

Required information

config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    snip
    -----END CERTIFICATE-----
  certificate_fingerprint: snip
  driver: qemu | lxc
  driver_version: 6.1.1 | 4.0.12
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.10.0-10-amd64
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Debian GNU/Linux
  os_version: "11"
  project: default
  server: lxd
  server_clustered: false
  server_name: test
  server_pid: 4052
  server_version: "4.22"
  storage: zfs
  storage_version: 2.0.3-9
  storage_supported_drivers:
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.43.0
    remote: false
  - name: zfs
    version: 2.0.3-9
    remote: false
  - name: ceph
    version: 15.2.14
    remote: true
  - name: btrfs
    version: 5.4.1
    remote: false
  - name: cephfs
    version: 15.2.14
    remote: true
  - name: dir
    version: "1"
    remote: false

Issue description

A brief description of the problem. Should include what you were attempting to do, what you did, what happened and what you expected to see happen.

Steps to reproduce

  1. Create a ZFS filesystem with nested datasets, e.g.: tank/datastore and tank/datastore/media
  2. Configure permissions appropriately; for a simple test, chmod +w /tank/datastore and chown 100000:100000 /tank/datastore/media
  3. Mount the parent dataset in LXD using a disk device (see the repro sketch after these steps):
    datastore:
      path: /datastore
      source: /tank/datastore
      type: disk
  4. Check that the container has (write) access to the /datastore/media directory
  5. Restart the host OS (or otherwise cause a zpool export/import)
  6. Observe that after the host restart, the container can no longer write inside the nested mount and can no longer see any files that were inside it. From the host, everything looks normal and the nested mount is present. It appears the container only sees the empty directory that exists in the parent dataset (owned by host UID 0), not the mounted nested dataset.
  7. Going through a zfs umount tank/datastore/media followed by zfs mount tank/datastore/media cycle seems to fix it, until the next reboot.
  8. Try setting propagation: rprivate or propagation: rshared in the disk config, which results in the following errors:
    lxc foo 20220215174844.422 ERROR    utils - utils.c:safe_mount:1218 - Invalid argument - Failed to mount "/var/snap/lxd/common/lxd/devices/foo/disk.datastore.datastore" onto "/var/snap/lxd/common/lxc//datastore"
    lxc foo 20220215174844.422 ERROR    conf - conf.c:mount_entry:2406 - Invalid argument - Failed to mount "/var/snap/lxd/common/lxd/devices/foo/disk.datastore.datastore" on "/var/snap/lxd/common/lxc//datastore"
    lxc foo 20220215174844.422 ERROR    conf - conf.c:lxc_setup:4370 - Failed to setup mount entries
    lxc foo 20220215174844.422 ERROR    start - start.c:do_start:1275 - Failed to setup container "foo"
    lxc foo 20220215174844.422 ERROR    sync - sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 3)
    lxc foo 20220215174844.426 WARN     network - network.c:lxc_delete_network_priv:3617 - Failed to rename interface with index 0 from "eth0" to its initial name "veth9f8ba29b"
    lxc foo 20220215174844.426 ERROR    start - start.c:__lxc_start:2074 - Failed to spawn container "foo"
    lxc foo 20220215174844.426 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:877 - Received container state "ABORTING" instead of "RUNNING"
    lxc foo 20220215174844.426 WARN     start - start.c:lxc_abort:1039 - No such process - Failed to send SIGKILL via pidfd 17 for process 12360
    lxc foo 20220215174849.612 WARN     conf - conf.c:lxc_map_ids:3588 - newuidmap binary is missing
    lxc foo 20220215174849.612 WARN     conf - conf.c:lxc_map_ids:3594 - newgidmap binary is missing
    lxc 20220215174849.638 ERROR    af_unix - af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
    lxc 20220215174849.639 ERROR    commands - commands.c:lxc_cmd_rsp_recv_fds:127 - Failed to receive file descriptors for command "get_state"
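
For reference, a minimal shell sketch of the setup in steps 1-3, assuming a pool named tank and a container named foo (names taken from the report; the exact commands, in particular the lxc config device add invocation, are illustrative rather than copied from the original issue):

    # Create the parent dataset and a nested child dataset
    zfs create tank/datastore
    zfs create tank/datastore/media
    # Simple test permissions, as in step 2
    chmod +w /tank/datastore
    chown 100000:100000 /tank/datastore/media   # owned by the container's root UID/GID
    # Attach the parent dataset to the container as a disk device
    lxc config device add foo datastore disk source=/tank/datastore path=/datastore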

Information to attach

BobVul commented 2 years ago

Clearly I need to have my eyes checked; I'd completely missed the existence of a recursive option :\
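
For anyone hitting the same symptom: the recursive option referred to above is the disk device's recursive setting, which makes LXD mount the source path recursively so that nested mounts (here, the child dataset) follow it into the container. A minimal sketch, assuming the device layout from the steps above; the exact command form is illustrative:

    # Enable recursive mounting on the existing disk device
    lxc config device set foo datastore recursive=true

    # Equivalent device entry in the instance/profile config:
    #   datastore:
    #     path: /datastore
    #     source: /tank/datastore
    #     recursive: "true"
    #     type: disk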