canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Invalid source directory for btrfs storage pool after snap upgrade #6359

Closed Lesterpig closed 5 years ago

Lesterpig commented 5 years ago

Required information

config:
  cluster.https_address: 192.168.1.8:8443
  core.https_address: 192.168.1.8:8443
  core.trust_password: true
  images.auto_update_interval: "0"
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 192.168.1.8:8443
  architectures:
  - armv7l
  certificate: <hidden>
  certificate_fingerprint: <hidden>
  driver: lxc
  driver_version: 3.2.1
  kernel: Linux
  kernel_architecture: armv7l
  kernel_features:
    netnsid_getifaddrs: "false"
    seccomp_listener: "false"
    shiftfs: "false"
    uevent_injection: "false"
    unpriv_fscaps: "true"
  kernel_version: 4.14.133-odroidxu4
  lxc_features:
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    seccomp_notify: "true"
  project: default
  server: lxd
  server_clustered: true
  server_name: kili
  server_pid: 26675
  server_version: "3.18"
  storage: btrfs
  storage_version: "4.4"

Issue description

After the latest snap refresh, starting a container is impossible because the local storage pool points to an invalid directory.

Error: Common start logic: Failed to start device '<dev-name>': Source path /var/snap/lxd/common/lxd/storage-pools/local/custom/<dev-name> doesn't exist for device <dev-name>
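
For anyone triaging a similar report, here is a minimal sketch of how the path in that error could be checked on the affected node. It is not part of LXD: the regular expression, the function names and the `/storage-pools/` marker are illustrative assumptions based only on the error text quoted above. Note also that the snap may run LXD in its own mount namespace, so a check made from the host shell can differ from what the daemon sees.

```python
import os
import re

# Assumption: the error keeps the wording quoted in this issue.
# The pattern is illustrative and not taken from the LXD source.
ERROR_RE = re.compile(
    r"Source path (?P<path>\S+) doesn't exist for device (?P<device>\S+)"
)


def parse_missing_source(message):
    """Return (path, device) extracted from the error text, or None."""
    match = ERROR_RE.search(message)
    if match is None:
        return None
    return match.group("path"), match.group("device")


def pool_root(path, marker="/storage-pools/"):
    """Return the '<...>/storage-pools/<pool>' prefix of a volume path, or None."""
    head, sep, tail = path.partition(marker)
    if not sep or not tail:
        return None
    return head + marker + tail.split("/", 1)[0]


def describe_path(path):
    """Report whether a path exists and whether it is a mount point,
    as seen from the mount namespace of the calling process."""
    return {"exists": os.path.exists(path), "is_mount": os.path.ismount(path)}
```

Calling `describe_path(pool_root(path))` on the node named in the error shows whether the pool directory is currently a mount point from the caller's point of view. If it is a plain, empty directory, the custom volume underneath it cannot be found, which matches the symptom reported here.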

Steps to reproduce

  1. Configure LXD cluster with snap <= 3.17 and btrfs pool on /mnt/pool
  2. Create a custom volume <dev-name> for a container <container-name>
  3. Refresh snap
  4. lxc start <container-name>

Information to attach

$ lxd sql global "SELECT * FROM storage_pools_config";
+----+-----------------+---------+-------------------------+---------------+
| id | storage_pool_id | node_id |           key           |     value     |
+----+-----------------+---------+-------------------------+---------------+
| 7  | 3               | 1       | source                  | /mnt/data/lxd |
| 8  | 3               | 1       | volatile.initial_source | /mnt/data/lxd |
| 9  | 3               | 2       | source                  | /mnt/data/lxd |
| 10 | 3               | 2       | volatile.initial_source | /mnt/data/lxd |
+----+-----------------+---------+-------------------------+---------------+
$ lxc storage show local
config: {} <----------- source location missing?
description: ""
name: local
driver: btrfs
used_by:
<hidden>
status: Created
locations:
- kili
- fili
$ lxc storage info local
info:
  description: ""
  driver: btrfs
  name: local
  space used: 6.65GB <----------- This is not expected
  total space: 31.12GB <----------- This is not expected
used by:
  containers:
<hidden>
  images:
<hidden>
  profiles:
  - default
  - default-fan
  storage-pools:
  - local
  - local
  - local
  - local
$ lxc config show <container-name> --expanded
architecture: armv7l
config:
  boot.autostart.priority: "1"
  image.architecture: armhf
  image.description: Alpine 3.8 armhf (20181004_13:28)
  image.os: Alpine
  image.release: "3.8"
  image.serial: "20181004_13:28"
  limits.memory: 256MB
  limits.memory.swap: "false"
  volatile.base_image: b1f25d332abc823609988e9b4524e9c016fc8c088249561d3a2fd8e2d2568985
  volatile.eth0.host_name: veth7e26223d
  volatile.eth0.hwaddr: 00:16:3e:28:17:cd
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: moria2
    type: nic
  <dev-name>:
    path: /mnt
    pool: local
    source: <dev-name>
    type: disk
  root:
    path: /
    pool: local
    size: 10GB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
location: kili
metadata:
  class: task
  created_at: "2019-10-28T14:48:00.95041098+01:00"
  description: Starting container
  err: 'Common start logic: Failed to start device ''<dev-name>'': Source path /var/snap/lxd/common/lxd/storage-pools/local/custom/<dev-name>
    doesn''t exist for device <dev-name>'
  id: 2375a107-3e6e-45fe-a439-9d7de58ccfdd
  location: kili
  may_cancel: false
  metadata: null
  resources:
    containers:
    - /1.0/containers/<container-name>
  status: Failure
  status_code: 400
  updated_at: "2019-10-28T14:48:00.95041098+01:00"
timestamp: "2019-10-28T14:48:01.038094758+01:00"
type: operation

stgraber commented 5 years ago

What do you see if you do:

On host kili?

Also, can you show lxc config show --expanded <container name> and lxc storage volume list local?

Lesterpig commented 5 years ago

Thanks for your help!

lxc config show <container-name> --expanded
architecture: armv7l
config:
  boot.autostart.priority: "1"
  image.architecture: armhf
  image.description: Alpine 3.8 armhf (20181004_13:28)
  image.os: Alpine
  image.release: "3.8"
  image.serial: "20181004_13:28"
  limits.memory: 256MB
  limits.memory.swap: "false"
  volatile.base_image: b1f25d332abc823609988e9b4524e9c016fc8c088249561d3a2fd8e2d2568985
  volatile.eth0.host_name: veth7e26223d
  volatile.eth0.hwaddr: 00:16:3e:28:17:cd
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: moria2
    type: nic
  <dev-name>:
    path: /mnt
    pool: local
    source: <dev-name>
    type: disk
  root:
    path: /
    pool: local
    size: 10GB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
container,<container-name>,,1,kili
custom,<dev-name>,,1,kili

I restarted lxc multiple times, killing leftover processes (lxcfs) when needed, but this completely crashed my cluster: all containers ended up in the STOPPED state and could not start because of the aforementioned issue. So I went the hard way and rebooted all nodes, which solved the issue.

So in the end, I think the snap did not manage to remount the storage pool after the upgrade and multiple restarts. Not sure if this is a bug, please close the issue if appropriate. :+1:

stgraber commented 5 years ago

Ah, ok, so hopefully this was the mount propagation issue we fixed in the snap packaging a bit over a week ago and now that you've restarted those systems, the mount table makes sense again and things will behave going forward.
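
To check whether the mount table "makes sense" for a pool, one option is to inspect `/proc/self/mountinfo`. This is a rough sketch rather than anything LXD ships: it parses the standard mountinfo line format and reports the filesystem type and propagation tags (`shared:N`, `master:N`) for a given mount point. The `Mount` class, the function names and the `SAMPLE` text (including its device name and mount IDs) are hypothetical and only illustrate the format.

```python
from dataclasses import dataclass

# Hypothetical sample in /proc/<pid>/mountinfo format; not taken from the reporter's system.
SAMPLE = """\
36 25 0:31 / /mnt/data rw,relatime shared:20 - btrfs /dev/sda1 rw,space_cache
40 36 0:31 /lxd /var/snap/lxd/common/lxd/storage-pools/local rw,relatime - btrfs /dev/sda1 rw
"""


@dataclass
class Mount:
    mount_point: str
    fs_type: str
    source: str
    propagation: tuple  # optional fields such as ("shared:20",); empty means private


def parse_mountinfo(text):
    """Parse mountinfo text into Mount records, skipping malformed lines."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        try:
            # Optional fields start at index 6 and end at the "-" separator.
            sep = fields.index("-", 6)
        except ValueError:
            continue
        if len(fields) < sep + 3:
            continue
        mounts.append(
            Mount(
                mount_point=fields[4],
                fs_type=fields[sep + 1],
                source=fields[sep + 2],
                propagation=tuple(fields[6:sep]),
            )
        )
    return mounts


def find_mount(mounts, path):
    """Return the last mount at exactly `path` (later entries shadow earlier ones)."""
    found = None
    for mount in mounts:
        if mount.mount_point == path:
            found = mount
    return found


def read_mountinfo(path="/proc/self/mountinfo"):
    """Read the mount table of the current process (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

On a live system, `find_mount(parse_mountinfo(read_mountinfo()), pool_dir)` returning `None` would mean nothing is mounted at the pool directory in the namespace of the process running the check.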