canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Moving instances between projects can require double storage capacity #10590

Open jsimpso opened 2 years ago

jsimpso commented 2 years ago

Issue description

When a VM's disk is larger than the space available in the LXD storage backend, moving that VM between projects results in LXD attempting a full copy of the VM and failing:

jsimpso@alcazar:~$ lxc storage list
+---------+--------+------------------------------------------------+-------------+---------+---------+
|  NAME   | DRIVER |                     SOURCE                     | DESCRIPTION | USED BY |  STATE  |
+---------+--------+------------------------------------------------+-------------+---------+---------+
| default | dir    | /var/snap/lxd/common/lxd/storage-pools/default |             | 6       | CREATED |
+---------+--------+------------------------------------------------+-------------+---------+---------+

jsimpso@alcazar:~$ df -h /var/snap/lxd/common/lxd/storage-pools/default
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       110G   99G  4.7G  96% /

jsimpso@alcazar:~$ sudo ls -lh /var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/maas_alpha/root.img
-rw-r--r-- 1 root root 56G Jun 22 11:17 /var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/maas_alpha/root.img

jsimpso@alcazar:~$ lxc move alpha alpha --project maas --target-project default
Error: Migration operation failure: Create instance from copy: Failed to run: nice -n19 dd if=/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/maas_alpha/root.img of=/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/alpha/root.img bs=16M conv=nocreat iflag=direct oflag=direct: dd: error writing '/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/alpha/root.img': No space left on device
652+0 records in
651+0 records out
10921967616 bytes (11 GB, 10 GiB) copied, 78.3542 s, 139 MB/s

If other VMs are attempting to write to that storage backend when it fills up, they appear to stall, and LXD loses communication with them. Once the qemu process has been killed, LXD can re-launch the VM as expected.
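As a rough recovery sequence once this happens (shown in full in the transcripts below; <pid> and <vm> are placeholders):

ps aux | grep qemu-system   # find the stalled qemu process for the affected VM
sudo kill <pid>             # terminate it; LXD then sees the VM as stopped
lxc start <vm>              # the VM starts again normally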

Steps to reproduce

  1. Have two projects present so that we have something to move between

    jsimpso@alcazar:~$ lxc project list
    +----------------+--------+----------+-----------------+----------+-------------------------+---------+
    |      NAME      | IMAGES | PROFILES | STORAGE VOLUMES | NETWORKS |       DESCRIPTION       | USED BY |
    +----------------+--------+----------+-----------------+----------+-------------------------+---------+
    | default        | YES    | YES      | YES             | YES      | Default LXD project     | 3       |
    +----------------+--------+----------+-----------------+----------+-------------------------+---------+
    | maas (current) | NO     | YES      | NO              | NO       | Project managed by MAAS | 5       |
    +----------------+--------+----------+-----------------+----------+-------------------------+---------+
  2. A storage backend with relatively little space makes for easier testing; I tested with a 120GB disk (see the loop-device sketch after this list for one way to set that up).

  3. Initialise a few VMs, give one a bigger disk:

    lxc init ubuntu/jammy alpha --vm
    lxc config device override alpha root size=60GB
    lxc start alpha
    lxc launch ubuntu/jammy beta --vm
    lxc launch ubuntu/jammy gamma --vm
    lxc launch ubuntu/jammy delta --vm
  4. Grow the VM disk:

    lxc exec alpha -- dd if=/dev/zero of=anchor.file bs=1M count=50000 status=progress

    You should now have a VM whose disk is larger than the amount of available storage in the backend:

    jsimpso@alcazar:~$ df -h /var/snap/lxd/common/lxd/storage-pools/default
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda2       110G   61G   43G  59% /
    
    jsimpso@alcazar:~$ sudo ls -lh /var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/maas_alpha/root.img
    -rw-r--r-- 1 root root 56G Jun 22 09:00 /var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/maas_alpha/root.img
    
  5. Open a connection to each running VM (lxc exec <vm> bash)

  6. On two VMs, start slowly writing to disk. Then, move the large VM between projects:

    root@beta:~# dd if=/dev/zero of=anchor.file bs=1 count=5000000000000 status=progress
    root@delta:~# dd if=/dev/zero of=anchor.file bs=1 count=5000000000000 status=progress
    jsimpso@alcazar:~$ lxc move alpha alpha --project maas --target-project default
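(An aside on step 2: one way to get a deliberately small pool for testing is a loop-backed dir pool. This is a sketch, not part of the original report; the backing file, mount point and pool name are arbitrary.)

truncate -s 10G /tmp/lxd-test.img                   # sparse 10GB backing file
mkfs.ext4 -F /tmp/lxd-test.img                      # -F since this is a regular file, not a block device
sudo mkdir -p /mnt/lxd-test
sudo mount -o loop /tmp/lxd-test.img /mnt/lxd-test
lxc storage create test dir source=/mnt/lxd-test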

During the move, the two VMs attempting to write to disk become fully unresponsive, but the last one continues to respond to input.

We can then see that the qemu process is still running, but LXD has lost contact with it:

jsimpso@alcazar:~$ lxc move alpha alpha --project maas --target-project default
Error: Migration operation failure: Create instance from copy: Failed to run: nice -n19 dd if=/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/maas_alpha/root.img of=/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/alpha/root.img bs=16M conv=nocreat iflag=direct oflag=direct: dd: error writing '/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/alpha/root.img': No space left on device
652+0 records in
651+0 records out
10921967616 bytes (11 GB, 10 GiB) copied, 78.3542 s, 139 MB/s

jsimpso@alcazar:~$ lxc ls
+-------+---------+-------------------------+------+-----------------+-----------+
| NAME  |  STATE  |          IPV4           | IPV6 |      TYPE       | SNAPSHOTS |
+-------+---------+-------------------------+------+-----------------+-----------+
| alpha | STOPPED |                         |      | VIRTUAL-MACHINE | 0         |
+-------+---------+-------------------------+------+-----------------+-----------+
| beta  | STOPPED |                         |      | VIRTUAL-MACHINE | 0         |
+-------+---------+-------------------------+------+-----------------+-----------+
| delta | STOPPED |                         |      | VIRTUAL-MACHINE | 0         |
+-------+---------+-------------------------+------+-----------------+-----------+
| gamma | RUNNING | 192.168.37.166 (enp5s0) |      | VIRTUAL-MACHINE | 0         |
+-------+---------+-------------------------+------+-----------------+-----------+

jsimpso@alcazar:~$ lxc start beta
Error: Failed cleaning config drive mount path "/var/snap/lxd/common/lxd/devices/maas_beta/config.mount": Failed unmounting "/var/snap/lxd/common/lxd/devices/maas_beta/config.mount": Failed to unmount '/var/snap/lxd/common/lxd/devices/maas_beta/config.mount': device or resource busy
Try `lxc info --show-log beta` for more info
jsimpso@alcazar:~$ lxc info --show-log beta
Name: beta
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Created: 2022/06/22 11:11 AWST
Last Used: 2022/06/22 11:11 AWST
Error: open /var/snap/lxd/common/lxd/logs/maas_beta/qemu.log: no such file or directory

jsimpso@alcazar:~$ ps aux |  grep beta
root        9911  0.0  0.0  79904  3424 ?        Ssl  11:11   0:00 /snap/lxd/23037/bin/virtiofsd --fd=3 -o source=/var/snap/lxd/common/lxd/devices/maas_beta/config.mount
lxd         9927 19.4  7.6 1934532 614632 ?      Sl   11:11   2:27 /snap/lxd/23037/bin/qemu-system-x86_64 -S -name beta -uuid 159e794b-17e6-4658-869e-73f996996395 -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/maas_beta/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/maas_beta/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/maas_beta/qemu.pid -D /var/snap/lxd/common/lxd/logs/maas_beta/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd
root        9930  0.0  0.1 1211832 13584 ?       Sl   11:11   0:00 /snap/lxd/23037/bin/virtiofsd --fd=3 -o source=/var/snap/lxd/common/lxd/devices/maas_beta/config.mount
jsimpso    10698  0.0  0.2 1384848 17496 pts/1   Sl+  11:14   0:00 lxc exec beta bash

Killing the process returns the console:

jsimpso@alcazar:~$ sudo kill 9927
jsimpso@alcazar:~$
----
root@beta:~# dd if=/dev/zero of=anchor.file bs=1 count=5000000000000 status=progress
16908336 bytes (17 MB, 16 MiB) copied, 61 s, 277 kB/sError: read vsock vm(4294967295):1873106686->vm(31):8443: connection reset by peer
jsimpso@alcazar:~$

And the VM can be started up again:

jsimpso@alcazar:~$ lxc ls
+-------+---------+-------------------------+------+-----------------+-----------+
| NAME  |  STATE  |          IPV4           | IPV6 |      TYPE       | SNAPSHOTS |
+-------+---------+-------------------------+------+-----------------+-----------+
| alpha | STOPPED |                         |      | VIRTUAL-MACHINE | 0         |
+-------+---------+-------------------------+------+-----------------+-----------+
| beta  | RUNNING |                         |      | VIRTUAL-MACHINE | 0         |
+-------+---------+-------------------------+------+-----------------+-----------+
| delta | STOPPED |                         |      | VIRTUAL-MACHINE | 0         |
+-------+---------+-------------------------+------+-----------------+-----------+
| gamma | RUNNING | 192.168.37.166 (enp5s0) |      | VIRTUAL-MACHINE | 0         |
+-------+---------+-------------------------+------+-----------------+-----------+
jsimpso@alcazar:~$
stgraber commented 2 years ago

So the fact that LXD attempts the migration is normal. We never try to guess the disk space and instead just go ahead; if we hit ENOSPC, that's our cue that there isn't enough space.

The reason for that is that different storage backends store things very, very differently. You can easily do cat /dev/zero > out.img or the like on a VM stored on a dir storage backend, make it grow to 100GiB, then move it to a project that uses a zfs storage pool, and ZFS will happily store that in something like 500MiB thanks to the default inline compression.
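A quick way to see that difference (illustrative only; the file name is arbitrary):

dd if=/dev/zero of=out.img bs=1M count=1024   # write 1GiB of zeros
ls -lh out.img                                # apparent size: 1.0G
du -h out.img                                 # on a dir (ext4) pool: a full 1.0G of real blocks
# On a ZFS dataset with the default inline compression, du would instead
# report only a trivial amount, since runs of zeros compress to almost nothing.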

Now, these days a project move shouldn't actually cause any data to be copied, so we'd need to check on current LXD to see what's going on there. It's also obviously problematic if QEMU just crashes or stops responding in such a situation when it itself isn't supposed to be writing anything to disk.
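Conceptually, a same-pool project move on the dir driver should amount to little more than a rename plus a database update rather than a copy. An illustration of that expected behaviour using the paths from this report (not what LXD actually does here):

mv /var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/maas_alpha \
   /var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/alpha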

tomponline commented 2 years ago

Yeah, seems like the target project flag is confusing the same-pool move detection somehow.

tomponline commented 1 year ago

Right, so I've taken a look at what's going on here. The behaviour is that when moving instances on the dir pool, the storage subsystem invokes CreateInstanceFromCopy, which detects that the source and target are in the same pool and switches to "same-pool mode", which in turn invokes CreateVolumeFromCopy on the storage driver.

In the case of a storage driver that supports optimized copies (snapshots), this is very quick and doesn't take much disk space, as the copy is just a snapshot of the source instance (and in the case of ZFS, the source instance's volumes are kept around but hidden, as they are still referenced by the new instance's name).

In the case of the dir storage driver, however, CreateVolumeFromCopy invokes genericVFSCopyVolume, which uses the dd command to copy the VM's root disk device.
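Roughly what the two code paths boil down to (an illustrative sketch; the ZFS dataset names assume LXD's <pool>/virtual-machines/<project>_<instance> layout):

# Snapshot-capable driver (e.g. ZFS): an optimized copy is a snapshot plus
# clone, near-instant and consuming almost no extra space
zfs snapshot default/virtual-machines/maas_alpha@copy
zfs clone default/virtual-machines/maas_alpha@copy default/virtual-machines/alpha

# dir driver: genericVFSCopyVolume falls back to a full byte-for-byte copy,
# which is the dd command seen in the error above
nice -n19 dd if=/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/maas_alpha/root.img of=/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/alpha/root.img bs=16M conv=nocreat iflag=direct oflag=direct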

Basically the problem here is that there isn't a "move" concept in the storage subsystem (https://github.com/lxc/lxd/blob/632b3839b3388cc89df0ffe96d7c7c9134f00a7b/lxd/storage/pool_interface.go#L39-L145), and so lxc move is effectively just a lxc copy followed by a lxc delete.
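In other words, the cross-project move in this report behaves roughly like:

lxc copy alpha alpha --project maas --target-project default
lxc delete alpha --project maas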

We can see where it's implemented on the server side:

https://github.com/lxc/lxd/blob/632b3839b3388cc89df0ffe96d7c7c9134f00a7b/lxd/instance_post.go#L486-L553

So I'm going to put this down as a feature & maybe, as it's not a bug per se.