canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Require size.state to be greater than memory for stateful stop/snapshot #9723

Closed: vosdev closed this issue 2 years ago

vosdev commented 2 years ago

Required information

Issue description

I evacuated a node for a reboot and restored it after it came back online.

~
root @ node1 # lxc cluster restore node3
Are you sure you want to restore cluster member "node3"? (yes/no) [default=no]: yes
Error: Failed to start instance "k8s-dev": write /var/snap/lxd/common/lxd/virtual-machines/k8s-dev/config/lxd-agent: copy_file_range: no space left on device

The VM k8s-dev lives on ceph.

~
root @ node1 # rbd info LXD/virtual-machine_k8s-dev
rbd image 'virtual-machine_k8s-dev':
        size 95 MiB in 24 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 1e5d2e267b5d41
        block_name_prefix: rbd_data.1e5d2e267b5d41
        format: 2
        features: layering
        op_features:
        flags:
        create_timestamp: Tue Aug  3 15:15:33 2021
        access_timestamp: Tue Aug  3 15:15:33 2021
        modify_timestamp: Tue Aug  3 15:15:33 2021
        parent: LXD/zombie_image_b1f0967a5c36cf51627a1e99f516c4612fc7ec5595e26cfef43d3e6aca06f35f_ext4@readonly
        overlap: 95 MiB

~
root @ node1 # rbd info LXD/virtual-machine_k8s-dev.block
rbd image 'virtual-machine_k8s-dev.block':
        size 19 GiB in 4769 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 1e5d284f63913a
        block_name_prefix: rbd_data.1e5d284f63913a
        format: 2
        features: layering
        op_features:
        flags:
        create_timestamp: Tue Aug  3 15:15:33 2021
        access_timestamp: Tue Aug  3 15:15:33 2021
        modify_timestamp: Tue Aug  3 15:15:33 2021
        parent: LXD/zombie_image_b1f0967a5c36cf51627a1e99f516c4612fc7ec5595e26cfef43d3e6aca06f35f_ext4.block@readonly
        overlap: 19 GiB

As shown below, the config image on ceph is 100% full because of the ./state file:

~
root @ node1 # rbd -p LXD map virtual-machine_k8s-dev
/dev/rbd4

~
root @ node1 # mount /dev/rbd4 /mnt

~
root @ node1 # df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd4        89M   87M     0 100% /mnt

~
root @ node1 # du -sh /mnt/*
4.0K    /mnt/agent-client.crt
4.0K    /mnt/agent-client.key
4.0K    /mnt/agent.crt
4.0K    /mnt/agent.key
4.0K    /mnt/backup.yaml
10M     /mnt/config
4.0K    /mnt/config.mount
16K     /mnt/lost+found
4.0K    /mnt/metadata.yaml
128K    /mnt/qemu.nvram
77M     /mnt/state
28K     /mnt/templates

Additional information:

This VM, k8s-dev, is the first VM on this node; the three other instances are containers and started without issue.

Another VM on this host (also on ceph) that has not started yet uses only 12% of its config image and has no ./state file.

The difference between the two VMs is that k8s-dev has migration.stateful: "true" in its config and the other VMs/containers do not.

I used this VM as a test for the new stateful migration feature, but never got it to work. The command would just hang indefinitely. I forgot about it until now. The state file is from November 17th, shortly after VM live migration became available and I started testing it.

-rw-r--r-- 1 root root 79880192 Nov 17 09:47 /mnt/state

Is LXD writing the VM's memory to the ceph config image in order to transfer it to another host? If so, a 100 MB quota isn't going to be enough.

Information to attach

Name: k8s-dev
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Location: node3
Created: 2021/08/03 15:15 CEST
Last Used: 2021/11/24 08:42 CET
Error: open /var/snap/lxd/common/lxd/logs/k8s-dev/qemu.log: no such file or directory
architecture: x86_64
config:
  boot.autostart.delay: "3"
  boot.host_shutdown_timeout: "90"
  cluster.evacuate: stop
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20210802_07:42)
  image.os: Ubuntu
  image.release: focal
  image.serial: "20210802_07:42"
  image.type: disk-kvm.img
  image.variant: cloud
  limits.cpu: "6"
  limits.memory: 4GB
  limits.memory.enforce: soft
  migration.stateful: "true"
  user.user-data: |
    #cloud-config
    packages:
      - vim
      - htop
      - facter
      - curl
      - ssh
    users:
      - name: ansible
        groups: sudo
        ssh_authorized_keys:
          - ssh-ed25519 AAAAC3NzaC1lZDI1NYQMZN Ansible Automation 08-2019
        sudo: ALL=(ALL) NOPASSWD:ALL
  volatile.base_image: b1f0967a5c36cf51627a1e99f516c4612fc7ec5595e26cfef43d3e6aca06f35f
  volatile.eth0.hwaddr: 00:16:3e:01:21:d7
  volatile.last_state.power: RUNNING
  volatile.uuid: c312ef06-77e9-4ea2-aced-bb26bfb4afa2
  volatile.vsock_id: "156"
devices:
  eth0:
    nictype: bridged
    parent: br121
    type: nic
  root:
    path: /
    pool: ceph
    type: disk
ephemeral: false
profiles:
- default
- limits.medium
- cloud-init
stateful: false
description: ""

The last entry in /var/snap/lxd/common/lxd/logs/lxd.log only shows that the previous container from the lxc cluster restore action started successfully, so it is not relevant.

vosdev commented 2 years ago

For now, can I safely remove the state file to get my VMs/containers to restore using the lxc cluster restore command?

stgraber commented 2 years ago

Right, so your stateful migration never failed; instead it was stuck on I/O due to lack of disk space. Then on restart, the instance config disk is so full that LXD can't actually start the instance back up.

Your best bet is to do lxc config device set INSTANCE root size.state 8GiB or something along those lines, which will allow enough space for stateful stop/snapshots/migration and also fix your current startup problem.

vosdev commented 2 years ago

That is a critical oversight on my part :-). It even says so in your news post! https://discuss.linuxcontainers.org/t/lxd-4-12-has-been-released/10424

Can I request an error plus automatic clean-up when this happens, instead of a stuck process and a "no space left on device" error at the next restart of the VM? Or an initial check: if memory > disk space, do not even attempt to transfer state.
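
For illustration, a minimal sketch in Go (the language LXD is written in) of the kind of early check I mean; checkStateFits and the values passed in are made-up names for this example, not LXD's actual API, and it simply compares the memory limit against the free space on the config volume before any state is written:

package main

import (
    "fmt"
    "syscall"
)

// checkStateFits returns an error if the filesystem holding configPath does not
// have at least memoryBytes of free space to receive the VM state file.
func checkStateFits(configPath string, memoryBytes uint64) error {
    var st syscall.Statfs_t
    err := syscall.Statfs(configPath, &st)
    if err != nil {
        return fmt.Errorf("failed to stat %q: %w", configPath, err)
    }

    free := st.Bavail * uint64(st.Bsize)
    if free < memoryBytes {
        return fmt.Errorf("only %d bytes free on the config volume, need %d for the state file", free, memoryBytes)
    }

    return nil
}

func main() {
    // Illustrative values only: the config path from this issue and a 4GB memory limit.
    err := checkStateFits("/var/snap/lxd/common/lxd/virtual-machines/k8s-dev/config", 4*1000*1000*1000)
    if err != nil {
        fmt.Println("refusing stateful operation:", err)
    }
}

With a check like that, the transfer would fail immediately with a clear message instead of hanging.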

I just changed the size.state device option and the VM works now:

~
root @ node1 # lxc config device override k8s-dev root
Device root overridden for k8s-dev

~
root @ node1 # lxc config device set k8s-dev root size.state 8GiB

Cheers :-). Now I can also start properly testing/using the live migration feature!

edit: The information was only mentioned in the release notes of 4.12, not 4.20. I have also been unable to find it in the docs.

stgraber commented 2 years ago

Yeah, I think it'd be reasonable for us to refuse to perform stateful stop/snapshots/migration unless size.state is >= size + limits.memory.
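
A minimal sketch of that rule in Go; parseSize is a simplified stand-in for LXD's real byte-size parsing (it only handles the suffixes that appear in this issue), and the values in main are purely illustrative:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseSize is a simplified stand-in for proper byte-size parsing; it only
// understands the "GiB"/"GB"/"MiB"/"MB" style values used in this issue.
func parseSize(s string) (int64, error) {
    suffixes := []struct {
        name string
        mult int64
    }{
        {"GiB", 1 << 30},
        {"MiB", 1 << 20},
        {"GB", 1000 * 1000 * 1000},
        {"MB", 1000 * 1000},
    }

    for _, suf := range suffixes {
        if strings.HasSuffix(s, suf.name) {
            n, err := strconv.ParseInt(strings.TrimSuffix(s, suf.name), 10, 64)
            if err != nil {
                return 0, err
            }
            return n * suf.mult, nil
        }
    }

    return strconv.ParseInt(s, 10, 64)
}

// validateStateSize expresses the proposed rule: refuse stateful
// stop/snapshot/migration unless size.state >= size + limits.memory.
func validateStateSize(sizeState, size, limitsMemory string) error {
    stateBytes, err := parseSize(sizeState)
    if err != nil {
        return err
    }

    sizeBytes, err := parseSize(size)
    if err != nil {
        return err
    }

    memBytes, err := parseSize(limitsMemory)
    if err != nil {
        return err
    }

    if stateBytes < sizeBytes+memBytes {
        return fmt.Errorf("size.state (%s) must be at least size + limits.memory (%s + %s)", sizeState, size, limitsMemory)
    }

    return nil
}

func main() {
    // Illustrative values loosely based on this issue: a ~100MiB config volume,
    // a 10GiB root disk and a 4GB memory limit.
    fmt.Println(validateStateSize("100MiB", "10GiB", "4GB"))
}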

tomponline commented 2 years ago

Yes, that would make it more user-friendly to discover that setting.

stgraber commented 2 years ago

I think we should do that on startup instead of during config validation, as the check needs the expanded config (the root disk device and limits.memory), which can come from profiles.

Mixing that in with profiles and the like could cause a lot of config update failures, so it's probably best to just fail startup by validating this in Start() of driver_qemu.go.
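
Reusing validateStateSize from the sketch above (and its fmt import), a rough outline of what failing at startup could look like; the instance type and start method here are placeholders for illustration, not the real code in driver_qemu.go:

// Placeholder type: only the fields needed for this sketch. The values are
// assumed to come from the expanded config, i.e. with profiles already applied.
type instance struct {
    statefulMigration bool   // expanded migration.stateful
    sizeState         string // expanded root device size.state
    size              string // expanded root device size
    limitsMemory      string // expanded limits.memory
}

// start fails early when the state volume cannot hold the root disk plus the
// VM memory, instead of letting a later stateful operation hang on a full
// filesystem.
func (i *instance) start() error {
    if i.statefulMigration {
        err := validateStateSize(i.sizeState, i.size, i.limitsMemory)
        if err != nil {
            return fmt.Errorf("refusing to start with stateful support: %w", err)
        }
    }

    // ... the normal startup path would continue here.
    return nil
}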