Hi, how do you determine that `--refresh` is ignored?
Both the time taken and the live reporting of the data transferred, which shows the full size of the disk being transferred each time. I used `time` to measure the exact transfer times between refreshes and compared them with the same commands on a ZFS-backed pool. The ZFS pool behaves very differently: repeated `--refresh` copies take almost no time at all, whereas the same commands on Ceph RBD transfer the full disk each time.

Here are some timings from a test of copying an instance to another node in the same cluster. In both cases I used the `debian/11` image. I took a second snapshot on the source instance before the last refresh; no other changes were made.
Ceph RBD:

```
root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1

real    1m3.129s
user    0m0.055s
sys     0m0.063s

root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1 --refresh

real    14m34.771s
user    0m0.288s
sys     0m0.337s

root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1 --refresh

real    15m11.626s
user    0m0.348s
sys     0m0.302s

root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1 --refresh

real    36m10.129s
user    0m0.679s
sys     0m0.996s
```
ZFS:

```
root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0

real    0m8.293s
user    0m0.038s
sys     0m0.011s

root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0 --refresh

real    0m1.577s
user    0m0.049s
sys     0m0.081s

root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0 --refresh

real    0m1.229s
user    0m0.042s
sys     0m0.031s

root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0 --refresh

real    0m1.116s
user    0m0.028s
sys     0m0.022s
```
I was able to reproduce this now with a larger Ceph cluster.
I found both containers and VMs are affected, but it depends on whether you provide a remote when specifying the target instance. Only when refreshing a container using a remote target does LXD appear to behave correctly. @slapcat can you confirm this using a container with a remote?
```
lxc init ubuntu:jammy c1 && lxc snapshot c1 && lxc snapshot c1
lxc cp c1 c2
-> copies all snapshots including the container, you can check the spawned rsync processes for snap0, snap1 and c1
lxc cp c1 c2 --refresh
-> copies snap0, snap1 and c1
lxc cp c1 remote:c2 --refresh
-> copies only the container diff to latest snapshot

lxc init ubuntu:jammy v1 --vm && lxc snapshot v1 && lxc snapshot v1
lxc cp v1 v2
-> copies snap0, snap1 and v1
lxc cp v1 v2 --refresh
-> copies snap0, snap1 and v1
lxc cp v1 remote:v2 --refresh
-> copies snap0, snap1 and v1
```

@roosterfish I can confirm the same behavior in my environment.
Following up on the post above, I can narrow it down a bit more.
Using `latest/candidate` (which has the fix for https://github.com/canonical/lxd/pull/12632), you can copy/refresh the VM without any issues as long as you never use a remote in between the refreshes:
```
lxc init ubuntu:jammy v1 --vm && lxc snapshot v1 && lxc snapshot v1
lxc cp v1 v2 --refresh --target m2
lxc snapshot v1
lxc cp v1 v2 --refresh --target m2
-> almost instantaneous
lxc cp v1 remote:v2 --refresh --target m2
-> will copy everything: snap0, snap1 and v1
lxc cp v1 remote:v2 --refresh --target m2
-> copies only v1
lxc cp v1 v2 --refresh --target m2
-> will copy everything: snap0, snap1 and v1
lxc cp v1 v2 --refresh --target m2
-> copies only v1
```

I am not sure yet what is happening on the backend side, but it looks like as soon as a remote is used for `v2`, the consumed storage capacity on Ceph also grows significantly. Maybe the comparison then doesn't work anymore when doing `lxc cp v1 v2 --refresh --target m2`, so it needs to sync everything?
Update: This is the expected behavior when you mix and match copies with and without a remote pointing to the same host. It "recovers" itself after the first try.
Update: This looks to be a timing issue. When the snapshots on both ends get compared, one of them isn't using the Unix timestamp: https://github.com/canonical/lxd/blob/main/lxd/instance/drivers/driver_qemu.go#L6965
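For illustration, here is a minimal sketch of the kind of comparison that can go wrong when one side keeps a higher-precision (or differently represented) creation time and the other is truncated to Unix seconds. The helper names are hypothetical and this is not the actual LXD code, just the failure mode hinted at by the linked driver_qemu.go line:

```go
package main

import (
	"fmt"
	"time"
)

// snapshotsMatch is a hypothetical helper: it compares a source and a target
// snapshot by creation time. If one side was stored with sub-second precision
// and the other was truncated to whole seconds, a naive Equal() check fails
// and the refresh falls back to transferring everything.
func snapshotsMatch(source, target time.Time) bool {
	return source.Equal(target)
}

// snapshotsMatchUnix normalizes both sides to Unix seconds before comparing.
func snapshotsMatchUnix(source, target time.Time) bool {
	return source.Unix() == target.Unix()
}

func main() {
	src := time.Date(2024, 1, 8, 12, 0, 0, 123456789, time.UTC) // nanosecond precision
	dst := time.Unix(src.Unix(), 0).UTC()                       // truncated to seconds

	fmt.Println("naive compare:", snapshotsMatch(src, dst))     // false -> full re-sync
	fmt.Println("unix compare: ", snapshotsMatchUnix(src, dst)) // true  -> snapshot reused
}
```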
We are facing two inconsistencies here, in both LXD and the docs. They seem to be the reason why the refresh sometimes feels "slower" than expected.
Unlike ZFS/Btrfs, the Ceph storage driver has to use rsync or a bit-by-bit copy to transfer the data when refreshing, depending on the volume type. Since containers use filesystem volumes, rsync can compare the files and transfer only the delta. For VMs, the block volumes can only be transferred bit by bit.
Refreshing on the same LXD server (or cluster) uses the `--checksum` flag for rsync (see https://github.com/canonical/lxd/blob/main/lxd/rsync/rsync.go#L75), unlike refreshing between LXD servers over the network (see https://github.com/canonical/lxd/blob/main/lxd/rsync/rsync.go#L160).
This means a checksum is computed for each file instead of just checking mod-time and size.
That is the reason why `lxc cp c1 c2 --refresh` takes longer than `lxc cp c1 remote:c2 --refresh`.
Of course, if `remote` is on a completely different server, this could potentially take longer depending on the network connection.
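To make the difference concrete, here is a simplified sketch of the two argument sets, not the actual lxd/rsync/rsync.go code: `--checksum` forces rsync to hash every file on both ends, while the default quick check only compares modification time and size.

```go
package main

import (
	"fmt"
	"os/exec"
)

// rsyncArgs illustrates the two code paths described above. The real argument
// lists in LXD differ; only the effect of the extra flag matters here.
func rsyncArgs(local bool, src, dst string) []string {
	args := []string{"-a", "--delete", src, dst}
	if local {
		// Local refresh: every file is read and hashed on both sides,
		// even if mod-time and size already match.
		args = append([]string{"--checksum"}, args...)
	}
	return args
}

func main() {
	localCmd := exec.Command("rsync", rsyncArgs(true, "/src/", "/dst/")...)
	remoteCmd := exec.Command("rsync", rsyncArgs(false, "/src/", "/dst/")...)

	fmt.Println("local refresh: ", localCmd.String())  // includes --checksum
	fmt.Println("remote refresh:", remoteCmd.String()) // quick check (mtime + size)
}
```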
On the other hand, refreshing VMs requires transferring the entire block volume.
On a local system, `lxc cp v1 v2 --refresh` performs this using `dd if=/path/to/rbd/v1 of=/path/to/rbd/v2`: it reads the data from Ceph and writes it back to the other volume.
In case the refresh is performed using a remote (`lxc cp v1 remote:v2 --refresh`), LXD invokes `genericVFSMigrateVolume()`, which reads the VM's block volume and sends it via websocket to the target by simply reading the file and copying it to the opened connection (see https://github.com/canonical/lxd/blob/main/lxd/storage/drivers/generic_vfs.go#L201).
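In both cases the effect is roughly what the following sketch does: the whole device is read sequentially and written out again, regardless of how much actually changed. The device paths are placeholders and this is not LXD's code, just the shape of the transfer:

```go
package main

import (
	"io"
	"log"
	"os"
)

func main() {
	// Placeholder paths: on a real system these would be the mapped RBD
	// devices of the source and target volumes.
	src, err := os.Open("/dev/rbd/lxd/virtual-machines_v1")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.OpenFile("/dev/rbd/lxd/virtual-machines_v2", os.O_WRONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// io.Copy reads the source device from start to end and writes every
	// byte to the destination. For a remote refresh, dst would instead be
	// the websocket connection to the target server, but the amount of
	// data moved is the same: the full block volume.
	n, err := io.Copy(dst, src)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("copied %d bytes (entire volume, not just the delta)", n)
}
```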
In my opinion, both the local and remote rsync should use the same parameters. It could be that we are missing something else, so I am hesitant to suggest removing `--checksum` from the local rsync to speed it up. Conversely, adding `--checksum` to the remote rsync would slow things down.
The docs for storage drivers should state that the Ceph driver has to perform full copies in the case of a VM refresh. Just because the docs say "optimized volume transfer" doesn't mean the highest possible optimization is reached; it does, however, transfer only the last snapshot (in case there is a new one) or just the instance itself. The docs say:

> Btrfs, ZFS and Ceph RBD have an internal send/receive mechanism that allows for optimized volume transfer.

But from the code it looks like this only applies to ZFS and Btrfs, due to their own send/receive functions. Only when copying instances for the first time are the Ceph RBD `export-diff`/`import-diff` functions used.
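For reference, an optimized refresh would only have to ship the blocks that changed since the last common snapshot. A minimal sketch, shelling out to the `rbd` CLI with made-up pool/image/snapshot names (this is not LXD's current code path):

```go
package main

import (
	"io"
	"log"
	"os/exec"
)

func main() {
	// Hypothetical pool/image/snapshot names, for illustration only.
	// export-diff emits only the blocks that changed between snap0 and
	// snap1 on the source image; import-diff applies that delta to the
	// target image, which already has snap0.
	exportCmd := exec.Command("rbd", "export-diff",
		"--from-snap", "snap0", "lxd/virtual-machines_v1@snap1", "-")
	importCmd := exec.Command("rbd", "import-diff", "-", "lxd/virtual-machines_v2")

	pr, pw := io.Pipe()
	exportCmd.Stdout = pw
	importCmd.Stdin = pr

	if err := importCmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Run the export, then close the pipe so import-diff sees EOF.
	err := exportCmd.Run()
	pw.CloseWithError(err)
	if err != nil {
		log.Fatal(err)
	}

	if err := importCmd.Wait(); err != nil {
		log.Fatal(err)
	}
	log.Println("transferred only the delta between snap0 and snap1")
}
```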
The PRs #12708, #12715 and #12720 address all the findings from this issue. Adding support for optimized Ceph RBD volume refreshes is now tracked here https://github.com/canonical/lxd/issues/12721.
The original issue is described separately in https://github.com/canonical/lxd/issues/12721. We can close this issue as the other findings reported here are already fixed.
Required information
Distribution version: 22.04
lxc info
``` # lxc info config: core.https_address: '[::]:8443' api_extensions: - storage_zfs_remove_snapshots - container_host_shutdown_timeout - container_stop_priority - container_syscall_filtering - auth_pki - container_last_used_at - etag - patch - usb_devices - https_allowed_credentials - image_compression_algorithm - directory_manipulation - container_cpu_time - storage_zfs_use_refquota - storage_lvm_mount_options - network - profile_usedby - container_push - container_exec_recording - certificate_update - container_exec_signal_handling - gpu_devices - container_image_properties - migration_progress - id_map - network_firewall_filtering - network_routes - storage - file_delete - file_append - network_dhcp_expiry - storage_lvm_vg_rename - storage_lvm_thinpool_rename - network_vlan - image_create_aliases - container_stateless_copy - container_only_migration - storage_zfs_clone_copy - unix_device_rename - storage_lvm_use_thinpool - storage_rsync_bwlimit - network_vxlan_interface - storage_btrfs_mount_options - entity_description - image_force_refresh - storage_lvm_lv_resizing - id_map_base - file_symlinks - container_push_target - network_vlan_physical - storage_images_delete - container_edit_metadata - container_snapshot_stateful_migration - storage_driver_ceph - storage_ceph_user_name - resource_limits - storage_volatile_initial_source - storage_ceph_force_osd_reuse - storage_block_filesystem_btrfs - resources - kernel_limits - storage_api_volume_rename - macaroon_authentication - network_sriov - console - restrict_devlxd - migration_pre_copy - infiniband - maas_network - devlxd_events - proxy - network_dhcp_gateway - file_get_symlink - network_leases - unix_device_hotplug - storage_api_local_volume_handling - operation_description - clustering - event_lifecycle - storage_api_remote_volume_handling - nvidia_runtime - container_mount_propagation - container_backup - devlxd_images - container_local_cross_pool_handling - proxy_unix - proxy_udp - clustering_join - proxy_tcp_udp_multi_port_handling - network_state - proxy_unix_dac_properties - container_protection_delete - unix_priv_drop - pprof_http - proxy_haproxy_protocol - network_hwaddr - proxy_nat - network_nat_order - container_full - candid_authentication - backup_compression - candid_config - nvidia_runtime_config - storage_api_volume_snapshots - storage_unmapped - projects - candid_config_key - network_vxlan_ttl - container_incremental_copy - usb_optional_vendorid - snapshot_scheduling - snapshot_schedule_aliases - container_copy_project - clustering_server_address - clustering_image_replication - container_protection_shift - snapshot_expiry - container_backup_override_pool - snapshot_expiry_creation - network_leases_location - resources_cpu_socket - resources_gpu - resources_numa - kernel_features - id_map_current - event_location - storage_api_remote_volume_snapshots - network_nat_address - container_nic_routes - rbac - cluster_internal_copy - seccomp_notify - lxc_features - container_nic_ipvlan - network_vlan_sriov - storage_cephfs - container_nic_ipfilter - resources_v2 - container_exec_user_group_cwd - container_syscall_intercept - container_disk_shift - storage_shifted - resources_infiniband - daemon_storage - instances - image_types - resources_disk_sata - clustering_roles - images_expiry - resources_network_firmware - backup_compression_algorithm - ceph_data_pool_name - container_syscall_intercept_mount - compression_squashfs - container_raw_mount - container_nic_routed - container_syscall_intercept_mount_fuse - 
container_disk_ceph - virtual-machines - image_profiles - clustering_architecture - resources_disk_id - storage_lvm_stripes - vm_boot_priority - unix_hotplug_devices - api_filtering - instance_nic_network - clustering_sizing - firewall_driver - projects_limits - container_syscall_intercept_hugetlbfs - limits_hugepages - container_nic_routed_gateway - projects_restrictions - custom_volume_snapshot_expiry - volume_snapshot_scheduling - trust_ca_certificates - snapshot_disk_usage - clustering_edit_roles - container_nic_routed_host_address - container_nic_ipvlan_gateway - resources_usb_pci - resources_cpu_threads_numa - resources_cpu_core_die - api_os - container_nic_routed_host_table - container_nic_ipvlan_host_table - container_nic_ipvlan_mode - resources_system - images_push_relay - network_dns_search - container_nic_routed_limits - instance_nic_bridged_vlan - network_state_bond_bridge - usedby_consistency - custom_block_volumes - clustering_failure_domains - resources_gpu_mdev - console_vga_type - projects_limits_disk - network_type_macvlan - network_type_sriov - container_syscall_intercept_bpf_devices - network_type_ovn - projects_networks - projects_networks_restricted_uplinks - custom_volume_backup - backup_override_name - storage_rsync_compression - network_type_physical - network_ovn_external_subnets - network_ovn_nat - network_ovn_external_routes_remove - tpm_device_type - storage_zfs_clone_copy_rebase - gpu_mdev - resources_pci_iommu - resources_network_usb - resources_disk_address - network_physical_ovn_ingress_mode - network_ovn_dhcp - network_physical_routes_anycast - projects_limits_instances - network_state_vlan - instance_nic_bridged_port_isolation - instance_bulk_state_change - network_gvrp - instance_pool_move - gpu_sriov - pci_device_type - storage_volume_state - network_acl - migration_stateful - disk_state_quota - storage_ceph_features - projects_compression - projects_images_remote_cache_expiry - certificate_project - network_ovn_acl - projects_images_auto_update - projects_restricted_cluster_target - images_default_architecture - network_ovn_acl_defaults - gpu_mig - project_usage - network_bridge_acl - warnings - projects_restricted_backups_and_snapshots - clustering_join_token - clustering_description - server_trusted_proxy - clustering_update_cert - storage_api_project - server_instance_driver_operational - server_supported_storage_drivers - event_lifecycle_requestor_address - resources_gpu_usb - clustering_evacuation - network_ovn_nat_address - network_bgp - network_forward - custom_volume_refresh - network_counters_errors_dropped - metrics - image_source_project - clustering_config - network_peer - linux_sysctl - network_dns - ovn_nic_acceleration - certificate_self_renewal - instance_project_move - storage_volume_project_move - cloud_init - network_dns_nat - database_leader - instance_all_projects - clustering_groups - ceph_rbd_du - instance_get_full - qemu_metrics - gpu_mig_uuid - event_project - clustering_evacuation_live - instance_allow_inconsistent_copy - network_state_ovn - storage_volume_api_filtering - image_restrictions - storage_zfs_export - network_dns_records - storage_zfs_reserve_space - network_acl_log - storage_zfs_blocksize - metrics_cpu_seconds - instance_snapshot_never - certificate_token - instance_nic_routed_neighbor_probe - event_hub - agent_nic_config - projects_restricted_intercept - metrics_authentication - images_target_project - cluster_migration_inconsistent_copy - cluster_ovn_chassis - container_syscall_intercept_sched_setscheduler - 
storage_lvm_thinpool_metadata_size - storage_volume_state_total - instance_file_head - instances_nic_host_name - image_copy_profile - container_syscall_intercept_sysinfo - clustering_evacuation_mode - resources_pci_vpd - qemu_raw_conf - storage_cephfs_fscache - network_load_balancer - vsock_api - instance_ready_state - network_bgp_holdtime - storage_volumes_all_projects - metrics_memory_oom_total - storage_buckets - storage_buckets_create_credentials - metrics_cpu_effective_total - projects_networks_restricted_access - storage_buckets_local - loki - acme - internal_metrics - cluster_join_token_expiry - remote_token_expiry - init_preseed - storage_volumes_created_at - cpu_hotplug - projects_networks_zones - network_txqueuelen - cluster_member_state - instances_placement_scriptlet - storage_pool_source_wipe - zfs_block_mode - instance_generation_id - disk_io_cache - amd_sev - storage_pool_loop_resize - migration_vm_live - ovn_nic_nesting - oidc - network_ovn_l3only - ovn_nic_acceleration_vdpa - cluster_healing - instances_state_total - auth_user - security_csm - instances_rebuild - numa_cpu_placement - custom_volume_iso - network_allocations - storage_api_remote_volume_snapshot_copy - zfs_delegate - operations_get_query_all_projects - metadata_configuration - syslog_socket - event_lifecycle_name_and_project - instances_nic_limits_priority - disk_initial_volume_configuration - operation_wait api_status: stable api_version: "1.0" auth: trusted public: false auth_methods: - tls auth_user_name: root auth_user_method: unix environment: addresses: - 10.10.0.110:8443 - 10.78.194.1:8443 - '[fd42:6c89:1917:3e3e::1]:8443' architectures: - x86_64 - i686 certificate: | -----BEGIN CERTIFICATE----- MIIB2jCCAWCgAwIBAgIQM1z+YhrtOiajMPCFhkqbjDAKBggqhkjOPQQDAzAhMQww CgYDVQQKEwNMWEQxETAPBgNVBAMMCHJvb3RAbWMxMB4XDTIzMTIxMTE1MDAwMVoX DTMzMTIwODE1MDAwMVowITEMMAoGA1UEChMDTFhEMREwDwYDVQQDDAhyb290QG1j MTB2MBAGByqGSM49AgEGBSuBBAAiA2IABEp98HqcVbkZLmd+q+5+7Gj5Qc2ZHcdH 88FFXk6JHqurI5ceVrANzu+3/0a0vs0izclx7vtvB0exycFfUHFh0YVqMFo6IctZ H5tQmASy94nqlgSVo6ajt9LLf1Qj+WF5AaNdMFswDgYDVR0PAQH/BAQDAgWgMBMG A1UdJQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwJgYDVR0RBB8wHYIDbWMx hwR/AAABhxAAAAAAAAAAAAAAAAAAAAABMAoGCCqGSM49BAMDA2gAMGUCMQCb1ZuS ZW2tReo0KWtDqXb+k+67IDRMce/G4B3OkV+X16WN9WMlyW68SwcwJ9KJQPsCMDQI 1v9AB/VyFzXPCm9KNJC+FzdDJQ+Vj8drRJvih0NON01uW8DLpsvd+ghgU1NhUA== -----END CERTIFICATE----- certificate_fingerprint: f660df1ce5d8f5f0b2c6916651db9e76b183deed39ebf6b54d5e099bf0ab4db4 driver: lxc | qemu driver_version: 5.0.3 | 8.1.1 firewall: nftables kernel: Linux kernel_architecture: x86_64 kernel_features: idmapped_mounts: "true" netnsid_getifaddrs: "true" seccomp_listener: "true" seccomp_listener_continue: "true" shiftfs: "false" uevent_injection: "true" unpriv_fscaps: "true" kernel_version: 5.15.0-89-generic lxc_features: cgroup2: "true" core_scheduling: "true" devpts_fd: "true" idmapped_mounts_v2: "true" mount_injection_file: "true" network_gateway_device_route: "true" network_ipvlan: "true" network_l2proxy: "true" network_phys_macvlan_mtu: "true" network_veth_router: "true" pidfd: "true" seccomp_allow_deny_syntax: "true" seccomp_notify: "true" seccomp_proxy_send_notify_fd: "true" os_name: Ubuntu os_version: "22.04" project: default server: lxd server_clustered: false server_event_mode: full-mesh server_name: mc1 server_pid: 1069 server_version: "5.19" storage: ceph storage_version: 17.2.6 storage_supported_drivers: - name: ceph version: 17.2.6 remote: true - name: cephfs version: 17.2.6 remote: true - name: cephobject version: 17.2.6 
remote: true - name: dir version: "1" remote: false - name: lvm version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0 remote: false - name: zfs version: 2.1.5-1ubuntu6~22.04.1 remote: false - name: btrfs version: 5.16.2 remote: false
```

Issue description
When doing `lxc cp --refresh` on Ceph RBD-backed instances, the entire disk is transferred instead of just the delta between snapshots. This happens regardless of changes to the filesystem or snapshots of the source instance. It does not happen for containers or for other storage backends like ZFS. This was tested when copying to a remote, because at the time bug #12631 prevented me from testing copies to a cluster member node.

Steps to reproduce
```
lxc launch images:debian/12 --vm v1
lxc snapshot v1
lxc cp v1 remote:v1
lxc cp v1 remote:v1 --refresh
lxc snapshot v1
lxc cp v1 remote:v1 --refresh
lxc cp v1 remote:v1 --refresh
```