Hi, how do you determine that `--refresh` is ignored?
Both the time taken and the live reporting of the data transferred, which shows the full size of the disk being transferred each time. I used `time` to measure the exact transfer times between refreshes and compared them with the same commands on a ZFS-backed pool. The ZFS pool behaves very differently: repeated `--refresh` copies take almost no time at all, whereas the same commands on Ceph RBD transfer the full disk each time.

Here are some timings from a test of copying an instance to another node in the same cluster. In both cases I used the `debian/11` image. I took a second snapshot on the source instance before the last refresh; no other changes were made.
Ceph RBD:

```
root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1

real    1m3.129s
user    0m0.055s
sys     0m0.063s

root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1 --refresh

real    14m34.771s
user    0m0.288s
sys     0m0.337s

root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1 --refresh

real    15m11.626s
user    0m0.348s
sys     0m0.302s

root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1 --refresh

real    36m10.129s
user    0m0.679s
sys     0m0.996s
```
ZFS:

```
root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0

real    0m8.293s
user    0m0.038s
sys     0m0.011s

root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0 --refresh

real    0m1.577s
user    0m0.049s
sys     0m0.081s

root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0 --refresh

real    0m1.229s
user    0m0.042s
sys     0m0.031s

root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0 --refresh

real    0m1.116s
user    0m0.028s
sys     0m0.022s
```
I was able to reproduce this now with a larger Ceph cluster.
I found both containers and VMs are affected, but it depends on whether you provide a remote when specifying the target instance. Only when refreshing a container using a remote target does LXD appear to behave correctly. @slapcat can you confirm this using a container with a remote?
```
lxc init ubuntu:jammy c1 && lxc snapshot c1 && lxc snapshot c1
lxc cp c1 c2
-> copies all snapshots including the container, you can check the spawned rsync processes for snap0, snap1 and c1
lxc cp c1 c2 --refresh
-> copies snap0, snap1 and c1
lxc cp c1 remote:c2 --refresh
-> copies only the container diff to latest snapshot

lxc init ubuntu:jammy v1 --vm && lxc snapshot v1 && lxc snapshot v1
lxc cp v1 v2
-> copies snap0, snap1 and v1
lxc cp v1 v2 --refresh
-> copies snap0, snap1 and v1
lxc cp v1 remote:v2 --refresh
-> copies snap0, snap1 and v1
```

@roosterfish I can confirm the same behavior in my environment.
Following up on the post above, I can narrow it down a bit more.
Using `latest/candidate` (which has the fix for https://github.com/canonical/lxd/pull/12632), you can copy/refresh the VM without any issues as long as you never use a remote in between the refreshes:
```
lxc init ubuntu:jammy v1 --vm && lxc snapshot v1 && lxc snapshot v1
lxc cp v1 v2 --refresh --target m2
lxc snapshot v1
lxc cp v1 v2 --refresh --target m2
-> almost instantaneous
lxc cp v1 remote:v2 --refresh --target m2
-> will copy everything: snap0, snap1 and v1
lxc cp v1 remote:v2 --refresh --target m2
-> copies only v1
lxc cp v1 v2 --refresh --target m2
-> will copy everything: snap0, snap1 and v1
lxc cp v1 v2 --refresh --target m2
-> copies only v1
```

I am not sure yet what is happening on the backend side, but it looks like as soon as a remote is used for `v2`, the consumed storage capacity on Ceph also grows significantly. Maybe the comparison then doesn't work anymore when doing `lxc cp v1 v2 --refresh --target m2`, so it needs to sync everything?
Update: This is the expected behavior when you mix and match copies with and without a remote pointing to the same host. It "recovers" itself after the first try.
Update: This looks to be a timing issue. When the snapshots on both ends get compared, one of them isn't using the Unix timestamp: https://github.com/canonical/lxd/blob/main/lxd/instance/drivers/driver_qemu.go#L6965
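For illustration, here is a minimal sketch of the kind of comparison that can go wrong when one side keeps a higher-precision (or differently represented) creation time and the other is truncated to Unix seconds. The helper names are hypothetical and this is not the actual LXD code, just the failure mode hinted at by the linked driver_qemu.go line:

```go
package main

import (
	"fmt"
	"time"
)

// snapshotsMatch is a hypothetical helper: it compares a source and a target
// snapshot by creation time. If one side was stored with sub-second precision
// and the other was truncated to whole seconds, a naive Equal() check fails
// and the refresh falls back to transferring everything.
func snapshotsMatch(source, target time.Time) bool {
	return source.Equal(target)
}

// snapshotsMatchUnix normalizes both sides to Unix seconds before comparing.
func snapshotsMatchUnix(source, target time.Time) bool {
	return source.Unix() == target.Unix()
}

func main() {
	src := time.Date(2024, 1, 8, 12, 0, 0, 123456789, time.UTC) // nanosecond precision
	dst := time.Unix(src.Unix(), 0).UTC()                       // truncated to seconds

	fmt.Println("naive compare:", snapshotsMatch(src, dst))     // false -> full re-sync
	fmt.Println("unix compare: ", snapshotsMatchUnix(src, dst)) // true  -> snapshot reused
}
```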
We are facing two inconsistencies here, in both LXD and the docs. They seem to be the reason why the refresh sometimes feels "slower" than expected.
Unlike ZFS/Btrfs, the Ceph storage driver has to use rsync or a bit-by-bit copy to transfer the data when refreshing, depending on the volume type. Since containers use filesystem volumes, rsync can compare the files and transfer only the delta. For VMs, the block volumes can only be transferred bit by bit.
Refreshing on the same LXD server (or cluster) uses the `--checksum` flag for rsync (see https://github.com/canonical/lxd/blob/main/lxd/rsync/rsync.go#L75), unlike refreshing between LXD servers over the network (see https://github.com/canonical/lxd/blob/main/lxd/rsync/rsync.go#L160).
This means a checksum is computed for each file instead of just checking mod-time and size.
That is the reason why `lxc cp c1 c2 --refresh` takes longer than `lxc cp c1 remote:c2 --refresh`.
Of course, if `remote` is on a completely different server, this could potentially take longer depending on the network connection.
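To make the difference concrete, here is a simplified sketch of the two argument sets, not the actual lxd/rsync/rsync.go code: `--checksum` forces rsync to hash every file on both ends, while the default quick check only compares modification time and size.

```go
package main

import (
	"fmt"
	"os/exec"
)

// rsyncArgs illustrates the two code paths described above. The real argument
// lists in LXD differ; only the effect of the extra flag matters here.
func rsyncArgs(local bool, src, dst string) []string {
	args := []string{"-a", "--delete", src, dst}
	if local {
		// Local refresh: every file is read and hashed on both sides,
		// even if mod-time and size already match.
		args = append([]string{"--checksum"}, args...)
	}
	return args
}

func main() {
	localCmd := exec.Command("rsync", rsyncArgs(true, "/src/", "/dst/")...)
	remoteCmd := exec.Command("rsync", rsyncArgs(false, "/src/", "/dst/")...)

	fmt.Println("local refresh: ", localCmd.String())  // includes --checksum
	fmt.Println("remote refresh:", remoteCmd.String()) // quick check (mtime + size)
}
```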
On the other hand, refreshing VMs requires transferring the entire block volume.
On a local system, `lxc cp v1 v2 --refresh` performs this using `dd if=/path/to/rbd/v1 of=/path/to/rbd/v2`: it reads the data from Ceph and writes it back to the other volume.
In case the refresh is performed using a remote (`lxc cp v1 remote:v2 --refresh`), LXD invokes `genericVFSMigrateVolume()`, which reads the VM's block volume and sends it via websocket to the target by simply reading the file and copying it to the opened connection (see https://github.com/canonical/lxd/blob/main/lxd/storage/drivers/generic_vfs.go#L201).
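In both cases the effect is roughly what the following sketch does: the whole device is read sequentially and written out again, regardless of how much actually changed. The device paths are placeholders and this is not LXD's code, just the shape of the transfer:

```go
package main

import (
	"io"
	"log"
	"os"
)

func main() {
	// Placeholder paths: on a real system these would be the mapped RBD
	// devices of the source and target volumes.
	src, err := os.Open("/dev/rbd/lxd/virtual-machines_v1")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.OpenFile("/dev/rbd/lxd/virtual-machines_v2", os.O_WRONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// io.Copy reads the source device from start to end and writes every
	// byte to the destination. For a remote refresh, dst would instead be
	// the websocket connection to the target server, but the amount of
	// data moved is the same: the full block volume.
	n, err := io.Copy(dst, src)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("copied %d bytes (entire volume, not just the delta)", n)
}
```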
In my opinion, both the local and remote rsync should use the same parameters. It could be that we are missing something else, so I am hesitant to suggest removing `--checksum` from the local rsync to speed it up. Conversely, adding `--checksum` to the remote rsync would slow things down.
The docs for storage drivers should state that the Ceph driver has to perform full copies in the case of a VM refresh. Just because the docs say "optimized volume transfer" doesn't mean the highest possible optimization is reached; it does, however, transfer only the last snapshot (in case there is a new one) or just the instance itself. The docs say:

> Btrfs, ZFS and Ceph RBD have an internal send/receive mechanism that allows for optimized volume transfer.

But from the code it looks like this only applies to ZFS and Btrfs, due to their own send/receive functions. Only when copying instances for the first time are the Ceph RBD `export-diff`/`import-diff` functions used.
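For reference, an optimized refresh would only have to ship the blocks that changed since the last common snapshot. A minimal sketch, shelling out to the `rbd` CLI with made-up pool/image/snapshot names (this is not LXD's current code path):

```go
package main

import (
	"io"
	"log"
	"os/exec"
)

func main() {
	// Hypothetical pool/image/snapshot names, for illustration only.
	// export-diff emits only the blocks that changed between snap0 and
	// snap1 on the source image; import-diff applies that delta to the
	// target image, which already has snap0.
	exportCmd := exec.Command("rbd", "export-diff",
		"--from-snap", "snap0", "lxd/virtual-machines_v1@snap1", "-")
	importCmd := exec.Command("rbd", "import-diff", "-", "lxd/virtual-machines_v2")

	pr, pw := io.Pipe()
	exportCmd.Stdout = pw
	importCmd.Stdin = pr

	if err := importCmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Run the export, then close the pipe so import-diff sees EOF.
	err := exportCmd.Run()
	pw.CloseWithError(err)
	if err != nil {
		log.Fatal(err)
	}

	if err := importCmd.Wait(); err != nil {
		log.Fatal(err)
	}
	log.Println("transferred only the delta between snap0 and snap1")
}
```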
The PRs #12708, #12715 and #12720 address all the findings from this issue. Adding support for optimized Ceph RBD volume refreshes is now tracked here https://github.com/canonical/lxd/issues/12721.
The original issue is described separately in https://github.com/canonical/lxd/issues/12721. We can close this issue as the other findings reported here are already fixed.
Required information
Distribution version: 22.04
lxc info
``` # lxc info config: core.https_address: '[::]:8443' api_extensions: - storage_zfs_remove_snapshots - container_host_shutdown_timeout - container_stop_priority - container_syscall_filtering - auth_pki - container_last_used_at - etag - patch - usb_devices - https_allowed_credentials - image_compression_algorithm - directory_manipulation - container_cpu_time - storage_zfs_use_refquota - storage_lvm_mount_options - network - profile_usedby - container_push - container_exec_recording - certificate_update - container_exec_signal_handling - gpu_devices - container_image_properties - migration_progress - id_map - network_firewall_filtering - network_routes - storage - file_delete - file_append - network_dhcp_expiry - storage_lvm_vg_rename - storage_lvm_thinpool_rename - network_vlan - image_create_aliases - container_stateless_copy - container_only_migration - storage_zfs_clone_copy - unix_device_rename - storage_lvm_use_thinpool - storage_rsync_bwlimit - network_vxlan_interface - storage_btrfs_mount_options - entity_description - image_force_refresh - storage_lvm_lv_resizing - id_map_base - file_symlinks - container_push_target - network_vlan_physical - storage_images_delete - container_edit_metadata - container_snapshot_stateful_migration - storage_driver_ceph - storage_ceph_user_name - resource_limits - storage_volatile_initial_source - storage_ceph_force_osd_reuse - storage_block_filesystem_btrfs - resources - kernel_limits - storage_api_volume_rename - macaroon_authentication - network_sriov - console - restrict_devlxd - migration_pre_copy - infiniband - maas_network - devlxd_events - proxy - network_dhcp_gateway - file_get_symlink - network_leases - unix_device_hotplug - storage_api_local_volume_handling - operation_description - clustering - event_lifecycle - storage_api_remote_volume_handling - nvidia_runtime - container_mount_propagation - container_backup - devlxd_images - container_local_cross_pool_handling - proxy_unix - proxy_udp - clustering_join - proxy_tcp_udp_multi_port_handling - network_state - proxy_unix_dac_properties - container_protection_delete - unix_priv_drop - pprof_http - proxy_haproxy_protocol - network_hwaddr - proxy_nat - network_nat_order - container_full - candid_authentication - backup_compression - candid_config - nvidia_runtime_config - storage_api_volume_snapshots - storage_unmapped - projects - candid_config_key - network_vxlan_ttl - container_incremental_copy - usb_optional_vendorid - snapshot_scheduling - snapshot_schedule_aliases - container_copy_project - clustering_server_address - clustering_image_replication - container_protection_shift - snapshot_expiry - container_backup_override_pool - snapshot_expiry_creation - network_leases_location - resources_cpu_socket - resources_gpu - resources_numa - kernel_features - id_map_current - event_location - storage_api_remote_volume_snapshots - network_nat_address - container_nic_routes - rbac - cluster_internal_copy - seccomp_notify - lxc_features - container_nic_ipvlan - network_vlan_sriov - storage_cephfs - container_nic_ipfilter - resources_v2 - container_exec_user_group_cwd - container_syscall_intercept - container_disk_shift - storage_shifted - resources_infiniband - daemon_storage - instances - image_types - resources_disk_sata - clustering_roles - images_expiry - resources_network_firmware - backup_compression_algorithm - ceph_data_pool_name - container_syscall_intercept_mount - compression_squashfs - container_raw_mount - container_nic_routed - container_syscall_intercept_mount_fuse - 
container_disk_ceph - virtual-machines - image_profiles - clustering_architecture - resources_disk_id - storage_lvm_stripes - vm_boot_priority - unix_hotplug_devices - api_filtering - instance_nic_network - clustering_sizing - firewall_driver - projects_limits - container_syscall_intercept_hugetlbfs - limits_hugepages - container_nic_routed_gateway - projects_restrictions - custom_volume_snapshot_expiry - volume_snapshot_scheduling - trust_ca_certificates - snapshot_disk_usage - clustering_edit_roles - container_nic_routed_host_address - container_nic_ipvlan_gateway - resources_usb_pci - resources_cpu_threads_numa - resources_cpu_core_die - api_os - container_nic_routed_host_table - container_nic_ipvlan_host_table - container_nic_ipvlan_mode - resources_system - images_push_relay - network_dns_search - container_nic_routed_limits - instance_nic_bridged_vlan - network_state_bond_bridge - usedby_consistency - custom_block_volumes - clustering_failure_domains - resources_gpu_mdev - console_vga_type - projects_limits_disk - network_type_macvlan - network_type_sriov - container_syscall_intercept_bpf_devices - network_type_ovn - projects_networks - projects_networks_restricted_uplinks - custom_volume_backup - backup_override_name - storage_rsync_compression - network_type_physical - network_ovn_external_subnets - network_ovn_nat - network_ovn_external_routes_remove - tpm_device_type - storage_zfs_clone_copy_rebase - gpu_mdev - resources_pci_iommu - resources_network_usb - resources_disk_address - network_physical_ovn_ingress_mode - network_ovn_dhcp - network_physical_routes_anycast - projects_limits_instances - network_state_vlan - instance_nic_bridged_port_isolation - instance_bulk_state_change - network_gvrp - instance_pool_move - gpu_sriov - pci_device_type - storage_volume_state - network_acl - migration_stateful - disk_state_quota - storage_ceph_features - projects_compression - projects_images_remote_cache_expiry - certificate_project - network_ovn_acl - projects_images_auto_update - projects_restricted_cluster_target - images_default_architecture - network_ovn_acl_defaults - gpu_mig - project_usage - network_bridge_acl - warnings - projects_restricted_backups_and_snapshots - clustering_join_token - clustering_description - server_trusted_proxy - clustering_update_cert - storage_api_project - server_instance_driver_operational - server_supported_storage_drivers - event_lifecycle_requestor_address - resources_gpu_usb - clustering_evacuation - network_ovn_nat_address - network_bgp - network_forward - custom_volume_refresh - network_counters_errors_dropped - metrics - image_source_project - clustering_config - network_peer - linux_sysctl - network_dns - ovn_nic_acceleration - certificate_self_renewal - instance_project_move - storage_volume_project_move - cloud_init - network_dns_nat - database_leader - instance_all_projects - clustering_groups - ceph_rbd_du - instance_get_full - qemu_metrics - gpu_mig_uuid - event_project - clustering_evacuation_live - instance_allow_inconsistent_copy - network_state_ovn - storage_volume_api_filtering - image_restrictions - storage_zfs_export - network_dns_records - storage_zfs_reserve_space - network_acl_log - storage_zfs_blocksize - metrics_cpu_seconds - instance_snapshot_never - certificate_token - instance_nic_routed_neighbor_probe - event_hub - agent_nic_config - projects_restricted_intercept - metrics_authentication - images_target_project - cluster_migration_inconsistent_copy - cluster_ovn_chassis - container_syscall_intercept_sched_setscheduler - 
storage_lvm_thinpool_metadata_size - storage_volume_state_total - instance_file_head - instances_nic_host_name - image_copy_profile - container_syscall_intercept_sysinfo - clustering_evacuation_mode - resources_pci_vpd - qemu_raw_conf - storage_cephfs_fscache - network_load_balancer - vsock_api - instance_ready_state - network_bgp_holdtime - storage_volumes_all_projects - metrics_memory_oom_total - storage_buckets - storage_buckets_create_credentials - metrics_cpu_effective_total - projects_networks_restricted_access - storage_buckets_local - loki - acme - internal_metrics - cluster_join_token_expiry - remote_token_expiry - init_preseed - storage_volumes_created_at - cpu_hotplug - projects_networks_zones - network_txqueuelen - cluster_member_state - instances_placement_scriptlet - storage_pool_source_wipe - zfs_block_mode - instance_generation_id - disk_io_cache - amd_sev - storage_pool_loop_resize - migration_vm_live - ovn_nic_nesting - oidc - network_ovn_l3only - ovn_nic_acceleration_vdpa - cluster_healing - instances_state_total - auth_user - security_csm - instances_rebuild - numa_cpu_placement - custom_volume_iso - network_allocations - storage_api_remote_volume_snapshot_copy - zfs_delegate - operations_get_query_all_projects - metadata_configuration - syslog_socket - event_lifecycle_name_and_project - instances_nic_limits_priority - disk_initial_volume_configuration - operation_wait api_status: stable api_version: "1.0" auth: trusted public: false auth_methods: - tls auth_user_name: root auth_user_method: unix environment: addresses: - 10.10.0.110:8443 - 10.78.194.1:8443 - '[fd42:6c89:1917:3e3e::1]:8443' architectures: - x86_64 - i686 certificate: | -----BEGIN CERTIFICATE----- MIIB2jCCAWCgAwIBAgIQM1z+YhrtOiajMPCFhkqbjDAKBggqhkjOPQQDAzAhMQww CgYDVQQKEwNMWEQxETAPBgNVBAMMCHJvb3RAbWMxMB4XDTIzMTIxMTE1MDAwMVoX DTMzMTIwODE1MDAwMVowITEMMAoGA1UEChMDTFhEMREwDwYDVQQDDAhyb290QG1j MTB2MBAGByqGSM49AgEGBSuBBAAiA2IABEp98HqcVbkZLmd+q+5+7Gj5Qc2ZHcdH 88FFXk6JHqurI5ceVrANzu+3/0a0vs0izclx7vtvB0exycFfUHFh0YVqMFo6IctZ H5tQmASy94nqlgSVo6ajt9LLf1Qj+WF5AaNdMFswDgYDVR0PAQH/BAQDAgWgMBMG A1UdJQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwJgYDVR0RBB8wHYIDbWMx hwR/AAABhxAAAAAAAAAAAAAAAAAAAAABMAoGCCqGSM49BAMDA2gAMGUCMQCb1ZuS ZW2tReo0KWtDqXb+k+67IDRMce/G4B3OkV+X16WN9WMlyW68SwcwJ9KJQPsCMDQI 1v9AB/VyFzXPCm9KNJC+FzdDJQ+Vj8drRJvih0NON01uW8DLpsvd+ghgU1NhUA== -----END CERTIFICATE----- certificate_fingerprint: f660df1ce5d8f5f0b2c6916651db9e76b183deed39ebf6b54d5e099bf0ab4db4 driver: lxc | qemu driver_version: 5.0.3 | 8.1.1 firewall: nftables kernel: Linux kernel_architecture: x86_64 kernel_features: idmapped_mounts: "true" netnsid_getifaddrs: "true" seccomp_listener: "true" seccomp_listener_continue: "true" shiftfs: "false" uevent_injection: "true" unpriv_fscaps: "true" kernel_version: 5.15.0-89-generic lxc_features: cgroup2: "true" core_scheduling: "true" devpts_fd: "true" idmapped_mounts_v2: "true" mount_injection_file: "true" network_gateway_device_route: "true" network_ipvlan: "true" network_l2proxy: "true" network_phys_macvlan_mtu: "true" network_veth_router: "true" pidfd: "true" seccomp_allow_deny_syntax: "true" seccomp_notify: "true" seccomp_proxy_send_notify_fd: "true" os_name: Ubuntu os_version: "22.04" project: default server: lxd server_clustered: false server_event_mode: full-mesh server_name: mc1 server_pid: 1069 server_version: "5.19" storage: ceph storage_version: 17.2.6 storage_supported_drivers: - name: ceph version: 17.2.6 remote: true - name: cephfs version: 17.2.6 remote: true - name: cephobject version: 17.2.6 
remote: true - name: dir version: "1" remote: false - name: lvm version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0 remote: false - name: zfs version: 2.1.5-1ubuntu6~22.04.1 remote: false - name: btrfs version: 5.16.2 remote: false
```

Issue description
When doing `lxc cp --refresh` on Ceph RBD-backed instances, the entire disk is transferred instead of just the delta between snapshots. This happens regardless of changes to the filesystem or snapshots of the source instance. It does not happen for containers or for other storage backends like ZFS. This was tested when copying to a remote, because at the time bug #12631 prevented me from testing copies to a cluster member node.

Steps to reproduce
```
lxc launch images:debian/12 --vm v1
lxc snapshot v1
lxc cp v1 remote:v1
lxc cp v1 remote:v1 --refresh
lxc snapshot v1
lxc cp v1 remote:v1 --refresh
lxc cp v1 remote:v1 --refresh
```