Closed: MaxRower closed this issue 1 year ago.
This is most likely due to the addition of the optimized transfer feature.
To achieve differential transfers, use lxc snapshot to take a snapshot; then, when transferring with --refresh,
only the differences between the last snapshot and the current disk contents are transferred.
Does this help speed things up?
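A rough sketch of that workflow (assuming a remote named backup and a container named c1; both names are illustrative):

# Take a snapshot on the source so later refreshes have a common base.
lxc snapshot c1 base
# The initial copy sends everything, including the snapshot.
lxc copy c1 backup:c1
# Subsequent refreshes should only send the differences since the common snapshot.
lxc copy c1 backup:c1 --refresh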
Well, I don't see what the "optimized transfer feature" is, if it transfers all data every time ;) Sadly, there is no real documentation on how lxc copy works, not even a manpage, just --help.

I'd like to use local snapshots only for local backups before upgrades or other critical changes inside a container, and keep the real longer-term backups and snapshots on the backup server only. I don't understand how a snapshot on the source would help. Should I copy that snapshot to the target and delete it afterwards? Since I am using lxc copy to back up to a dedicated backup server AND to replicate to another hot-standby server as well, at different times, I can't imagine how this should work.

In the meantime, I downgraded one backup server to 4.0.9; I just had to restore the 4.0.9 config from the last backup of it. Strangely, the lxc remote config was not included there.
Well, I don't see what the "optimized transfer feature" is, if it transfers all data every time ;)
For ZFS and BTRFS it uses the native (optimized) transfer mechanisms rather than rsync.
Without the --instance-only
option, if you had snapshots on the source, then when running lxc copy --refresh
it would copy those to the target (only if they were missing, and only using the differential from the previous snapshot) and then transfer the main volume (only as a differential from the latest snapshot).
But with the --instance-only
option I'm not sure what the expected behaviour should be here for a storage driver that supports optimized transfers.
@stgraber @monstermunchkin do you think that when doing an --instance-only refresh between optimized storage pools it should use rsync rather than transfer the whole volume every time?
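For reference, the two forms being compared here (remote and instance names are illustrative):

# Refresh including snapshots: missing snapshots are copied first, then the
# main volume is sent as a differential from the latest snapshot.
lxc copy c1 backup:c1 --refresh

# Refresh without snapshots: only the main volume is considered; this is the
# case whose behaviour is in question.
lxc copy c1 backup:c1 --refresh --instance-only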
Sadly, there is no real documentation on how lxc copy works, not even a manpage, just --help.
Here's a bit about it in the release announcement:
and here:
@ru-fu we could probably do with updating https://linuxcontainers.org/lxd/docs/master/reference/storage_drivers/#storage-optimized-instance-transfer with a section describing how this works for instance refreshes.
Yes, I did read those already. But that only makes sense if you want identical containers on all servers, including their snapshots? I wouldn't want all the daily snapshots on my regular servers, only on the backup server; no upgrade-related snapshots on the backup server; and no snapshots at all on the hot standby. Since an lxc copy --instance-only deletes all snapshots on the target, I do daily snapshotting on the backup server with btrfs snapshot to another directory not touched by lxd. It's only important that it stays deduplicated for as long as possible. Restoring will just involve moving those snapshots around.
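A rough sketch of that kind of out-of-band snapshotting on the backup server (the pool path and target directory are hypothetical and depend on the local setup):

# Read-only BTRFS snapshot of the container volume into a directory LXD does
# not manage, so a later lxc copy --refresh cannot remove it.
btrfs subvolume snapshot -r /var/lib/lxd/storage-pools/lxd/containers/container /backup-history/container-$(date +%F)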
@monstermunchkin aside from the question about whether LXD should use rsync when using --instance-only
with --refresh
, I can see there also appears to be a bug with normal pool -> pool optimized refresh for BTRFS.
First let's see what ZFS does:
lxc storage create zfs1 zfs
lxc storage create zfs2 zfs
lxc launch images:ubuntu/jammy c1 -s zfs1
# Perform initial full copy.
time lxc copy c1 c2 --refresh -s zfs2
real 0m0.790s
# Would expect this to (currently) perform a full copy again, as there are no snapshots.
time lxc copy c1 c2 --refresh -s zfs2
real 0m0.890s
# Now let's add a snapshot and try again.
# We would expect this to take the same time as a full copy too, as the missing snapshot needs to be transferred.
lxc snapshot c1
time lxc copy c1 c2 --refresh -s zfs2
real 0m0.895s
# Now let's run the refresh again.
# It should be quicker as it should transfer only the (minimal) differences between the snapshot and main volume.
time lxc copy c1 c2 --refresh -s zfs2
real 0m0.261s
We can see ZFS pool transfers are working correctly when used with snapshots.
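For context, ZFS optimized transfer boils down to incremental snapshot sends, roughly like this (a conceptual sketch, not LXD's exact invocation; dataset and snapshot names are illustrative):

# Full send of the first snapshot, then an incremental send containing only
# the blocks that changed between snap0 and snap1.
zfs send zfs1/containers/c1@snap0 | zfs receive zfs2/containers/c2
zfs send -i @snap0 zfs1/containers/c1@snap1 | zfs receive zfs2/containers/c2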
Let's try the same now with BTRFS:
lxc storage create btrfs1 btrfs
lxc storage create btrfs2 btrfs
lxc launch images:ubuntu/jammy c1 -s btrfs1
# Perform initial full copy.
time lxc copy c1 c2 --refresh -s btrfs2
real 0m3.070s
# Would expect this to (currently) perform a full copy again, as there are no snapshots.
time lxc copy c1 c2 --refresh -s btrfs2
real 0m3.041s
# Now let's add a snapshot and try again.
# We would expect this to take the same time as a full copy too, as the missing snapshot needs to be transferred.
lxc snapshot c1
time lxc copy c1 c2 --refresh -s btrfs2
real 0m3.273s
# Now let's run the refresh again.
# It should be quicker as it should transfer only the (minimal) differences between the snapshot and main volume.
time lxc copy c1 c2 --refresh -s btrfs2
real 0m3.124s
Oh dear, it's the same as doing a full copy again. So optimized refresh appears broken for BTRFS.
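For comparison, the BTRFS optimized path is expected to use incremental sends against a parent snapshot, roughly like this (a conceptual sketch only, not LXD's exact invocation; subvolume paths are illustrative and both sources must be read-only snapshots):

# Incremental send: only the extents that differ from the parent snapshot
# should be streamed to the receiving pool.
btrfs send -p /pool1/snapshots/c1/snap0 /pool1/snapshots/c1/snap1 | btrfs receive /pool2/containers/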
Related to https://github.com/lxc/lxd/issues/10186
Hmm, yeah, I think we'd be better off using rsync for containers when using --instance-only and --refresh. For VMs, we should still use the optimized driver, as in either case we're dealing with a full transfer and optimized will be smaller/faster.
OK thanks, so there are 3 parts to this issue:

1. Fix optimized refresh between pools so it doesn't re-transfer the whole volume (currently broken for BTRFS).
2. Switch to rsync when using --refresh with --instance-only (or if there are no snapshots), for containers only.
3. @ru-fu we could probably do with updating https://linuxcontainers.org/lxd/docs/master/reference/storage_drivers/#storage-optimized-instance-transfer with a section describing how this works for instance refreshes.
@tomp I think I need a bit more input here. ;)
From what I can gather, we're currently using the optimized image transfer (for the drivers that support it) both for the initial copy and a refresh. But the issue is that if we don't have snapshots (or don't want to transfer them), the refresh transfers everything and not only the diff. That sounds like a bug to me and nothing that needs to be documented?
If we now change it to use rsync if there are no snapshots, then we need a doc update that says that even if optimized image transfer is available, we won't use it if there are no snapshots to transfer (I guess because the optimized transfer is more efficient only if we're transferring big files). Is that correct?
But the issue is that if we don't have snapshots (or don't want to transfer them), the refresh transfers everything and not only the diff. That sounds like a bug to me and nothing that needs to be documented?
Yes indeed, that is a bug, and is the primary concern of this issue.
However I think it could be useful to tweak the docs that we have to explain in more detail what the optimized transfer means.
In particular, that it depends on having snapshots on the source and that they are transferred as part of the refresh (i.e. not using --instance-only
mode).
There have been a few examples in the forum of confusion around the --refresh
behaviour when going between pools that support optimized transfer. People didn't realise that optimized transfer depends on having snapshots.
If we now change it to use rsync if there are no snapshots, then we need a doc update that says that even if optimized image transfer is available, we won't use it if there are no snapshots to transfer (I guess because the optimized transfer is more efficient only if we're transferring big files). Is that correct?
Yes, that's exactly right. The optimized transfer mechanism depends on sending only the differences between snapshots and the main volume. If there are no snapshots (or the user isn't sending them because of --instance-only), then we can't rely on the driver-level differential approach. Instead, for containers and filesystem volumes we will fall back to a file-based differential using rsync, and for VMs and block volumes we will continue to transfer the full volume using raw block copies.
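Conceptually, the fallback being described is a file-level differential for filesystem volumes versus a full block stream for block volumes, along these lines (a sketch only, not LXD's actual implementation; hosts, device names and paths are illustrative):

# Containers / filesystem volumes: rsync transfers only files whose contents
# changed since the previous refresh.
rsync -a --delete /var/lib/lxd/storage-pools/pool1/containers/c1/ target:/var/lib/lxd/storage-pools/pool2/containers/c1/

# VMs / block volumes: the whole block device is streamed again each time.
dd if=/dev/vg/c1-vm-block bs=4M | ssh target "dd of=/dev/vg/c1-vm-block bs=4M"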
OK, I attempted to add something, but I'm still not sure I fully understand ...
https://github.com/lxc/lxd/pull/11323
Do the snapshots need to be part of the transfer? (But will we then really save much if a user creates a snapshot right before the transfer?) Or do they just need to exist on the source server? (But they would need to be copied to the target server as well or the diff won't make sense ...) Or do we just need one snapshot on the target server? (But how does the optimized transfer work for the first copy - not refresh - then?)
Answered on the PR
Required information
The output of "lxc info" or if that fails:

config:
  core.https_address: '[::]:8443'
  core.trust_password: true
  images.auto_update_interval: "0"
api_extensions:
storage_zfs_remove_snapshots
container_host_shutdown_timeout
container_stop_priority
container_syscall_filtering
auth_pki
container_last_used_at
etag
patch
usb_devices
https_allowed_credentials
image_compression_algorithm
directory_manipulation
container_cpu_time
storage_zfs_use_refquota
storage_lvm_mount_options
network
profile_usedby
container_push
container_exec_recording
certificate_update
container_exec_signal_handling
gpu_devices
container_image_properties
migration_progress
id_map
network_firewall_filtering
network_routes
storage
file_delete
file_append
network_dhcp_expiry
storage_lvm_vg_rename
storage_lvm_thinpool_rename
network_vlan
image_create_aliases
container_stateless_copy
container_only_migration
storage_zfs_clone_copy
unix_device_rename
storage_lvm_use_thinpool
storage_rsync_bwlimit
network_vxlan_interface
storage_btrfs_mount_options
entity_description
image_force_refresh
storage_lvm_lv_resizing
id_map_base
file_symlinks
container_push_target
network_vlan_physical
storage_images_delete
container_edit_metadata
container_snapshot_stateful_migration
storage_driver_ceph
storage_ceph_user_name
resource_limits
storage_volatile_initial_source
storage_ceph_force_osd_reuse
storage_block_filesystem_btrfs
resources
kernel_limits
storage_api_volume_rename
macaroon_authentication
network_sriov
console
restrict_devlxd
migration_pre_copy
infiniband
maas_network
devlxd_events
proxy
network_dhcp_gateway
file_get_symlink
network_leases
unix_device_hotplug
storage_api_local_volume_handling
operation_description
clustering
event_lifecycle
storage_api_remote_volume_handling
nvidia_runtime
container_mount_propagation
container_backup
devlxd_images
container_local_cross_pool_handling
proxy_unix
proxy_udp
clustering_join
proxy_tcp_udp_multi_port_handling
network_state
proxy_unix_dac_properties
container_protection_delete
unix_priv_drop
pprof_http
proxy_haproxy_protocol
network_hwaddr
proxy_nat
network_nat_order
container_full
candid_authentication
backup_compression
candid_config
nvidia_runtime_config
storage_api_volume_snapshots
storage_unmapped
projects
candid_config_key
network_vxlan_ttl
container_incremental_copy
usb_optional_vendorid
snapshot_scheduling
snapshot_schedule_aliases
container_copy_project
clustering_server_address
clustering_image_replication
container_protection_shift
snapshot_expiry
container_backup_override_pool
snapshot_expiry_creation
network_leases_location
resources_cpu_socket
resources_gpu
resources_numa
kernel_features
id_map_current
event_location
storage_api_remote_volume_snapshots
network_nat_address
container_nic_routes
rbac
cluster_internal_copy
seccomp_notify
lxc_features
container_nic_ipvlan
network_vlan_sriov
storage_cephfs
container_nic_ipfilter
resources_v2
container_exec_user_group_cwd
container_syscall_intercept
container_disk_shift
storage_shifted
resources_infiniband
daemon_storage
instances
image_types
resources_disk_sata
clustering_roles
images_expiry
resources_network_firmware
backup_compression_algorithm
ceph_data_pool_name
container_syscall_intercept_mount
compression_squashfs
container_raw_mount
container_nic_routed
container_syscall_intercept_mount_fuse
container_disk_ceph
virtual-machines
image_profiles
clustering_architecture
resources_disk_id
storage_lvm_stripes
vm_boot_priority
unix_hotplug_devices
api_filtering
instance_nic_network
clustering_sizing
firewall_driver
projects_limits
container_syscall_intercept_hugetlbfs
limits_hugepages
container_nic_routed_gateway
projects_restrictions
custom_volume_snapshot_expiry
volume_snapshot_scheduling
trust_ca_certificates
snapshot_disk_usage
clustering_edit_roles
container_nic_routed_host_address
container_nic_ipvlan_gateway
resources_usb_pci
resources_cpu_threads_numa
resources_cpu_core_die
api_os
container_nic_routed_host_table
container_nic_ipvlan_host_table
container_nic_ipvlan_mode
resources_system
images_push_relay
network_dns_search
container_nic_routed_limits
instance_nic_bridged_vlan
network_state_bond_bridge
usedby_consistency
custom_block_volumes
clustering_failure_domains
resources_gpu_mdev
console_vga_type
projects_limits_disk
network_type_macvlan
network_type_sriov
container_syscall_intercept_bpf_devices
network_type_ovn
projects_networks
projects_networks_restricted_uplinks
custom_volume_backup
backup_override_name
storage_rsync_compression
network_type_physical
network_ovn_external_subnets
network_ovn_nat
network_ovn_external_routes_remove
tpm_device_type
storage_zfs_clone_copy_rebase
gpu_mdev
resources_pci_iommu
resources_network_usb
resources_disk_address
network_physical_ovn_ingress_mode
network_ovn_dhcp
network_physical_routes_anycast
projects_limits_instances
network_state_vlan
instance_nic_bridged_port_isolation
instance_bulk_state_change
network_gvrp
instance_pool_move
gpu_sriov
pci_device_type
storage_volume_state
network_acl
migration_stateful
disk_state_quota
storage_ceph_features
projects_compression
projects_images_remote_cache_expiry
certificate_project
network_ovn_acl
projects_images_auto_update
projects_restricted_cluster_target
images_default_architecture
network_ovn_acl_defaults
gpu_mig
project_usage
network_bridge_acl
warnings
projects_restricted_backups_and_snapshots
clustering_join_token
clustering_description
server_trusted_proxy
clustering_update_cert
storage_api_project
server_instance_driver_operational
server_supported_storage_drivers
event_lifecycle_requestor_address
resources_gpu_usb
clustering_evacuation
network_ovn_nat_address
network_bgp
network_forward
custom_volume_refresh
network_counters_errors_dropped
metrics
image_source_project
clustering_config
network_peer
linux_sysctl
network_dns
ovn_nic_acceleration
certificate_self_renewal
instance_project_move
storage_volume_project_move
cloud_init
network_dns_nat
database_leader
instance_all_projects
clustering_groups
ceph_rbd_du
instance_get_full
qemu_metrics
gpu_mig_uuid
event_project
clustering_evacuation_live
instance_allow_inconsistent_copy
network_state_ovn
storage_volume_api_filtering
image_restrictions
storage_zfs_export
network_dns_records
storage_zfs_reserve_space
network_acl_log
storage_zfs_blocksize
metrics_cpu_seconds
instance_snapshot_never
certificate_token
instance_nic_routed_neighbor_probe
event_hub
agent_nic_config
projects_restricted_intercept
metrics_authentication
images_target_project
cluster_migration_inconsistent_copy
cluster_ovn_chassis
container_syscall_intercept_sched_setscheduler
storage_lvm_thinpool_metadata_size
storage_volume_state_total
instance_file_head
instances_nic_host_name
image_copy_profile
container_syscall_intercept_sysinfo
clustering_evacuation_mode
resources_pci_vpd
qemu_raw_conf
storage_cephfs_fscache
network_load_balancer
vsock_api
instance_ready_state
network_bgp_holdtime
storage_volumes_all_projects
metrics_memory_oom_total
storage_buckets
storage_buckets_create_credentials
metrics_cpu_effective_total
projects_networks_restricted_access
storage_buckets_local
loki
acme
internal_metrics
cluster_join_token_expiry
remote_token_expiry
init_preseed
storage_volumes_created_at
cpu_hotplug
projects_networks_zones
network_txqueuelen
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
  tls
environment:
  addresses:
    ****:8443
  architectures:
    x86_64
    i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    -----END CERTIFICATE-----
  certificate_fingerprint: ****
  driver: qemu | lxc
  driver_version: 7.1.0 | 5.0.2
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.15.0-58-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "22.04"
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: backup
  server_pid: 2606
  server_version: "5.10"
  storage: btrfs
  storage_version: 5.4.1
  storage_supported_drivers:
name: btrfs version: 5.4.1 remote: false
name: ceph version: 15.2.17 remote: true
name: cephfs version: 15.2.17 remote: true
name: cephobject version: 15.2.17 remote: true
name: dir version: "1" remote: false
name: lvm version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.45.0 remote: false
name: zfs version: 2.1.4-0ubuntu0.1 remote: false
Storage backend in use: btrfs
Issue description
After upgrading most of my servers from lxd 4.0.9 to 5.10, lxc copy --refresh does not transmit only the changes since the last copy, but always transmits all data. Both servers are using btrfs for storage. Data on the target is overwritten completely every time, so any existing deduped reflinks are duplicated again. On lxd 4.0.9 the copy was done via rsync; now it's replaced with btrfs send & receive. For me, this renders lxc copy unusable, since I use it to back up and replicate containers via VPN, and that would need to transfer ~400GB/day and duplicate any (manually created) btrfs snapshots of my backup history. Transfers between an old lxd 4.0.9 and the new 5.10 still use rsync. Is there a way to get back to the old behavior of lxd 4.0.9?
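One way to observe the effect on the backup server is to compare shared versus exclusive usage of the target subvolume before and after a refresh (the pool path is illustrative):

# Shows total, exclusive and shared space for the container subvolume; after a
# full re-send the exclusive figure grows because reflinks to the older backup
# snapshots are broken.
btrfs filesystem du -s /var/lib/lxd/storage-pools/lxd/containers/container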
Steps to reproduce
LAN copy of ~8.5GB:

root@backup:~# time lxc copy remote:container container --refresh --instance-only -c boot.autostart=false -q -s lxd

real 5m6.173s
user 0m0.126s
sys 0m0.462s

Subsequent repetitions only differ slightly.