apache / cloudstack

Apache CloudStack is an open-source Infrastructure as a Service (IaaS) cloud computing platform.
https://cloudstack.apache.org/
Apache License 2.0

KVM incremental snapshot feature #9270

Open JoaoJandre opened 1 week ago

JoaoJandre commented 1 week ago

Description

This PR solves issue #8907.

Currently, when taking a volume snapshot/backup with KVM as the hypervisor, it is always a full snapshot/backup. However, always taking full snapshots of volumes is costly for both the storage network and storage systems. To solve the aforementioned issues, this PR extends the volume snapshot feature in KVM, allowing users to create incremental volume snapshots using KVM as a hypervisor.

To give operators control over which type of snapshot is created, a new global setting, kvm.incremental.snapshot, has been added; it can be changed at the zone and cluster scopes and is false by default. Also, the snapshot.delta.max configuration, previously used to control the maximum number of deltas when using XenServer, was extended to also limit the length of the snapshot backing chain on primary/secondary storage.
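
The interaction between the two settings can be sketched as follows. This is a minimal illustrative model, not the PR's actual code; the function name and parameters are hypothetical, with `delta_max` standing in for `snapshot.delta.max`:

```python
def next_snapshot_type(incremental_enabled: bool, chain_length: int, delta_max: int) -> str:
    """Decide whether the next volume snapshot is full or incremental.

    chain_length is the number of snapshots already in the current
    backing chain; delta_max mirrors the snapshot.delta.max setting.
    """
    if not incremental_enabled or chain_length == 0:
        # Feature disabled, or there is no snapshot to base a delta on.
        return "full"
    if chain_length >= delta_max:
        # Cap the backing chain so it cannot grow without bound.
        return "full"
    return "incremental"
```

With kvm.incremental.snapshot = true and snapshot.delta.max = 3 (the values used in the tests below), a chain would hold a full snapshot plus up to two incrementals before a new full snapshot is started.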

This functionality is only available in environments with Libvirt 7.6.0+ and QEMU 6.1+. If the kvm.incremental.snapshot setting is true and the hosts do not have the required Libvirt and QEMU versions, an error will be thrown when trying to take a snapshot. Additionally, this functionality is only available when using file-based storage, such as shared mount point (iSCSI and FC, which require a shared-mount-point storage file system for KVM, such as OCFS2 or GlusterFS), NFS, and local storage. Other storage types for KVM, such as CLVM and RBD, need different approaches to enable incremental backups; therefore, they are not currently supported.
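
A host-side eligibility check along these lines could gate the feature. This is a sketch under the stated version minimums only; the function name is hypothetical and the real check in the CloudStack KVM agent may differ:

```python
def supports_incremental_snapshots(libvirt_version: str, qemu_version: str) -> bool:
    """Return True when the host meets the stated minimums:
    Libvirt 7.6.0+ and QEMU 6.1+."""
    def parse(version: str) -> tuple:
        # "7.6.0" -> (7, 6, 0); tuple comparison handles e.g. 7.10 > 7.6.
        return tuple(int(part) for part in version.split("."))
    return parse(libvirt_version) >= (7, 6, 0) and parse(qemu_version) >= (6, 1)
```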

Issue #8907 has more details and flowcharts of all the mapped workflows.

Types of changes

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

Bug Severity

Screenshots (if appropriate):

How Has This Been Tested?

Description of tests

During testing, the kvm.incremental.snapshot setting was changed to true and the snapshot.delta.max setting was changed to 3.

Tests with snapshot.backup.to.secondary = false

For the tests in this section, a test VM was created and reused for all tests.

Snapshot creation tests

| # | Test | Result |
|---|---|---|
| 1 | Access the VM, create a file in it, and create volume snapshot 1 while the VM is running | Full snapshot created |
| 2 | Access the VM, create a second file in it, and create volume snapshot 2 while the VM is running | Incremental snapshot created with the correct size and backing chain (snapshot 1) |
| 3 | Stop the VM and create volume snapshot 3 | Incremental snapshot correctly created |
| 4 | Start the VM again and create volume snapshot 4 | Full snapshot created |
| 5 | Migrate the VM and create volume snapshot 5 | Incremental snapshot created from snapshot 4 |
| 6 | Migrate the VM + ROOT volume | Exception |

Snapshot restore tests

| # | Test | Result |
|---|---|---|
| 1 | Access the VM, delete all previously created files, stop the VM, restore snapshot 1, and start the VM again | Restoration performed correctly; the file created in snapshot creation test 1 was present on the volume |
| 2 | Access the VM, delete the file restored in test 1, stop the VM, restore snapshot 2, and start the VM again | Restoration performed correctly; the files created in snapshot creation tests 1 and 2 were present on the volume |
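
The restore results above follow from the backing-chain structure: restoring an incremental snapshot implicitly applies every ancestor down to the last full snapshot. A hypothetical model (names and data structures are illustrative, not CloudStack's):

```python
def chain_for_restore(snapshot: str, parents: dict) -> list:
    """Return the backing chain (oldest first) needed to restore a snapshot.

    parents maps each incremental snapshot to the snapshot it was taken
    against; full snapshots are absent from the mapping.
    """
    chain = []
    current = snapshot
    while current is not None:
        chain.append(current)
        current = parents.get(current)
    return list(reversed(chain))

# Snapshot 1 is full; snapshots 2 and 3 are incrementals on top of it.
parents = {"snap2": "snap1", "snap3": "snap2"}
```

This is why restoring snapshot 2 also brings back the file captured in snapshot 1.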

Snapshot removal tests

| # | Test | Result |
|---|---|---|
| 1 | Delete snapshot 5 | Snapshot deleted and removed from storage |
| 2 | Delete snapshot 1 | Snapshot deleted but not removed from storage |
| 3 | Delete snapshots 2 and 3 | Snapshots deleted and removed from storage; snapshot 1 was also removed from storage |
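
The removal results are consistent with reference-counted garbage collection of the backing chain: a deleted snapshot's file stays on storage while any remaining snapshot still has it in its chain, and is physically removed once the last dependant is gone. A minimal model, illustrative only and not CloudStack code:

```python
def removable_from_storage(snapshots: set, deleted: set, parents: dict) -> set:
    """Return the deleted snapshots whose files can be removed from storage."""
    def chain(snap):
        # Walk from a snapshot down to the full snapshot at the chain's root.
        while snap is not None:
            yield snap
            snap = parents.get(snap)
    still_needed = set()
    for snap in snapshots:
        if snap not in deleted:
            still_needed.update(chain(snap))
    return {snap for snap in deleted if snap not in still_needed}

# Chains from the tests: 1 (full) <- 2 <- 3, and 4 (full) <- 5.
snapshots = {"s1", "s2", "s3", "s4", "s5"}
parents = {"s2": "s1", "s3": "s2", "s5": "s4"}
```

Under this model, deleting snapshot 1 alone leaves its file on storage (snapshots 2 and 3 still need it), while subsequently deleting snapshots 2 and 3 frees snapshot 1's file as well, matching the table above.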

Template creation test

| # | Test | Result |
|---|---|---|
| 1 | Create a template from snapshot 4 and create a VM using the template | Template created correctly; the VM had the files created in the original VM |

Tests with snapshot.backup.to.secondary = true

All tests performed in the previous sections were repeated with snapshot.backup.to.secondary = true; in addition, two extra tests were performed. For the tests in this section, a test VM was created and reused for all tests.

Snapshot creation tests

| # | Test | Result |
|---|---|---|
| 1 | Migrate the VM + ROOT volume and take snapshot 6 | Migration carried out and full snapshot created |
| 2 | Stop the VM, migrate the volume, and take snapshot 7 | Volume migration performed and incremental snapshot created from snapshot 6 |

codecov[bot] commented 1 week ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 4.17%. Comparing base (de683a5) to head (717d330).

Additional details and impacted files:

```diff
@@           Coverage Diff            @@
##             main    #9270    +/-   ##
=========================================
  Coverage    4.17%    4.17%
=========================================
  Files         371      371
  Lines       30407    30407
  Branches     5384     5384
=========================================
  Hits         1269     1269
  Misses      28994    28994
  Partials      144      144
```

| [Flag](https://app.codecov.io/gh/apache/cloudstack/pull/9270/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
|---|---|---|
| [uitests](https://app.codecov.io/gh/apache/cloudstack/pull/9270/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `4.17% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.


weizhouapache commented 1 week ago

Good job @JoaoJandre

DaanHoogland commented 1 week ago

Good job @JoaoJandre

second that, tnx

DaanHoogland commented 1 week ago

not gotten through all of it yet but looks good so far.

DaanHoogland commented 1 week ago

@blueorangutan package

blueorangutan commented 1 week ago

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan commented 1 week ago

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10011

DaanHoogland commented 1 week ago

@blueorangutan test

blueorangutan commented 1 week ago

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan commented 1 week ago

[SF] Trillian test result (tid-10507) Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7 Total time taken: 53282 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9270-t10507-kvm-centos7.zip Smoke tests completed. 111 look OK, 23 have errors, 0 did not run Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_role_account_acls_multiple_mgmt_servers Error 2.28 test_dynamicroles.py
test_query_async_job_result Error 102.03 test_async_job.py
test_revoke_certificate Error 0.01 test_certauthority_root.py
test_configure_ha_provider_invalid Error 0.02 test_hostha_simulator.py
test_configure_ha_provider_valid Error 0.01 test_hostha_simulator.py
test_ha_configure_enabledisable_across_clusterzones Error 0.01 test_hostha_simulator.py
test_ha_disable_feature_invalid Error 0.01 test_hostha_simulator.py
test_ha_enable_feature_invalid Error 0.01 test_hostha_simulator.py
test_ha_list_providers Error 0.01 test_hostha_simulator.py
test_ha_multiple_mgmt_server_ownership Error 0.01 test_hostha_simulator.py
test_ha_verify_fsm_available Error 0.01 test_hostha_simulator.py
test_ha_verify_fsm_degraded Error 0.01 test_hostha_simulator.py
test_ha_verify_fsm_fenced Error 0.01 test_hostha_simulator.py
test_ha_verify_fsm_recovering Error 0.01 test_hostha_simulator.py
test_hostha_configure_default_driver Error 0.01 test_hostha_simulator.py
test_hostha_configure_invalid_provider Error 0.01 test_hostha_simulator.py
test_hostha_disable_feature_valid Error 0.01 test_hostha_simulator.py
test_hostha_enable_feature_valid Error 0.01 test_hostha_simulator.py
test_hostha_enable_feature_without_setting_provider Error 0.01 test_hostha_simulator.py
test_list_ha_for_host Error 0.01 test_hostha_simulator.py
test_list_ha_for_host_invalid Error 0.01 test_hostha_simulator.py
test_list_ha_for_host_valid Error 0.01 test_hostha_simulator.py
test_01_host_ping_on_alert Error 0.08 test_host_ping.py
test_01_host_ping_on_alert Error 0.08 test_host_ping.py
test_01_browser_migrate_template Error 15.32 test_image_store_object_migration.py
test_01_invalid_upgrade_kubernetes_cluster Failure 251.12 test_kubernetes_clusters.py
test_02_upgrade_kubernetes_cluster Failure 241.89 test_kubernetes_clusters.py
test_03_deploy_and_scale_kubernetes_cluster Failure 241.81 test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster Failure 231.70 test_kubernetes_clusters.py
test_05_basic_lifecycle_kubernetes_cluster Failure 222.51 test_kubernetes_clusters.py
test_06_delete_kubernetes_cluster Failure 243.68 test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster Failure 347.95 test_kubernetes_clusters.py
test_10_vpc_tier_kubernetes_cluster Failure 232.24 test_kubernetes_clusters.py
test_11_test_unmanaged_cluster_lifecycle Error 91.64 test_kubernetes_clusters.py
test_01_add_delete_kubernetes_supported_version Error 0.14 test_kubernetes_supported_versions.py
ContextSuite context=TestListIdsParams>:teardown Error 1.12 test_list_ids_parameter.py
login_test_saml_user Error 3.20 test_login.py
test_01_deployVMInSharedNetwork Error 77.64 test_network.py
test_03_destroySharedNetwork Failure 1.07 test_network.py
ContextSuite context=TestSharedNetwork>:teardown Error 2.17 test_network.py
test_oobm_issue_power_cycle Error 2.31 test_outofbandmanagement_nestedplugin.py
test_oobm_issue_power_off Error 3.31 test_outofbandmanagement_nestedplugin.py
test_oobm_issue_power_on Error 3.33 test_outofbandmanagement_nestedplugin.py
test_oobm_issue_power_reset Error 3.34 test_outofbandmanagement_nestedplugin.py
test_oobm_issue_power_soft Error 3.30 test_outofbandmanagement_nestedplugin.py
test_oobm_issue_power_status Error 1.22 test_outofbandmanagement_nestedplugin.py
test_oobm_background_powerstate_sync Failure 21.68 test_outofbandmanagement.py
test_oobm_background_powerstate_sync Error 21.69 test_outofbandmanagement.py
test_oobm_configure_default_driver Error 0.06 test_outofbandmanagement.py
test_oobm_configure_invalid_driver Error 0.05 test_outofbandmanagement.py
test_oobm_disable_feature_invalid Error 0.05 test_outofbandmanagement.py
test_oobm_disable_feature_valid Error 1.15 test_outofbandmanagement.py
test_oobm_enable_feature_invalid Error 0.04 test_outofbandmanagement.py
test_oobm_enable_feature_valid Error 1.11 test_outofbandmanagement.py
test_oobm_enabledisable_across_clusterzones Error 11.88 test_outofbandmanagement.py
test_oobm_enabledisable_across_clusterzones Error 11.88 test_outofbandmanagement.py
test_oobm_issue_power_cycle Error 4.34 test_outofbandmanagement.py
test_oobm_issue_power_cycle Error 4.34 test_outofbandmanagement.py
test_oobm_issue_power_off Error 4.35 test_outofbandmanagement.py
test_oobm_issue_power_off Error 4.35 test_outofbandmanagement.py
test_oobm_issue_power_on Error 2.32 test_outofbandmanagement.py
test_oobm_issue_power_on Error 2.33 test_outofbandmanagement.py
test_oobm_issue_power_reset Error 4.33 test_outofbandmanagement.py
test_oobm_issue_power_reset Error 4.33 test_outofbandmanagement.py
test_oobm_issue_power_soft Error 4.34 test_outofbandmanagement.py
test_oobm_issue_power_soft Error 4.34 test_outofbandmanagement.py
test_oobm_issue_power_status Error 4.36 test_outofbandmanagement.py
test_oobm_issue_power_status Error 4.36 test_outofbandmanagement.py
test_oobm_multiple_mgmt_server_ownership Error 1.16 test_outofbandmanagement.py
test_oobm_multiple_mgmt_server_ownership Error 1.16 test_outofbandmanagement.py
test_oobm_zchange_password Error 2.27 test_outofbandmanagement.py
test_oobm_zchange_password Error 2.27 test_outofbandmanagement.py
test_02_edit_primary_storage_tags Error 0.01 test_primary_storage.py
test_01_vpc_privategw_acl Error 0.03 test_privategw_acl_ovs_gre.py
test_03_vpc_privategw_restart_vpc_cleanup Error 0.02 test_privategw_acl_ovs_gre.py
test_05_vpc_privategw_check_interface Error 0.02 test_privategw_acl_ovs_gre.py
test_01_vpc_privategw_acl Error 53.57 test_privategw_acl.py
test_02_vpc_privategw_static_routes Error 213.98 test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup Error 209.22 test_privategw_acl.py
test_04_rvpc_privategw_static_routes Error 337.92 test_privategw_acl.py
test_01_snapshot_root_disk Error 1.14 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 50.05 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 50.05 test_snapshots.py
ContextSuite context=TestSnapshotStandaloneBackup>:setup Error 170.10 test_snapshots.py
test_CreateTemplateWithDuplicateName Error 21.75 test_templates.py
test_01_register_template_direct_download_flag Error 0.16 test_templates.py
test_01_positive_tests_usage Error 10.51 test_usage_events.py
test_01_ISO_usage Error 1.08 test_usage.py
test_01_lb_usage Error 4.25 test_usage.py
test_01_nat_usage Error 8.33 test_usage.py
test_01_public_ip_usage Error 1.07 test_usage.py
test_01_snapshot_usage Error 3.18 test_usage.py
test_01_template_usage Error 13.47 test_usage.py
test_01_vm_usage Error 134.27 test_usage.py
test_01_volume_usage Error 125.61 test_usage.py
test_01_vpn_usage Error 9.58 test_usage.py
test_12_start_vm_multiple_volumes_allocated Error 10.54 test_vm_life_cycle.py
test_01_vmschedule_create Error 0.09 test_vm_schedule.py
test_disable_oobm_ha_state_ineligible Error 0.05 test_hostha_kvm.py
test_hostha_configure_default_driver Error 0.04 test_hostha_kvm.py
test_hostha_enable_ha_when_host_disabled Error 0.04 test_hostha_kvm.py
test_hostha_enable_ha_when_host_disconected Error 0.04 test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance Error 0.06 test_hostha_kvm.py
test_hostha_kvm_host_degraded Error 0.04 test_hostha_kvm.py
test_hostha_kvm_host_fencing Error 0.04 test_hostha_kvm.py
test_hostha_kvm_host_recovering Error 0.04 test_hostha_kvm.py
test_remove_ha_provider_not_possible Error 0.04 test_hostha_kvm.py
weizhouapache commented 1 week ago

@blueorangutan test rocky8 kvm-rocky8

blueorangutan commented 1 week ago

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

alexandremattioli commented 1 week ago

@JoaoJandre nice one. Just one remark, the following sentence sounds contradictory to me "Additionally, this functionality is only available when using file based storage, such as shared mount-point (iSCSI and FC)", if it supports iSCSI and FC (through a shared mountpoint) it does support block storage, I think the phrasing could cause some confusion as to which types of storage are supported.

JoaoJandre commented 1 week ago

> @JoaoJandre nice one. Just one remark, the following sentence sounds contradictory to me "Additionally, this functionality is only available when using file based storage, such as shared mount-point (iSCSI and FC)", if it supports iSCSI and FC (through a shared mountpoint) it does support block storage, I think the phrasing could cause some confusion as to which types of storage are supported.

Hey @alexandremattioli, I understand your confusion. However, when using shared mount-point storage, as far as ACS is concerned, the storage is file-based: we will not be working with blocks directly, only with files (as ACS already does for shared mount point). The mentions in parentheses are there as examples of underlying storage that might be behind the shared mount point.

I have updated the description to add a little more context.

blueorangutan commented 1 week ago

[SF] Trillian test result (tid-10523) Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8 Total time taken: 47815 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9270-t10523-kvm-rocky8.zip Smoke tests completed. 131 look OK, 3 have errors, 0 did not run Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestListIdsParams>:teardown Error 1.15 test_list_ids_parameter.py
test_01_snapshot_root_disk Error 6.17 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 49.21 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 49.21 test_snapshots.py
ContextSuite context=TestSnapshotStandaloneBackup>:teardown Error 60.81 test_snapshots.py
test_01_snapshot_usage Error 26.05 test_usage.py
github-actions[bot] commented 6 days ago

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] commented 2 days ago

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

DaanHoogland commented 7 hours ago

hm, I've seen this a couple of times now; the bot removes and adds the has-conflicts label in the same second, and a PR without conflicts ends up being marked as having them :(