The problems with live migration are being fixed by #8908, #8909, #8911 and #8952
Impressive design @JoaoJandre. I scanned through it, so I may have missed this: is any coalescing occurring on the backed up snapshots for a VM? I.e., with snapshots 1, 2, 3, the user wants to delete snapshot 2. In your chapter on deletion (see the deletion diagram), the snapshot remains but becomes hidden because the DB entry is deleted. When the volume is completely deleted, or the user wants to create a snapshot, how will the 'hidden' increment be handled by CloudStack?
@DaanHoogland, there will be no coalescing on the backend, except when the user decides to restore the snapshot (see section 2.2). Also, the DB entry will not be deleted: the entry on the `snapshots` table will be marked as destroyed, so that it is not listed, but the entry on `snapshot_store_ref` will remain; this way we will not lose the reference to where the snapshot is.
In your example, if the snapshots are on primary storage, when the volume is deleted, all of the snapshots will get deleted as well (this is the current behavior); if the snapshots are on secondary storage, they will remain (this is also the current behavior), and the rules described in the deletion diagram still apply.
About the snapshot creation after the user deletes one of the snapshots in the middle of the chain: this will not have any relevant effect. The snapshot can still be created as usual (all the files are unchanged); the size of the snapshot chain will be calculated using the `parent_snapshot_id` column on `snapshot_store_ref` (which, as explained before, will not be affected by the snapshot deletion in this case). The snapshot chain size calculation is actually already implemented in the code for Xen (see https://github.com/apache/cloudstack/blob/8ff2c018cc5b3fc69bcd8756695d04b384e46ab8/engine/storage/snapshot/src/main/java/org/apache/cloudstack/storage/snapshot/DefaultSnapshotStrategy.java#L152); we can use similar logic for the KVM implementation.
Ok, one more, will coalescing of snapshots on secondary be implemented at all? (and do we think we need it?)
I don't see a reason to coalesce snapshots on secondary. We would be losing the option to revert to a snapshot in the middle of the chain.
The only exception is when creating a template/volume from a snapshot; in this case, to not mess too much with the current implementation, we will create a temporary coalesced snapshot before sending the command to the SSVM. After the template/volume is created, the temporary coalesced snapshot will be removed.
One thing I was thinking of is when chains become very long, e.g. an automatic snapshot with scheduled deletes. Just a thought though; we can burn that bridge once we try to cross it.
The chain size will be controlled by `snapshot.delta.max` (which has a pretty high default value, in my opinion). When the chain reaches this threshold, a new one will be created, that is, a new full snapshot will be taken and the next snapshots will be based on this one.

About the snapshot policies, this is covered a little by section 2.1.1. But when `maxSnaps` is reached, the old snapshots will start to get removed (still following the deletion rules in section 2.4).
Ideally, incremental backups, or a base snapshot and a number of snapshot deltas, become abstract concepts in ACS, and drivers for primary storage can implement the operations for the abstract concepts.
My background is LINSTOR. LINSTOR leverages LVM's dm-thin to take snapshots; it uses thin-send-recv to generate snapshot deltas. You can find more about it here
But of course, many other primary storage solutions that are available for ACS probably also support continuous preservation of snapshot-deltas.
I'm not sure I got your point @Philipp-Reisner, but I'll not be touching any storage drivers in this implementation. This will be the native ACS implementation for NFS/Local/SharedMountPoint storages. We have plans to add incremental snapshots for Ceph, but then we will have to use Ceph's back-end for it. If others want to implement this feature for other drivers, they are free to do so.
My understanding is that there should be some methods in the storage interfaces. Storage providers would only need to implement the methods in their plugins, without touching the service layer.
I did not look into the design carefully. I think it is better to allow users to choose whether the snapshot is full or incremental when they create snapshots. Of course, you can add global settings to disable the options, which might be useful for cloud operators.
Hi @JoaoJandre, this feature would be a huge benefit to us, and we would be very keen to adopt it for our day-to-day workflows. The plan seems comprehensive, and we're super excited to see this soon!
OS / ENVIRONMENT
KVM, file storage (NFS, Shared mountpoint, local storage)
SUMMARY
This spec addresses a new feature to allow users to create differential volume snapshots/backups on KVM
1. Problem description
Currently, when taking a volume snapshot/backup with KVM as the hypervisor, ACS creates a temporary delta and makes it the VM's source file, with the original volume as a backing store. After that, the original volume is copied to another directory (with `qemu-img convert`); if the `snapshot.backup.to.secondary` configuration is set to `true`, the snapshot is then copied to the secondary storage, transforming it into a backup; the delta is then merged into the original volume. Using this approach, every volume snapshot is a full snapshot/backup. However, in many situations, always taking full snapshots of volumes is costly for both the storage network and the storage systems. ACS already executes differential snapshots for XenServer volumes. Therefore, the goal of this proposal is to extend the current workflow for the KVM integration, providing a feature set similar to what we have with XenServer.

For the sake of clarity, in this document, a snapshot refers to a copy of the volume kept on the primary storage, while a backup refers to a snapshot that has been copied to the secondary storage.
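As a rough illustration of the current behavior (paths and file names are hypothetical), the full copy boils down to a plain `qemu-img convert` of the whole volume, no matter how little of it has changed since the last snapshot:

```bash
# Current full-snapshot behavior (simplified): the entire volume is read and
# rewritten on every snapshot, so each snapshot costs the full volume size.
qemu-img convert -O qcow2 \
    /mnt/PRIMARY-STORAGE-UUID/volume-1234.qcow2 \
    /mnt/PRIMARY-STORAGE-UUID/snapshots/snap-1.qcow2
```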
2. Proposed changes
To address the described problems, we propose to extend the volume snapshot feature on KVM that was normalized by #5297, allowing users to create differential volume snapshots on KVM. To give operators fine control over which type of snapshot is being taken, we propose to add a new global configuration, `kvm.incremental.snapshot`, which can be overridden on the zone and cluster configuration levels; this configuration will be `false` by default.

Using XenServer as the hypervisor, the `snapshot.delta.max` configuration is used to determine the number of volume deltas that will be kept simultaneously in the primary storage. We propose to use the same configuration for the incremental snapshot feature on KVM, and use it to limit the size of the snapshot backing chain on the primary/secondary storage. We will also update the configuration description to specify that this configuration is only used with XenServer and KVM. The implications of the `snapshot.delta.max` configuration will be explained in the snapshot/backup creation section.

Also, it is important to notice that, while the `snapshot.delta.max` configuration will define the maximum number of deltas for a backing chain on the primary/secondary storage, the maximum number of snapshots that will be available to the user is defined by the account's snapshot limit. The interactions between recurring snapshots, configurations and account limits section addresses the relationship between account limits and configurations.

2.0.1. The DomainBackupBegin API
To allow incremental snapshots on KVM, we propose to use Libvirt's `domainBackupBegin` API. This API allows the creation of either full snapshots or incremental snapshots; it also allows the creation of checkpoints, which Libvirt uses to create incremental snapshots. A checkpoint represents a point in time after which blocks changed by the hypervisor are tracked. The checkpoints are Libvirt's abstraction of bitmaps, that is, a checkpoint always corresponds to a bitmap on the VM's volume.

The `domainBackupBegin` API has two main parameters that interest us:

- `backupXML`: this parameter contains details about the snapshot, including which snapshot mode to use, whether the snapshot is incremental from a previous checkpoint, which disks participate in the snapshot, and the snapshot destination.
- `checkpointXML`: when this parameter is informed, Libvirt creates a checkpoint atomically covering the same point in time as the backup.

When using Libvirt's `domainBackupBegin` API, if the `backupXML` has the tag `<incremental>` informing the name of a valid checkpoint, an incremental snapshot is created based on that checkpoint. Furthermore, the API requires that the volume is attached to a VM that is running or paused, as it uses the VM's process (the QEMU process in the hypervisor operating system) to execute the volume snapshot.

Libvirt's checkpoints are always linked to a VM; this means that if we undefine or migrate the VM, they will be lost. However, the bitmap on the volume does not depend on the VM; thus, if we save the checkpoint metadata by using the `checkpointDumpXml` API, we can later use this XML to recreate the checkpoint on the VM after it is migrated or stopped/started on ACS[^stop-start-vm]. Therefore, even if the VM is migrated or recreated, we can continue to take incremental snapshots.

More information on the `domainBackupBegin` API can be found in the official documentation; more information on Libvirt's checkpoints can also be found in the official documentation.
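For illustration, the sketch below drives the same `domainBackupBegin` call through `virsh backup-begin`; the domain name, checkpoint names and target path are hypothetical, and the XML is a minimal example rather than what ACS would generate:

```bash
# Hypothetical incremental backup of VM "i-2-10-VM", based on the existing
# checkpoint "cp-1" and atomically creating checkpoint "cp-2".
cat > backup.xml <<'EOF'
<domainbackup mode='push'>
  <incremental>cp-1</incremental> <!-- omit this tag for a full backup -->
  <disks>
    <disk name='vda' backup='yes' type='file'>
      <target file='/mnt/STORAGE-UUID/snapshots/snap-2.qcow2'/>
      <driver type='qcow2'/>
    </disk>
  </disks>
</domainbackup>
EOF

cat > checkpoint.xml <<'EOF'
<domaincheckpoint>
  <name>cp-2</name> <!-- becomes the base of the next increment -->
  <disks>
    <disk name='vda' checkpoint='bitmap'/>
  </disks>
</domaincheckpoint>
EOF

# backup-begin wraps virDomainBackupBegin; both XMLs are applied atomically.
virsh backup-begin i-2-10-VM backup.xml checkpoint.xml
```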
2.0.2. Limitations

This feature will use the Libvirt `domainBackupBegin` API, which was introduced in version 7.2.0 and extended to allow incremental snapshots in version 7.6.0; furthermore, the incremental snapshot API needs qemu 6.1. Thus, this feature will only be available in environments with Libvirt 7.6.0+ and qemu 6.1+. If the `kvm.incremental.snapshot` configuration is `true`, but the hosts do not have the necessary Libvirt and qemu versions, an error will be raised when creating a snapshot.

As the snapshots do not contain the bitmaps that were used to create them, after reverting a volume using a snapshot, the volume will have no bitmaps; thus, we will need to start a new snapshot chain.
Furthermore, this feature will only be available when using file-based storage, such as shared mount point (iSCSI and FC), NFS and local storage. Other storage types for KVM, such as CLVM and RBD, need different approaches to allow incremental backups; therefore, those will not be contemplated in the proposed spec.
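Before enabling `kvm.incremental.snapshot` for a cluster, the host requirements above can be checked roughly as follows; binary paths vary per distribution, and the `domcapabilities` check is an assumption that only holds on recent Libvirt versions:

```bash
# Check the Libvirt and qemu versions on a KVM host (Libvirt >= 7.6.0 and
# qemu >= 6.1 are needed for incremental snapshots).
libvirtd --version
qemu-system-x86_64 --version 2>/dev/null || /usr/libexec/qemu-kvm --version

# Recent Libvirt versions also advertise backup support per domain type:
virsh domcapabilities | grep -i backup
```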
2.1. Snapshot/Backup creation
The current snapshot creation process is summarized in the diagram below. Its main flaw is that we always copy the snapshot to a directory on the primary storage, even if we will later copy it to the secondary storage, doubling the strain on the storage systems.
<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1706624986/specs/cloudstack/kvm-incremental-snapshots/Old_snapshot_creation_s1o0gr.png" alt="create-snapshot-old" style="width: 100%; height: auto;">
In the current workflow, the temporary delta is created through Libvirt's `DomainSnapshotCreateXML` API.

The proposed incremental snapshot creation workflow is summarized in the following diagram. We propose to optimize the current workflow, as well as add a new one that allows the creation of incremental snapshots.
<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1708696515/specs/cloudstack/kvm-incremental-snapshots/k7yfghgwuualip0unhzu.png" alt="create-snapshot" style="width: 100%; height: auto;">
If `kvm.incremental.snapshot` is false, we keep the old API usage and main logic, but we propose to copy the snapshot directly to the final destination, instead of always copying to the primary storage first.

If `kvm.incremental.snapshot` is true and the host does not have the minimum Libvirt and qemu versions, an exception will be thrown.

If `kvm.incremental.snapshot` is true and the host has the minimum Libvirt and qemu versions, the following workflow will be executed:

- If the volume is not attached to a running VM, we create a temporary (dummy) paused VM so that we can use the `domainBackupBegin` API. Also, to use the previous checkpoints, we must recreate them on the dummy VM. After the snapshot process is done, we will destroy the dummy VM.
- If the snapshot is the first of its chain, we call the `domainBackupBegin` API without referencing any old checkpoints; else, we call the same API, but referencing the last checkpoint. Either way, we make the API create the snapshot directly on the correct storage, based on the `snapshot.backup.to.secondary` configuration.
- The storage path used depends on the `snapshot.backup.to.secondary` configuration: if it is `false`, the primary storage path will always be used, as the snapshots will be on the primary storage; if the configuration is `true`, the same logic applies, using the secondary storage path.
- The checkpoint XML dump is edited and saved alongside the snapshot, on the primary or secondary storage, depending on the `snapshot.backup.to.secondary` configuration.

The process of editing the checkpoint dump and then redefining it on the VM is needed because, even though these metadata are not important to the backup, Libvirt will validate them if informed when recreating the checkpoints on other VMs. Also, we must manually edit the checkpoint parent because Libvirt always assumes that a new checkpoint is a child of the latest one, even if the checkpoints are not connected.
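For reference, a minimal sketch of the dump/edit/redefine cycle using `virsh` (domain and checkpoint names are hypothetical; ACS would do this through the Libvirt API rather than the CLI):

```bash
# Dump the checkpoint metadata before the VM is undefined or migrated.
# --no-domain drops the embedded <domain> definition, which Libvirt would
# otherwise validate against the target VM.
virsh checkpoint-dumpxml i-2-10-VM cp-2 --no-domain > cp-2.xml

# (At this point ACS would store the XML next to the snapshot and, if needed,
# edit the <parent> element so it matches the chain ACS actually keeps.)

# Recreate the checkpoint on the target (or dummy) VM without touching the
# bitmaps, which already live inside the qcow2 file.
virsh checkpoint-create i-2-10-VM cp-2.xml --redefine
```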
During the snapshot process, the volume will be in the `Snapshotting` state; while the volume is in this state, no other operations can be done with it, such as volume attach/detach. Also, if the volume is attached to a VM, the snapshot job is queued alongside the other VM jobs; therefore, we do not have to worry about the VM being stopped/started during the volume snapshot, as each job is processed sequentially for each given VM.

At runtime, if the `kvm.incremental.snapshot` configuration is changed from `false` to `true`, when taking a snapshot of a volume, a new snapshot chain will begin, that is, the next snapshot will be a full snapshot and the later ones will be incremental. If the configuration is changed from `true` to `false`, the current snapshot chains of the volumes will not be continued, and the future snapshots will be full snapshots.

As different clusters might have different Libvirt versions, the `kvm.incremental.snapshot` configuration can be overridden on the cluster level. If the `kvm.incremental.snapshot` configuration is true for a cluster that does not have the needed Libvirt version, an error will be raised informing that the configuration should be set to false on this cluster, as it does not support this feature. In any case, the snapshot reversion will be the same for any cluster, as it will still use the same APIs that are used today.
We propose to save the checkpoint as a file in the primary/secondary storage instead of directly in the database because the `domainBackupBegin` API needs a file as input; if we kept the checkpoint XML in the database, we would need to create a temporary file anyway. Furthermore, the checkpoints are only useful while their corresponding snapshot exists; if we lose the storage where the snapshot is, the checkpoint becomes useless, so keeping them together in the primary/secondary storage seems to be the best approach in the current scenario.

To persist the checkpoint location in the database, a new column will be added to the `snapshot_store_ref` table: `kvm_checkpoint_path`. This column will have the `varchar(255)` type (same as the `install_path` column) and will be `null` by default. When an incremental snapshot is taken in KVM, its corresponding checkpoint path will be saved in this column. Also, another column will be added to the same table: `end_of_chain`. This column will have the `int(1) unsigned` type and will be 1 by default; it will be used mainly when a process causes a chain to be severed and the next snapshot must be a full one, such as when restoring a snapshot.
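A rough SQL equivalent of the proposed schema change is sketched below; in practice this would live in an ACS schema upgrade script, and the exact constraints and database/user names are only assumptions here:

```bash
# Hypothetical, hand-run equivalent of the proposed columns on the "cloud" DB.
mysql -u cloud -p cloud <<'EOF'
ALTER TABLE `snapshot_store_ref`
    ADD COLUMN `kvm_checkpoint_path` varchar(255) DEFAULT NULL,
    ADD COLUMN `end_of_chain` int(1) unsigned NOT NULL DEFAULT 1;
EOF
```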
When working with incremental snapshots, the first snapshot in the snapshot chain will always be a full snapshot: this is needed as we must have something to start "incrementing" from. The maximum size of the snapshot chain on the primary/secondary storage will be limited by the `snapshot.delta.max` configuration; after this limit is reached, a new snapshot chain will be started, that is, the next snapshot will be a full snapshot. Also, to avoid having too many checkpoints on a VM, we will delete the old checkpoints when creating a new snapshot chain.

The reason why we decided to propose a limited snapshot chain instead of an unlimited one is that, while taking full snapshots is costly, the risk of eventually losing the base snapshot, and therefore all snapshots, increases with the size of the chain. This approach has been tested and validated by the industry; it is used in XenServer and in VMware with Veeam as the backup provider, for example. Nonetheless, taking a full snapshot from time to time is still far cheaper than always taking full snapshots.
Let us take the following example: the user has a volume with 500 GB allocated that grows 20 GB per day, and a recurring snapshot policy set to create snapshots every day and keep the last 7 days stored. Using full snapshots, by the end of the week they will be using 3.92 TB of storage; using incremental snapshots with `snapshot.delta.max` greater than 6, only 620 GB of storage would be used.
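For reference, the arithmetic behind the figures above (sizes in GB):

```bash
# Full snapshots: one copy of the whole volume per day, while the volume grows 20 GB/day.
total=0; size=500
for day in 1 2 3 4 5 6 7; do
  total=$((total + size))   # snapshot of the current volume size
  size=$((size + 20))       # volume grows before the next day's snapshot
done
echo "full snapshots: ${total} GB"                    # 3920 GB ~= 3.92 TB

# Incremental snapshots: one 500 GB full snapshot plus six 20 GB daily deltas.
echo "incremental snapshots: $((500 + 6 * 20)) GB"    # 620 GB
```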
2.1.1. Interactions between recurring snapshots/backups, configurations and account limits

This section will give some examples to illustrate the interactions between the user's snapshot/backup limit, the `snapshot.delta.max` configuration and the `maxSnaps` parameter of the `createSnapshotPolicy` API. When "removing" incremental backups, they might stay for a while in the primary/secondary storage; more about incremental backup deletion can be seen in the snapshot deletion section. The scenarios below compare different relations between `snapshot.delta.max` and `maxSnaps`:

- `snapshot.delta.max` equal to `maxSnaps`: as the `maxSnaps` limit was reached, the 1st backup will be logically removed; as `snapshot.delta.max` was reached, the new backup will be the start of a new backing chain.
- `snapshot.delta.max` greater than `maxSnaps`: as the `maxSnaps` limit was reached, the 1st backup will be logically removed; however, the new backup will still be an incremental backup. With `snapshot.delta.max` set to 7, for example, a new chain will only be started on the 8th backup, when `snapshot.delta.max` is reached.
- `snapshot.delta.max` smaller than `maxSnaps`: as `snapshot.delta.max` was reached, the new backup will be the start of a new backing chain; however, the 1st backup will only be removed after `maxSnaps` is reached.

2.2. Snapshot/Backup Reversion
The proposed new snapshot restore process is summarized in the diagram below.
<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1704718061/specs/cloudstack/kvm-incremental-snapshots/Backup_reversion1_3_nwwuwt.png" alt="create-snapshot" style="width: 100%; height: auto;">
There are two possibilities when restoring snapshots, both summarized in the diagram above. When the snapshot being restored is part of an incremental chain, we use the `qemu-img convert` command to create a consolidated copy of the snapshot, that is, the command will consolidate all the snapshot backing files and copy the result to where we need it.

After restoring a snapshot, we cannot continue an incremental snapshot chain; therefore, `end_of_chain` has to be marked as true on the latest snapshot created. This way, when creating the next snapshot, we will know that it must start a new chain.
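As a rough sketch of the consolidation step (paths and image names are hypothetical), `qemu-img convert` reads through the whole backing chain and writes a single standalone image:

```bash
# snap-3.qcow2 -> snap-2.qcow2 -> snap-1.qcow2 (full) is the backing chain;
# convert follows it and produces one consolidated image for the restore.
qemu-img convert -O qcow2 \
    /mnt/SECONDARY-STORAGE/snapshots/snap-3.qcow2 \
    /mnt/PRIMARY-STORAGE-UUID/restored-volume.qcow2
```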
2.3. Template/Volume creation from snapshot/backup

The current process of creating a template/volume from a snapshot can be adapted to allow creating from an incremental snapshot. The only difference is that we need to use `qemu-img convert` on the snapshot before sending the command to the SSVM, similar to what is currently done when the snapshot is not on the secondary storage. After the template/volume is created, we can remove the converted image from the secondary storage. The diagram below (taken from #5124) summarizes the current template/volume creation from snapshot/backup process.

<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1702059856/specs/cloudstack/kvm-incremental-snapshots/122423092-0b685c00-cf64-11eb-8a73-f86032a09412_f9xkkd.jpg" alt="create-snapshot" style="width: 100%; height: auto;">
2.4. Snapshot/Backup deletion
The diagram below summarizes the snapshot deletion process:
<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1703772199/specs/cloudstack/kvm-incremental-snapshots/New_Incremental_Backup_Deletion_1_o3f57h.png" alt="snapshot-deletion" style="width: 100%; height: auto;">
When deleting incremental snapshots, we have to check for two things: whether the snapshot has any ancestors, and whether it has any descendants.

After marking the snapshot as removed in the database, if it has any active descendants, we will keep it in the primary/secondary storage until those descendants are removed; only then will we delete the snapshot from the storage. If the snapshot does not have any descendants, we delete it immediately. We do this to preserve the descendants' ability to be restored; otherwise, the backing chain would be broken and all the descendants would be useless.

After checking for descendants, we check if the snapshot has any ancestors; if it does, we delete any ancestors that were removed in the database but were kept in storage.

The checkpoint deletion is directly linked to its corresponding incremental snapshot: we must keep the checkpoint until the snapshot is deleted, otherwise we will not be able to continue taking incremental snapshots after a VM migration or volume detach, for example.
2.5. Volume migration
2.5.1 Live volume migration
When a volume is created via linked clone, it has a source file and a backing file; currently, when live migrating from NFS to NFS, the same structure will be maintained on the new storage; otherwise, ACS consolidates the source and backing files while migrating. Also, before migrating the volume, ACS will check if the template is on the destination storage; if it is not, it will be copied there, even though this is unnecessary for consolidated volumes. Furthermore, the current live volume migration always migrates the VM's root volume, even if the user only requested a data disk migration, putting unnecessary strain on storage and network systems.
Moreover, when live migrating a volume of a VM that has any volume on NFS storage, if the volumes on NFS are not migrated, the migration will fail. This happens because the current migration command does not inform the `VIR_MIGRATE_PARAM_MIGRATE_DISKS` parameter, which specifies which volumes should be migrated alongside the VM migration; without this parameter, Libvirt will assume that all the volumes should be migrated, raising an error when it tries to overwrite the NFS volumes over themselves.

Furthermore, when migrating to an NFS storage, ACS will validate if the destination host has access to the source storage. This causes an issue when migrating from local storage to NFS storage, as the destination host will never have direct access to the source host's local storage. As the current volume migration process has several inconsistencies, it will be normalized alongside this feature. The current volume migration workflow is summarized in the diagram below.
<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1703772418/specs/cloudstack/kvm-incremental-snapshots/Old_Volume_Migration_4_i6fzz1.png" alt="migrate-volume-old" style="width: 100%; height: auto;">
We propose to normalize the migration behavior when migrating from file-based storage to file-based storage. We will always consolidate the volume with its backing file; thus, copying the template to the new storage will be unnecessary. This way the live volume migration will always have the same behavior. Furthermore, we will only migrate the VM's root volume when the user asks for it. Also, we will remove the special case of checking if the destination host has access to the source storage when the destination storage is NFS.
Moreover, we will change the migration API used from `virDomainMigrate2` to `virDomainMigrate3`; this API allows us to inform the `VIR_MIGRATE_PARAM_MIGRATE_DISKS` parameter to tell Libvirt to only migrate the volumes we want, therefore avoiding the aforementioned error with volumes on NFS storage.

As ACS's live volume migration also needs a VM migration on KVM, and Libvirt's migrate command does not guarantee that the volume bitmaps will be copied, after live migrating a volume we will have to start a new snapshot chain. The new migration workflow is summarized in the diagram below.
<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1710334355/specs/cloudstack/kvm-incremental-snapshots/New_Live_Volume_Migration_plpjks.png" alt="migrate-volume-new" style="width: 100%; height: auto;">
We will not allow volume migration when the volume has snapshots on the primary storage, as there are a few cases where this could bring inconsistencies. For example, if we live-migrate the VM and migrate the volume from local storage to a zone/cluster scope storage, the VM's destination host will not have access to the old snapshots, making them useless. This limitation is already present on the current implementation, where all the snapshots of a volume that is being migrated are listed from the database and if any of them are not located on the secondary storage, an exception will be raised.
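For reference, a minimal sketch of the `virsh` equivalent of informing `VIR_MIGRATE_PARAM_MIGRATE_DISKS` (domain name, destination URI and disk target are hypothetical):

```bash
# Live-migrate the VM while copying only the data disk "vdb"; the NFS-backed
# disks that both hosts already share are left out of the storage copy.
virsh migrate i-2-10-VM qemu+tcp://DEST-HOST/system \
    --live --persistent --copy-storage-all --migrate-disks vdb
```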
2.5.2 Cold volume migration
When performing a cold migration on a volume using KVM as the hypervisor, ACS will first use `qemu-img convert` to copy the volume to the secondary storage. Then, the volume will be copied to the destination primary storage. The diagram below summarizes the cold migration workflow.

<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1703779206/specs/cloudstack/kvm-incremental-snapshots/cold_volume_migration_zzhm7b.png" alt="cold-volume-migration-old" style="width: 100%; height: auto;">
The only change that we need is to add the `--bitmaps` parameter to the `qemu-img convert` command used, so that the volume keeps its existing bitmaps; otherwise, we would need to create a new backup chain for the next backup.
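For illustration (paths are hypothetical), the cold-migration copy would then look roughly like:

```bash
# --bitmaps copies the persistent dirty bitmaps that back the checkpoints;
# without it, the next snapshot of this volume would have to start a new chain.
qemu-img convert -O qcow2 --bitmaps \
    /mnt/SOURCE-PRIMARY-UUID/volume-1234.qcow2 \
    /mnt/SECONDARY-STORAGE/volume-1234.qcow2
```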
2.6. Checkpoint management

There are a few processes that will be tweaked to make checkpoints consistent:
[^absolute-path]: We can do this as NFS storage is always mounted using the same path across all hosts. The path is always `/mnt/<uuid>`, where `<uuid>` is derived from the NFS host and path. Also, for SharedMountPoint storages, the path must be the same as well.

[^stop-start-vm]: When a VM is stopped on ACS with KVM as the hypervisor, the VM actually gets undefined on Libvirt; later, when the VM is started, it gets recreated.