apache / cloudstack

Apache CloudStack is an open source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

KVM Disk-only VM Snapshots #9524

Open JoaoJandre opened 3 months ago

JoaoJandre commented 3 months ago
ISSUE TYPE
COMPONENT NAME
VM Snapshot
CLOUDSTACK VERSION
4.20/main
CONFIGURATION
OS / ENVIRONMENT

KVM, file storage (NFS, Shared mountpoint, local storage)

SUMMARY

This spec addresses an update to the disk-only VM snapshot feature on KVM.

1. Problem Description

Currently, with KVM as the hypervisor, CloudStack does not support disk-only snapshots of VMs with volumes in NFS or local storage, nor does it support VM snapshots of stopped VMs. This means that users who need some sort of snapshot of their volumes must use the volume snapshot/backup feature. Furthermore, the current implementation relies on the same workflows as volume snapshots/backups:

  1. The VM will be frozen (ignoring the quiesce parameter);
  2. Each volume will be processed individually using the volume snapshot workflow;
  3. Once all the snapshots are done, the VM will be resumed.

However, this approach is flawed: since we not only create the snapshots but also copy all of them to another directory, and the VM stays frozen during the whole process, it causes considerable downtime. This downtime can be extremely long if the volumes are big.

Moreover, as the snapshots are copied to another directory in the primary storage, reverting takes some time, since we need to copy the snapshot back.

1.1 Basic Definitions

Here are some basic definitions that will be used throughout this spec:

2. Proposed Changes

To address the described problems, we propose to extend the VM snapshot feature on KVM to allow disk-only VM snapshots for NFS and local storage; other types of storage, such as shared-mount-point, already support disk-only VM snapshots. Furthermore, we intend to change the disk-only VM snapshot process for all file-based storages (local, NFS and shared-mount-point):

  1. We will take all the snapshots at the same time, instead of one at a time.
  2. Unlike volume snapshots, the disk-only VM snapshots will not be copied to another directory; they will stay where they are once taken and become part of the volumes' backing chains. This makes reverting a snapshot much faster, as we only have to change the paths that the VM's DOM points to.
  3. The VM will only be frozen if the quiesceVM parameter is true.
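
As an illustration of item 1, a single libvirt external snapshot request can cover several disks at once. The sketch below only builds the `<domainsnapshot>` XML; it is not taken from the CloudStack code base, and the function name, device names and paths are hypothetical. How the actual implementation issues the request is not described in this spec.

```python
import xml.etree.ElementTree as ET


def build_snapshot_xml(name, disks):
    """Build a libvirt <domainsnapshot> document covering every disk.

    `disks` maps a target device (e.g. "vda") to the path of the new
    delta file that should be created on top of the current one.
    All names and paths here are illustrative assumptions.
    """
    root = ET.Element("domainsnapshot")
    ET.SubElement(root, "name").text = name
    # Disk-only snapshot: no memory state is saved.
    ET.SubElement(root, "memory", snapshot="no")
    disks_el = ET.SubElement(root, "disks")
    for dev, delta_path in disks.items():
        disk = ET.SubElement(disks_el, "disk", name=dev, snapshot="external")
        ET.SubElement(disk, "source", file=delta_path)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_snapshot_xml("vm-snap-1", {
        "vda": "/mnt/primary/root-delta.qcow2",
        "vdb": "/mnt/primary/data-delta.qcow2",
    }))
```

With the libvirt Python bindings, a document like this would typically be passed to `snapshotCreateXML` on the domain, so all listed disks are handled by one call instead of one call per volume.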

2.0.2. Limitations

2.1. Disk-only VM Snapshot Creation

The proposed disk-only VM snapshot creation workflow is summarized in the following diagram.

<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1715878023/specs/cloudstack/disk-only-vm-snapshot/vm_snapshot_creation_2_tftl1d.png" alt="create-snapshot" style="width: 100%; height: auto;">

Unlike the volume snapshots, the disk-only VM snapshots are not designed to be backups; thus, we will not copy the disk-only VM snapshots to another directory or storage. We want the disk-only snapshots to be fast to revert whenever needed, and keeping them in the volumes' backing chains is the best way to achieve this.

Currently, the VM is always frozen and resumed during the snapshot process, regardless of the value of the quiesceVM parameter. This will change: the VM will only be frozen if quiesceVM is true. Furthermore, the downtime of the proposed process will be orders of magnitude smaller than that of the current implementation, as nothing will be copied while the VM is frozen.
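
A minimal sketch of how the quiesceVM decision could translate into libvirt snapshot flags. The two flag names are real libvirt constants, but the function is hypothetical. The rule that quiescing only applies to a running VM is an assumption based on the guest agent needing a running guest, not something this spec states.

```python
def snapshot_flag_names(vm_running, quiesce_vm):
    """Return the names of the libvirt flags to OR together.

    Hypothetical helper: the VM is only frozen when the caller asked
    for it AND the guest is running (assumption, see lead-in).
    """
    flags = ["VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY"]
    if vm_running and quiesce_vm:
        flags.append("VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE")
    return flags


if __name__ == "__main__":
    print(snapshot_flag_names(vm_running=True, quiesce_vm=True))
    print(snapshot_flag_names(vm_running=True, quiesce_vm=False))
```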

During the VM snapshot process, the snapshot job is queued alongside the other VM jobs; therefore, we do not have to worry about the VM being stopped/started during the snapshot, as each job is processed sequentially for each given VM. Furthermore, after creating the VM snapshot, ACS already forbids detaching volumes from the VM, so we do not need to worry about this case either.

2.2. VM Snapshot Reversion

The proposed disk-only VM snapshot restore process is summarized in the diagram below. The process will be repeated for all the VM's volumes.

<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1715707430/specs/cloudstack/disk-only-vm-snapshot/vm_snapshot_reversion_1_iw5dej.png" alt="revert-snapshot" style="width: 100%; height: auto;">

  1. If the VM is running, we throw an exception; otherwise, we continue;
  2. If the current delta's parent is dead, we:
    1. Merge our sibling and our parent;
    2. Rebase our sibling's children (if any) to point to the merged snapshot;
    3. Remove our sibling's old file, as it is now empty;
    4. Change our sibling's path in the DB so that it points to the merged file;
  3. Delete the current delta that is being written to. It only contains changes on the disk that will be reverted as part of the process;
  4. Create a new delta on top of the snapshot that is being reverted to, so that we do not write directly into it and are able to return to it again later;
  5. Update the volume path to point to the newly created delta.
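
The steps above can be sketched as a toy in-memory model of one volume's backing chain. This is not CloudStack code: the `Layer` class, the `db_paths` mapping and all names are hypothetical, and no file is actually merged or deleted. Two assumptions are made: the merge in step 2 only applies when the dead parent has exactly one other child, and the merged file stops being dead because it now holds a live snapshot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layer:
    """One file in a volume's backing chain (hypothetical model)."""
    path: str
    backing: Optional[str]  # path of the backing file, None for the base
    dead: bool = False      # removed by the user but still on storage


def children_of(layers, path):
    return [layer for layer in layers.values() if layer.backing == path]


def revert_volume(layers, db_paths, vm_running, current, target_snapshot, new_delta):
    """Revert one volume; returns the new volume path.

    layers:   file path -> Layer, for every file on storage
    db_paths: snapshot name -> file path recorded in the DB
    current:  path of the delta the VM was writing to
    """
    if vm_running:  # step 1
        raise RuntimeError("VM must be stopped before reverting")

    parent = layers[current].backing
    if parent is not None and layers[parent].dead:  # step 2
        siblings = [l for l in children_of(layers, parent) if l.path != current]
        if len(siblings) == 1:  # assumption: exactly one sibling
            sibling = siblings[0]
            # 2.1 merge sibling into parent (only modelled, no I/O here)
            # 2.2 rebase the sibling's children onto the merged file
            for child in children_of(layers, sibling.path):
                child.backing = parent
            # 2.3 remove the sibling's old file
            del layers[sibling.path]
            # 2.4 point the sibling's DB entry at the merged file
            for name in list(db_paths):
                if db_paths[name] == sibling.path:
                    db_paths[name] = parent
            layers[parent].dead = False  # assumption: now a live snapshot

    del layers[current]  # step 3: drop the delta being written to
    # step 4: new delta on top of the snapshot we revert to
    layers[new_delta] = Layer(new_delta, backing=db_paths[target_snapshot])
    return new_delta  # step 5: the caller stores this as the volume path
```

Looking up `db_paths[target_snapshot]` only after step 2 matters in this model: if the target is the sibling that was just merged, its path has changed to the merged file.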

The proposed process will allow us to go back and forth between snapshots if need be. Furthermore, this process will be much faster than reverting a volume snapshot, as the bottleneck here is deleting the top delta that will no longer be used, which should be much faster than copying a volume snapshot from another storage and replacing the old volume.

The process done in step 2 was added to cover an edge case where dead snapshots would be left in the storage until the VM was expunged. Here's a simple example of why it's needed:

<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1715784150/specs/cloudstack/disk-only-vm-snapshot/Drawing_1_gsu9vr.png" style="width: 10%; display: block; margin-left: auto; margin-right: auto; height: auto;">

<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1715784167/specs/cloudstack/disk-only-vm-snapshot/Drawing_2_vmmjhl.png" style="width: 5%; display: block; margin-left: auto; margin-right: auto; height: auto;">

<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1715784171/specs/cloudstack/disk-only-vm-snapshot/Drawing_3_jlf7xx.png" alt="revert-snapshot-ex3" style="width: 5%; display: block; margin-left: auto; margin-right: auto; height: auto;">

2.3. VM Snapshot Deletion

In order to keep the snapshot tree consistent and with the fewest possible dead nodes, the snapshot deletion process will always try to manipulate the snapshot tree to remove any unneeded nodes while keeping the ones that are still needed, even if they were removed by the user. In these cases, they will be marked as deleted in the DB but will remain on the primary storage until they can be merged with another snapshot. The diagram below summarizes the snapshot deletion process, which will be repeated for all the VM's volumes:

<img src="https://res.cloudinary.com/sc-clouds/image/upload/v1715869488/specs/cloudstack/disk-only-vm-snapshot/vm_snapshot_deletion_6_iobpxf.png" alt="snapshot-deletion" style="width: 100%; height: auto;">

As this diagram has several branches, each branch will be explained separately:

The proposed deletion process leaves room for one edge case, which can lead to a dead node that would only be removed when the volume was deleted: if we revert to a snapshot that has one other child and then delete it, using the above algorithm, the deleted snapshot will end up only marked as removed in the DB. If we then revert to another snapshot, this will leave a dead node on the tree that would not be removed (the snapshot that was previously deleted). To solve this edge case, when this specific situation happens, we will do as explained in the snapshot reversion section and merge the dead node with its child.

2.4. Template Creation from Volume

The current process of creating a template from a volume does not need to be changed. We already convert the volume when creating a template, so the volume's backing chain will be merged when creating a template.
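
To illustrate why conversion merges the chain: by default `qemu-img convert` reads through the backing files of its input and writes a standalone image. The sketch below only assembles such a command line; the helper name and paths are hypothetical, and the exact invocation CloudStack uses may differ.

```python
def flatten_command(top_layer, output, fmt="qcow2"):
    """Build a qemu-img command that writes a standalone copy of a volume.

    `top_layer` is the topmost delta of the backing chain; the output
    image has no backing file. Paths are illustrative assumptions.
    """
    return ["qemu-img", "convert", "-O", fmt, top_layer, output]


if __name__ == "__main__":
    print(" ".join(flatten_command("/mnt/primary/delta.qcow2",
                                   "/mnt/secondary/template.qcow2")))
```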

slavkap commented 2 months ago

Hi @JoaoJandre, there is a similar functionality for VM snapshots without memory. It was introduced in this PR, and this PR allows it for NFS/Local storage. It doesn't support VM snapshots for stopped VMs, but I think that will be a small change. What I got from the libvirt docs and a few forums is that using the flag VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE is discouraged.

If flags includes VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE, then the libvirt will attempt to use guest agent to freeze and thaw all file systems in use within domain OS. However, if the guest agent is not present, an error is thrown. Moreover, this flag requires VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY to be passed as well. For better control and error recovery users should invoke virDomainFSFreeze manually before taking the snapshot and then virDomainFSThaw to restore the VM rather than using VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE.

Perhaps the use of virDomainFSFreeze/virDomainFSThaw could be driven by the value of the quiesceVm parameter and by the state of the VM (running/stopped).

JoaoJandre commented 2 months ago

Hello, @slavkap

I'm aware of the current functionality, but I was not aware that it was made to support NFS/Local storage. Regardless, I have listed (in the spec) a few other issues with it:

  1. It does not support VM snapshots for stopped VMs;
  2. The process is based on the volume snapshot, which is much slower as we take one snapshot at a time, instead of using one command for all the volumes;
  3. The VM is frozen (regardless of the user's orders) during the whole snapshot process, including the copy of the snapshots, which is a huge waste of time for the VM (this is made worse by the point above);
  4. The proposed implementation will not copy the snapshots, making the snapshot creation/reversion process much faster.

In any case, this feature will only be used for NFS/SMP/Local storage; for the other types of storage (such as RBD or iSCSI), the implementation introduced in #3724 will still be used.

Regarding the domain freeze/thaw, the quote you posted says "For better control and error recovery users should invoke virDomainFSFreeze manually before taking the snapshot and then virDomainFSThaw to restore the VM rather than using VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE.". I looked at the implementation of the freeze/thaw and there doesn't seem to be any error recovery attempt, so using it instead of the quiesce parameter does not seem any better. To be fair, I'm not sure what type of error recovery we could do; so again, I don't see a point in using the freeze/thaw.
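
For reference, one possible reading of "error recovery" in the libvirt note is simply guaranteeing a thaw even when the snapshot call fails. The sketch below assumes a domain object exposing `fsFreeze`, `fsThaw` and `snapshotCreateXML`, as the libvirt Python bindings do; the wrapper function itself is hypothetical and is not what either implementation discussed here does.

```python
def snapshot_with_manual_freeze(domain, snapshot_xml, flags, quiesce_vm):
    """Take a snapshot, freezing the guest filesystems only on request.

    The `finally` block thaws the guest whether or not the snapshot
    succeeded, so a failed snapshot does not leave the VM frozen.
    """
    frozen = False
    try:
        if quiesce_vm:
            domain.fsFreeze()
            frozen = True
        return domain.snapshotCreateXML(snapshot_xml, flags)
    finally:
        if frozen:
            domain.fsThaw()
```

Whether this is meaningfully better than passing VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE is exactly the open question in this thread; the sketch only shows what the manual variant could look like.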