apache / cloudstack

Apache CloudStack is an opensource Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

KVM disk-only based snapshot of volumes instead of taking VM's full snapshot and extracting disks #5124

Closed GutoVeronezi closed 2 years ago

GutoVeronezi commented 3 years ago
ISSUE TYPE
COMPONENT NAME
Volume Snapshot
CLOUDSTACK VERSION

4.16/main

CONFIGURATION
OS / ENVIRONMENT

KVM, file storage (NFS/Shared mountpoint)

SUMMARY

This spec changes the volume snapshot feature in KVM to a disk-only snapshot instead of taking a full snapshot and extracting disk, consequently changing the revert snapshot workflow.

Problem Description

To avoid repeating the same sentence several times, consider:


Create snapshot

Currently, ACS has a feature to take snapshots of volumes. For KVM, when snapshot backup is enabled, it takes a full snapshot of the VM, extracts the disk, and then deletes the snapshot; when snapshot backup is disabled, it takes a full snapshot and keeps it. The full snapshot is a long-running process that affects the VM's memory and disks. While it is running, the VM is frozen due to the memory snapshot; i.e., users are unable to use their VM for a long time if the amount of memory is large.

Current_Create_Snapshot_Workflow

Revert snapshot

When snapshot backup is disabled, the Revert to Snapshot process does not work. The script managesnapshot.sh has a bug: when the function revert_snapshot fails, it still returns 0, so the caller assumes that the command was successful.
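The class of bug described above (a failure reported as exit status 0) can be illustrated with a minimal Python sketch. This is not the code of managesnapshot.sh; `run_or_fail` is a hypothetical helper that only shows the idea of propagating a non-zero exit status to the caller instead of swallowing it.

```python
import subprocess
import sys
from typing import List


def run_or_fail(cmd: List[str]) -> None:
    """Run cmd and raise if it exits with a non-zero status.

    Hypothetical helper: the point is that a failed step must surface
    as a failure to the caller instead of being reported as success.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"command {cmd!r} failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )


if __name__ == "__main__":
    # A child process that exits with status 3 stands in for a failed revert.
    try:
        run_or_fail([sys.executable, "-c", "import sys; sys.exit(3)"])
    except RuntimeError as error:
        print(f"revert failed as expected: {error}")
```

With this shape, the caller cannot mistake a failed revert for a successful one, because the only way to return normally is for the command to exit with 0.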

Current_Revert_Snapshot_Workflow

Create template/volume from snapshot

When snapshot backup is disabled, creating a template and creating a volume from a snapshot both fail due to an inconsistent path.

current-create-template-from-snapshot-workflow

current-create-volume-from-snapshot-workflow

Delete snapshot

When deleting a snapshot, validations are applied only to snapshots that were backed up. Snapshots on primary storage are simply deleted.

current-delete-snapshot-workflow

Migrate volume (migrateVolume/migrateVirtualMachineWithVolumes)

When migrating a volume (migrateVolume/migrateVirtualMachineWithVolumes) with snapshot backup disabled, the snapshot metadata is held by qemu. The workflow to migrate a volume consists of copying the old volume to a new one and then deleting the old volume. Deleting the old volume also erases the snapshot metadata held by qemu.

current-migrate-volume-workflow

Proposed Change

This proposal affects a series of workflows due to the new create snapshot workflow:

Create snapshot

This proposal changes the volume snapshot process in KVM to take a disk-only snapshot by adding the flag --disk-only and the folder of the external snapshot to the command. To keep the current behavior of backing up to secondary storage and to keep everything in a single file, it will take a disk-only snapshot, copy the snapshot to a folder in the primary storage, and merge the snapshot and the branch file created during the process; if snapshot backup is enabled, it then backs up the snapshot to the secondary storage. This avoids creating and handling chains of snapshots.
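As a rough illustration of the snapshot and merge steps, the sketch below only builds the command lines and does not run them. It assumes the steps map onto `virsh snapshot-create-as` with `--disk-only` and `virsh blockcommit`; the issue does not state which commands the implementation uses, and the function names, the `--no-metadata` choice, and the example paths are all hypothetical.

```python
from typing import List


def disk_only_snapshot_cmd(domain: str, snapshot_name: str,
                           disk: str, overlay_path: str) -> List[str]:
    """Build a virsh command for an external, disk-only snapshot of one disk."""
    return [
        "virsh", "snapshot-create-as", domain, snapshot_name,
        "--disk-only",      # no memory state is saved, only the disk
        "--no-metadata",    # assumption: libvirt keeps no snapshot record
        "--diskspec", f"{disk},snapshot=external,file={overlay_path}",
    ]


def merge_overlay_cmd(domain: str, disk: str) -> List[str]:
    """Build a virsh command that merges the active overlay into its base file."""
    return ["virsh", "blockcommit", domain, disk, "--active", "--pivot"]


if __name__ == "__main__":
    # Hypothetical domain, disk, and path, for illustration only.
    print(" ".join(disk_only_snapshot_cmd(
        "i-2-10-VM", "snap-1", "vda", "/mnt/primary/snap-1.delta")))
    print(" ".join(merge_overlay_cmd("i-2-10-VM", "vda")))
```

Between the two commands, the now read-only base file would be copied to the snapshot folder on primary storage; after the merge, the VM runs on a single file again, which matches the goal of not keeping chains of snapshots.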

Proposed Create Snapshot Workflow

Note: The current implementation does not handle diff snapshots for KVM. Therefore, this proposal and implementation will not address this issue.

Revert snapshot

This proposal intends to replace the VM base file with the snapshot from the primary storage if it exists, else with the snapshot from the secondary storage.
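The selection rule above (primary storage first, secondary storage as fallback) can be sketched as follows. `pick_revert_source` and its parameters are hypothetical names for illustration, not part of the CloudStack code base.

```python
from pathlib import Path
from typing import Optional


def pick_revert_source(primary_copy: Path, secondary_copy: Path) -> Optional[Path]:
    """Return the snapshot file to revert from, preferring primary storage.

    Returns None when the snapshot exists in neither location, so the
    caller can fail explicitly instead of reverting to nothing.
    """
    if primary_copy.is_file():
        return primary_copy
    if secondary_copy.is_file():
        return secondary_copy
    return None
```

The VM base file would then be replaced with the returned file; a `None` result should be treated as an error by the caller.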

Proposed_Revert_Snapshot_Workflow

Create template/volume from snapshot

If the snapshot does not already exist on secondary storage, this proposal intends to copy it from primary storage to secondary storage, send a command to the SSVM to create the resource, and, if snapshot backup is disabled, delete the snapshot from secondary storage.
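The order of operations in this workflow can be written down as a small planning function. This is a sketch with hypothetical step names; it only models the decisions described above and performs no storage operations.

```python
from typing import List


def plan_create_from_snapshot(exists_on_secondary: bool,
                              backup_enabled: bool) -> List[str]:
    """List the steps to create a template/volume from a snapshot, in order."""
    steps: List[str] = []
    if not exists_on_secondary:
        # The snapshot lives only on primary storage, so stage it first.
        steps.append("copy snapshot from primary to secondary")
    steps.append("send create command to SSVM")
    if not backup_enabled:
        # With backup disabled, the copy on secondary storage is temporary.
        steps.append("delete snapshot from secondary")
    return steps


if __name__ == "__main__":
    for step in plan_create_from_snapshot(exists_on_secondary=False,
                                          backup_enabled=False):
        print(step)
```

With snapshot backup enabled and the snapshot already on secondary storage, the plan collapses to the single SSVM step, which is the existing workflow.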

proposed-create-template-from-snapshot-workflow

proposed-create-volume-from-snapshot-workflow

Delete snapshot

This proposal intends to validate snapshots on both primary and secondary storage and to unify the deletion process, since the only differences are the type of the entry in the database and the path to delete.

proposed-delete-snapshot-workflow-2021-07-22

Migrate volume (migrateVolume/migrateVirtualMachineWithVolumes)

This proposal intends to check whether the volume has any snapshots before migrating it. If any exist, it will throw an exception warning the operator that there are snapshots on primary storage.
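A minimal sketch of such a pre-migration guard is shown below. The exception class, function name, and message are hypothetical; the real implementation would query the database for the volume's snapshots on primary storage.

```python
from typing import Sequence


class SnapshotsOnPrimaryError(Exception):
    """Raised when a volume still has snapshots on primary storage."""


def ensure_no_primary_snapshots(volume_id: str,
                                snapshots_on_primary: Sequence[str]) -> None:
    """Refuse to migrate a volume that has snapshots on primary storage."""
    if snapshots_on_primary:
        listed = ", ".join(snapshots_on_primary)
        raise SnapshotsOnPrimaryError(
            f"Volume {volume_id} has snapshots on primary storage "
            f"({listed}); migrating it would lose them."
        )
```

Calling the guard before the copy step means the migration is rejected up front, rather than discovering the lost snapshots after the old volume has been deleted.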

proposed-migrate-volume-workflow

Note: As the current workflows don't work correctly with snapshot backup disabled, this proposal and implementation will not address migrating snapshots with the volume. Also, we do not have any use case that justifies spending effort on it.

Work Items

Future works

The current snapshot process doesn't support incremental snapshots, as it backs up all the data into a single file and does not keep a consistent chain of backing files. As future work, we could implement a consistent chain of backing files and back up only the difference captured by the snapshot.

The current volume migration process doesn't support migrating snapshots along with the volume when snapshot backup is disabled. As future work, we could implement the migration of the snapshots along with the volumes in that case.

We will reuse the existing create volume/template from snapshot workflows to implement the execution flow for the cases where snapshots are kept in primary storage only. Therefore, we adapted the current workflow as follows: we copy the snapshot to the secondary storage and send the command to the SSVM; the SSVM copies the snapshot as a volume/template and then removes the snapshot from the secondary storage. Even though this is not the optimal way to implement such processes, we do it to avoid large code changes; as future work, we will implement a direct copy (from primary storage to volume/template).

weizhouapache commented 3 years ago

@GutoVeronezi good design !

GabrielBrascher commented 3 years ago

@GutoVeronezi thanks for sharing this design; looks good. This execution flow should be improved indeed. I can't wait to see the implementation.

DaanHoogland commented 3 years ago

@GutoVeronezi I haven't studied your design yet, but it looks nice. Tradition dictates that we do designs like this on the wiki, and I (and several others) can give you access. (I'm not saying that we shouldn't change our ways, just saying what our ways are.) There is a page per planned release there, e.g. 4.16, that has subpages per enhancement.

The decision to add a design page is kind of arbitrary, but since you have already given us this elaborate design, I thought I'd point you to it.

GutoVeronezi commented 3 years ago

@DaanHoogland Thanks for the tip.

I can add the design to the wiki if it makes sense, but I would rather keep things on GitHub, as we already manage PRs and issues here. Also, I think that interactions through GitHub are easier and more reliable.

DaanHoogland commented 3 years ago

no problem @GutoVeronezi

nvazquez commented 3 years ago

Hi @GutoVeronezi, have you started any work/PR on this enhancement?

weizhouapache commented 3 years ago

@GutoVeronezi is this similar to Storage-based Snapshots for KVM VMs #3724?

GutoVeronezi commented 3 years ago

Hi @nvazquez, yes, I have already started it. Actually, I'm halfway through the MVP; as soon as it is ready, I will open a PR.

GutoVeronezi commented 3 years ago

@weizhouapache I believe the final goal may be similar, but the approaches are different. #3724 works by freezing and thawing the VM, while this proposal will work with external disk-only snapshots (consequently changing the whole snapshot workflow).

Maybe in a week or two from now we will be able to discuss with some code 🙂.

andrijapanicsb commented 12 months ago

@GutoVeronezi I'm troubleshooting an env where the VM snapshot with the --disk-only step was executed, but I can't find HOW the image is copied over to the Primary Storage (I can see the ACS agent is constantly writing, but I cannot see any qemu-img, "cp", or other command that copies the file to Primary Storage). Please refer to the screenshot below:

image

My questions: who performs the step of "copy a VM base file to a folder on primary storage" and the next one, and how is it done? I need to be able to trace the exact steps for troubleshooting purposes (previously I could just grep for "qemu-img convert" and know what was going on, but after these changes I need help, please).

Greatly appreciated

GutoVeronezi commented 12 months ago

@andrijapanicsb, it was being done via Files.copy; the message Snapshot [%s] took [%s] seconds to finish. was printed before the copy process and the message Copied %s snapshot from [%s] to [%s]. was printed after it; therefore, you can track the process with those messages.
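For anyone tracing this in the agent log, a small Python sketch can pick out the two messages quoted above. The regular expressions are derived from those message templates only; the function name, the shape of the returned tuples, and the sample log lines are hypothetical, and the surrounding log format (timestamps, levels) is not assumed.

```python
import re
from typing import Iterable, List, Tuple

# Patterns derived from the two message templates quoted in this comment.
FINISHED = re.compile(
    r"Snapshot \[(?P<name>[^\]]*)\] took \[(?P<seconds>[^\]]*)\] seconds to finish\.")
COPIED = re.compile(
    r"Copied (?P<what>.+?) snapshot from \[(?P<src>[^\]]*)\] to \[(?P<dst>[^\]]*)\]\.")


def trace_snapshot_copy(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return ("finished", name, seconds) and ("copied", src, dst) events in order."""
    events: List[Tuple[str, str, str]] = []
    for line in lines:
        match = FINISHED.search(line)
        if match:
            events.append(("finished", match["name"], match["seconds"]))
            continue
        match = COPIED.search(line)
        if match:
            events.append(("copied", match["src"], match["dst"]))
    return events


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample = [
        "DEBUG Snapshot [snap-1] took [12] seconds to finish.",
        "DEBUG Copied volume snapshot from [/mnt/a/snap-1] to [/mnt/b/snap-1].",
    ]
    for event in trace_snapshot_copy(sample):
        print(event)
```

The gap between a "finished" event and the following "copied" event corresponds to the copy step that is otherwise invisible as a separate process.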

With PR #8041, the process was changed to use qemu-img convert and the logs were also improved; therefore, you will be able to track it better. Please, refer to PR #8041.

andrijapanicsb commented 12 months ago

Thx @GutoVeronezi 👍