
Checkpoint file is not always cleaned up on VM Action #6729

Opened by OpenNebulaSupport 2 months ago. Status: Open.

OpenNebulaSupport commented 2 months ago

Description

When a virtual machine is suspended or stopped and later resumed, or after some migrations, the checkpoint file may not be cleaned up properly, leading to excess disk usage in the system datastore until the VM is terminated, which removes any leftover checkpoint files.

To Reproduce

1. Suspend a VM, then resume it.
2. Observe /var/lib/one/datastores/SYSTEM_DS/VM_ID/checkpoint*.
3. Suspend and resume the VM again to create another checkpoint file.

The issue may also occur after migrations between hypervisors that use the checkpoint. A minimal reproduction sketch follows below.
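For reference, a shell sketch of the reproduction above, assuming a default local setup with system datastore 0 and a VM with ID 100 (both placeholders):

# Suspend the VM; OpenNebula saves its memory image as a checkpoint file
onevm suspend 100

# Resume it; the VM returns to RUNNING, but the old checkpoint may remain
onevm resume 100

# Inspect the VM's directory on the system datastore for checkpoint files
ls -lh /var/lib/one/datastores/0/100/checkpoint*

# Repeating the suspend/resume cycle may leave a second checkpoint file behind
onevm suspend 100
onevm resume 100
ls -lh /var/lib/one/datastores/0/100/checkpoint*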

Expected behavior

The checkpoint file should be cleaned up properly once it is no longer required.


gsperry2011 commented 1 month ago

Hello.

I added a lot of logs to a support ticket, which I believe is what prompted this issue (support linked me here). I did a bit more investigating in my environment and found something interesting that might be of value to you.

I have plenty of space in my system datastore, /var/lib/one/datastores/0, so the No space left on device errors I'm getting seem to refer to the checkpoint file itself rather than the filesystem, which I thought might be a good clue. I had expected the checkpoint files to be able to grow until my mount point ran out of free space, but their size seems limited somehow.

Mon Sep 9 12:40:05 2024 [Z0][VMM][I]: Successfully execute network driver operation: post.
Mon Sep 9 12:40:05 2024 [Z0][VM][I]: New LCM state is RUNNING
Wed Sep 18 12:12:29 2024 [Z0][VM][I]: New LCM state is SAVE_MIGRATE
Wed Sep 18 12:17:52 2024 [Z0][VMM][I]: Command execution fail (exit code: 1): cat << 'EOT' | /var/tmp/one/vmm/kvm/save '725e613d-5db8-42ff-b5f7-4b9f69c2601a' '/var/lib/one//datastores/0/222/checkpoint' 'opennebulahost01.domain.tld' 222 opennebulahost01.domain.tld
Wed Sep 18 12:17:52 2024 [Z0][VMM][E]: save: Command "virsh --connect qemu:///system save 725e613d-5db8-42ff-b5f7-4b9f69c2601a /var/lib/one//datastores/0/222/checkpoint" failed: error: Failed to save domain '725e613d-5db8-42ff-b5f7-4b9f69c2601a' to /var/lib/one//datastores/0/222/checkpoint error: operation failed: /usr/libexec/libvirt_iohelper: failure with /var/lib/one/datastores/0/222/checkpoint: unable to fsync /var/lib/one/datastores/0/222/checkpoint: No space left on device Could not save 725e613d-5db8-42ff-b5f7-4b9f69c2601a to /var/lib/one//datastores/0/222/checkpoint
Wed Sep 18 12:17:52 2024 [Z0][VMM][I]: ExitCode: 1
Wed Sep 18 12:17:52 2024 [Z0][VMM][I]: Failed to execute virtualization driver operation: save.
Wed Sep 18 12:17:52 2024 [Z0][VMM][E]: SAVE: ERROR: save: Command "virsh --connect qemu:///system save 725e613d-5db8-42ff-b5f7-4b9f69c2601a /var/lib/one//datastores/0/222/checkpoint" failed: error: Failed to save domain '725e613d-5db8-42ff-b5f7-4b9f69c2601a' to /var/lib/one//datastores/0/222/checkpoint error: operation failed: /usr/libexec/libvirt_iohelper: failure with /var/lib/one/datastores/0/222/checkpoint: unable to fsync /var/lib/one/datastores/0/222/checkpoint: No space left on device Could not save 725e613d-5db8-42ff-b5f7-4b9f69c2601a to /var/lib/one//datastores/0/222/checkpoint ExitCode: 1
Wed Sep 18 12:17:52 2024 [Z0][VM][I]: New LCM state is RUNNING
Wed Sep 18 12:17:52 2024 [Z0][LCM][I]: Fail to save VM state while migrating. Assuming that the VM is still RUNNING.

I originally theorized that the checkpoint would grow to the size of the VM's assigned memory, but this checkpoint file stopped ~3 GB short of the VM's RAM, so that theory appears incorrect.
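As a rough way to test that theory, one can compare the on-disk checkpoint size against the VM's allocated memory; a sketch for VM 222 from the log above, with the caveat that a libvirt save image holds the guest memory contents plus metadata, so it need not match allocated RAM exactly:

# Size of the retained checkpoint file on the system datastore
ls -lh /var/lib/one/datastores/0/222/checkpoint

# Allocated memory as reported by OpenNebula (MEMORY attribute)
onevm show 222 | grep -i MEMORY

# Allocated memory as reported by libvirt on the hypervisor
virsh --connect qemu:///system dominfo 725e613d-5db8-42ff-b5f7-4b9f69c2601a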

For the time being, if our system datastore runs out of space due to retained checkpoint files, we simply terminate and re-instantiate the VM, which removes the checkpoints.
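Until the cleanup is fixed, a sketch for spotting leftover checkpoints without terminating VMs; the datastore path is an assumption for a default local installation:

# List checkpoint files per VM directory on system datastore 0,
# to identify which VMs are retaining old save images
find /var/lib/one/datastores/0 -maxdepth 2 -name 'checkpoint*' -exec ls -lh {} \;

Whether a given file is safe to delete depends on the VM's current state, so this is only meant to identify affected VMs, not to automate cleanup.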