Corsinvest / cv4pve-autosnap

Automatic snapshot tool for Proxmox VE
https://www.corsinvest.it/cv4pve
GNU General Public License v3.0

when zfs dataset is busy the code leaves a vm or ct in lock-delete mode. #27

Closed andrewbauman closed 4 years ago

andrewbauman commented 5 years ago

Issue with version 1.2.0: when a ZFS sync takes longer than the snapshot interval, ZFS blocks deletion of snapshots. cv4pve-autosnap then leaves all the containers in locked mode, which prevents taking new snapshots.

We want to keep taking new snapshots, so the code needs to detect a failure caused by a busy dataset and unlock the VM after posting the failed delete task.

andrewbauman commented 5 years ago

Perhaps this is best addressed by a hook script?

andrewbauman commented 5 years ago

I upgraded to version 1.5.0. Confirmed that the new code now checks whether the VM is locked before running commands and posts a single failed task in the log. Confirmed that we still leave the VM in the snapshot-delete lock state and cannot automatically continue making snapshots. Would it be wise to test for a busy ZFS dataset and, in that case, only take a snapshot? I realize that if deletion stops working we may fill the storage quickly.

franklupo commented 5 years ago

Why not use hook script?

GusevVictor commented 5 years ago

> Why not use hook script?

Please post an example of this if you can. Thanks.

franklupo commented 5 years ago

The hook script (script-hook.bat or script-hook.sh) is passed the following environment variables:

CV4PVE_AUTOSNAP_PHASE contains one of: snap-job-start, snap-job-end, snap-create-pre, snap-create-post, snap-create-abort, snap-remove-pre, snap-remove-post, snap-remove-abort

CV4PVE_AUTOSNAP_VMID
CV4PVE_AUTOSNAP_VMTECHNOLOGY
CV4PVE_AUTOSNAP_LABEL
CV4PVE_AUTOSNAP_KEEP
CV4PVE_AUTOSNAP_VMSTATE
CV4PVE_AUTOSNAP_SNAP_NAME

You can intercept the phase and execute your command.

Best regards
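
For readers asking for an example, here is a minimal sketch of a hook script. It assumes only the variable names and phase values listed above; the function name handle_phase and the messages it prints are illustrative and not part of cv4pve-autosnap. Where the tool sends the hook's output is not stated in this thread, so redirect it as needed.

```shell
#!/bin/sh
# Sketch of a script-hook.sh. Only the CV4PVE_AUTOSNAP_* variable names and
# the phase values come from the thread; handle_phase and its messages are
# hypothetical.

handle_phase() {
    # $1 = phase, $2 = VM/CT id, $3 = snapshot name
    case "$1" in
        snap-job-start|snap-job-end)
            echo "job event: $1" ;;
        snap-create-pre|snap-create-post)
            echo "create event: $1 vmid=$2 snap=$3" ;;
        snap-remove-pre|snap-remove-post)
            echo "remove event: $1 vmid=$2 snap=$3" ;;
        snap-create-abort|snap-remove-abort)
            echo "ABORT: $1 vmid=$2 snap=$3" ;;
        *)
            echo "unknown phase: ${1:-none}" ;;
    esac
}

# cv4pve-autosnap sets these variables before calling the hook; the empty
# defaults keep the script harmless when it is run by hand.
handle_phase "${CV4PVE_AUTOSNAP_PHASE:-}" \
             "${CV4PVE_AUTOSNAP_VMID:-}" \
             "${CV4PVE_AUTOSNAP_SNAP_NAME:-}"
```

Replace the echo lines with whatever commands each phase should trigger.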

andrewbauman commented 5 years ago

Thanks for the reply. I have a working knowledge of userland zfs commands and do some rudimentary hacking of bash scripts, but I am not familiar with C#. So my question is: if I run a bash command in the hook script (which, to my understanding, cv4pve runs after setting the variables), will cv4pve then stop running and return an error if the script does not return 0 for success?

franklupo commented 5 years ago

At this time the result of the script is not evaluated.

alekseyp commented 5 years ago

I can confirm the same behavior/problem on v1.2-1.5. Right now I have a separate script that cleans those locks up after a day or so, to be safe and avoid unlocking a VM while the process is still running.
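
A cleanup script of this kind could be sketched as follows. It assumes the guest configuration printed by `qm config` or `pct config` carries a `lock:` line while a guest is locked. The helper names get_lock and unlock_command, and the qemu/lxc labels, are made up for this example.

```shell
# Print the value of the "lock:" line from guest config text on stdin,
# e.g. the output of `qm config <vmid>` or `pct config <vmid>`.
# Prints nothing when the guest is not locked.
get_lock() {
    awk -F': *' '$1 == "lock" { print $2; exit }'
}

# Print the unlock command for a guest; $1 = qemu|lxc, $2 = guest id.
# The command is only printed here, never executed.
unlock_command() {
    case "$1" in
        qemu) echo "qm unlock $2" ;;
        lxc)  echo "pct unlock $2" ;;
        *)    return 1 ;;
    esac
}
```

On the host, something like `[ "$(pct config 105 | get_lock)" = "snapshot-delete" ] && pct unlock 105` would then clear a stale lock. As noted above, run it only once you are sure no delete task is still in progress.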

franklupo commented 5 years ago

Hi, currently the timeout is 30 seconds. Do you want the ability to change it?

best regards

andrewbauman commented 5 years ago

Hello all, I have done more thinking about this issue. The problem is that the VM or CT is left in "locked - snapshot delete" when the delete snapshot command fails because the ZFS dataset is busy. Then bad things happen: we no longer take snapshots because the Proxmox API says the VM is locked. ZFS itself would be fine taking new snapshots, just not deleting them, because zfs send still needs the existing snapshots. I have set up cv4pve-autosnap in a container, so I would need to use ssh to run commands as root on the host, which makes this harder. I think ideally the cv4pve code should trap this particular error and run an unlock so that the next snapshot can run. That way we would still see the errors, but they would not stop new snapshots. Possibly an adjustable timeout would be sufficient.

franklupo commented 5 years ago

Hi, we have several installations with ZFS but have never encountered this problem. Could you describe what happens in more detail?

Best regards

cheechmarino commented 5 years ago

Hi, we have a 4-node cluster with ZFS on each node. For snapshot creation we use the latest cv4pve-autosnap version. I can confirm that snapshot creation works without any issues.

andrewbauman commented 5 years ago

I have one Proxmox box onsite and one offsite. I use a VPN to connect the two boxes, then zrep to sync the data store. zrep operates at the hypervisor command line. cv4pve works fine as long as I don't have a long-running sync. That happens when I hit a problem and fall behind with the offsite box; a single sync can then run for up to 3 days. In the past I used zfs-auto-snapshot, and that worked well because it had nothing to do with Proxmox. However, I liked the idea of having snapshots available in the GUI. @cheechmarino what are you using for syncing, and are you syncing across a slow WAN?

andrewbauman commented 5 years ago

What happens is this: zfs send runs for hours on end. cv4pve tries to delete snapshots and is denied with a dataset busy error. cv4pve leaves the VM in a locked state. The next cv4pve snapshot fails because of the lock. Startup after a power-fail shutdown does not happen because the lock is still on.

alekseyp commented 5 years ago

> startup after power fail shutdown does not happen because lock is still on.

Yup, exactly the same issue. Some of my VMs are 1TB+.

I think the conflict is between Proxmox's storage replication and cv4pve-autosnap.

When storage replication is running, cv4pve-autosnap will fail and leave the VM in a locked state. At best that means no more backups; at worst it will prevent the VM from booting after a host restart.

I'll try to do more tests. For now, I have disabled cv4pve-autosnap.

franklupo commented 5 years ago

cv4pve-autosnap executes the same command that the Proxmox VE web GUI uses. If you run it via the web GUI you have the same problem. I think the problem is ZFS. In my experience, remote replication (send/receive) has always caused problems regardless of the tool. You should also look at the structure of the ZFS snapshots with 'zfs list -t snapshot'.

andrewbauman commented 5 years ago

I agree about the problems with replication, especially remote. I was hoping this conversation would result in a way for cv4pve-autosnap to run pct unlock $ctID (or qm unlock for a VM) if the error contained "dataset busy". The benefit would be that new snapshots continue and containers would start after a restart.
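
The decision described here can be sketched outside of cv4pve-autosnap as a small shell check. This is an illustration only: is_dataset_busy and after_delete_failure are hypothetical helpers, the matched wording is an assumption based on the "dataset busy" text quoted in this thread, and how the error text reaches the script is left open.

```shell
# Return 0 when the given error text looks like a ZFS busy-dataset failure.
# The pattern is an assumption; it accepts both "dataset busy" and
# "dataset is busy", ignoring case.
is_dataset_busy() {
    printf '%s\n' "$1" | grep -qiE 'dataset( is)? busy'
}

# Decide what to do after a failed snapshot delete.
# $1 = error text; prints "unlock" or "leave".
after_delete_failure() {
    if is_dataset_busy "$1"; then
        echo unlock
    else
        echo leave
    fi
}
```

When the result is "unlock", the follow-up on the host would be `pct unlock <id>` for a container or `qm unlock <id>` for a VM. Any other failure is left alone so that an unexpected error still blocks the guest visibly.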

franklupo commented 5 years ago

Hi, if I understand correctly, you want a way to unlock the VM/CT before taking the snapshot?

alekseyp commented 4 years ago

> cv4pve-autosnap executes the same command that the Proxmox VE web GUI uses. If you run it via the web GUI you have the same problem

Correct, if we try to delete a snapshot via the UI while ZFS is busy (running replication, for example), it will error out and leave the guest in the locked state.

The difference is that in the UI I will see the issue and unlock it right away. With an automated script, I have no way of knowing that the guest is now locked.

I think the best approach is to test whether the dataset is busy before trying to delete an old snapshot.
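
One way to approximate such a pre-check, sketched under assumptions: treat a dataset as busy while a `zfs send` process that names one of its snapshots is running. send_in_progress is a hypothetical helper, not a ZFS or Proxmox command, and it reads the process list from stdin so it can be fed from `pgrep -af 'zfs send'`.

```shell
# Succeed if the process list on stdin contains a `zfs send` command line
# that names a snapshot of dataset $1 (i.e. contains "<dataset>@").
# Plain substring matching is used, so no regex escaping is needed.
send_in_progress() {
    awk -v ds="$1@" '
        index($0, "zfs send") && index($0, ds) { found = 1 }
        END { exit !found }
    '
}
```

For example, `pgrep -af 'zfs send' | send_in_progress rpool/data/vm-100-disk-0` could gate the delete step, with rpool/data/vm-100-disk-0 standing in for your own dataset name. This heuristic misses recursive sends started on a parent dataset, so it is a starting point and not a complete busy test.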

andrewbauman commented 4 years ago

@franklupo

> if I understand correctly do you want a way to unlock VM/CT before taking the snapshot?

Well, maybe. But that doesn't seem like a good way, because if the VM does not have a lock then the snapshot can run anyway. Also, if the VM is left in a locked state it would not start when restarting after an update or power outage.

@alekseyp

> I think the best approach is to test and see if dataset is busy or not before trying to delete an old snapshot.

I agree, and now I see the trouble: it's the Proxmox API command that is leaving the VM in a locked state. Maybe this is a bug in the Proxmox API? Is there an API command that would detect the ZFS lock state? Is there an API command that can run arbitrary code as root? I doubt it. Maybe cv4pve can detect the error on exit and run pct unlock or qm unlock with $vmID, but we were told that the current code does not check for errors.

andrewbauman commented 4 years ago

@franklupo If we run the command without --keep, will Proxmox make a snapshot without deleting old snapshots?

franklupo commented 4 years ago

Hi, --keep is required.

Best regards