NAUbackup / VmBackup

XenServer simple backup script

Crash when medium is running out of space? #78

Open VARGA-Peter opened 6 years ago

VARGA-Peter commented 6 years ago

While running into #77, I noticed what I think are these issues:

  1. Could there be an uncaught program crash when the backup medium runs out of space?
  2. I also noticed that the snapshots remained on the host.
  3. No email was sent.

Maybe you can find time to check it out.

Thank you

NAUbackup commented 6 years ago

A precheck is not a bad idea. When things crash, there are often remnants, and if the script crashes, of course no email will be sent.

VARGA-Peter commented 6 years ago

Below is the email I received after I cleaned up the backup medium and restarted the backup. You can see that the failure for APA-SBS201WX was reported correctly [2018/05/26 04:46:40], because there was not enough space, but then APA-TS230W became a zombie and the script crashed.

I don't know Python, but in my C# and C++ code I wrap all functions in a try/catch block that sends an email saying something went wrong. I do this because I previously had the same problem of not being notified in such situations.

2018/05/26 04:00:01,vmbackup.py,APA92,begin
2018/05/26 04:00:01,vm-export,APA92,begin,APA-APP231LY
2018/05/26 04:02:49,vm-export,APA92,end,SUCCESS APA-APP231LY,elapse:2 size:6G
2018/05/26 04:02:49,vm-export,APA92,begin,APA-SBS201WX
2018/05/26 04:46:40,vm-export,APA92,end,VM-EXPORT-FAIL APA-SBS201WX
                                         ^ this is OK and correct
2018/05/26 04:46:40,vm-export,APA92,begin,APA-TS230W
        and now we have undefined behaviour
2018/05/26 13:00:01,vmbackup.py,APA92,begin
2018/05/26 13:00:01,vm-export,APA92,begin,APA-APP231LY
2018/05/26 13:02:48,vm-export,APA92,end,SUCCESS APA-APP231LY,elapse:2 size:6G
2018/05/26 13:02:48,vm-export,APA92,begin,APA-SBS201WX
2018/05/26 13:56:45,vm-export,APA92,end,SUCCESS APA-SBS201WX,elapse:53 size:183G
2018/05/26 13:56:45,vm-export,APA92,begin,APA-TS230W
2018/05/26 14:26:15,vm-export,APA92,end,SUCCESS APA-TS230W,elapse:29 size:92G
2018/05/26 14:26:15,vm-export,APA92,begin,APA-BKP102WX
2018/05/26 14:33:38,vm-export,APA92,end,SUCCESS APA-BKP102WX,elapse:7 size:19G
2018/05/26 14:33:38,vmbackup.py,APA92,end,SUCCESS,S:4 W:0 E:0

NAUbackup commented 6 years ago

Yes, Python has a try/except construct that could be used for that purpose.
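As a rough illustration of the try/except idea, here is a minimal sketch of a top-level wrapper that reports an unhandled exception before exiting. It is not taken from vmbackup.py: `run_with_crash_report`, `main`, and `notify` are hypothetical names, and `notify` stands in for whatever email routine the script would actually use.

```python
import sys
import traceback


def run_with_crash_report(main, notify):
    """Run main(); on an unhandled exception, report it and return 1.

    `main` and `notify` are hypothetical callables supplied by the caller.
    `notify(subject, body)` is expected to deliver the report, e.g. by email.
    """
    try:
        main()
        return 0
    except Exception:
        # Capture the full traceback so the report shows where it crashed.
        report = traceback.format_exc()
        try:
            notify("vmbackup crashed", report)
        except Exception:
            # The notification itself failed; fall back to stderr.
            sys.stderr.write(report)
        return 1
```

It could be called as `sys.exit(run_with_crash_report(main, send_email))`, with `send_email` being an assumed helper. Note that `except Exception` does not catch `KeyboardInterrupt` or `SystemExit`, and nothing in Python can react to the process being killed with SIGKILL, so a check at the next start would still be needed for those cases.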

Note that cleanup does take place when an old attempted backup is found: process_backup_dir(tmp_vm_backup_dir) and its embedded function get_last_backup_dir_that_failed handle this, so the remnants do get cleaned up, but not until the next run. The current issue is that if the script fails or is killed by someone, it may not get the chance to clean up at that point, hence the check for failed backups when the script is launched afresh. In part, it's a matter of philosophy of approach. This check is done for both VDI and full VM backups.

NAUbackup commented 6 years ago

As to a precheck, what might be useful: a warning if the backup storage is over 90% full? If the backup fails, the nature of the failure (such as a lack of disk space) would have to be understood. I don't think the script would be clever enough to trap the condition that caused the failure, and if something happens with one VM, it is typically desirable to go ahead with the others, assuming the error only affected that one VM.
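A percentage-based warning like this could be sketched as follows. This is only an illustration, assuming Python 3.3+ where `shutil.disk_usage` is available (on Python 2, `os.statvfs` would be the rough equivalent); `percent_full` and `space_warning` are hypothetical names, not functions from vmbackup.py.

```python
import shutil


def percent_full(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def space_warning(path, threshold=90.0):
    """Return a warning string if usage exceeds `threshold`, else None."""
    used = percent_full(path)
    if used > threshold:
        return "WARNING: %s is %.1f%% full" % (path, used)
    return None
```

The warning is returned rather than printed so that the caller can decide whether to log it, include it in the status email, or skip the run.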

VARGA-Peter commented 6 years ago

I think 90% is just a number, and in my situation it wouldn't help, as you can see from the above email. SBS alone accounts for about 60% of the total space needed by all VMs. As I mentioned before, adding the possibility to remove the oldest version FIRST would solve this problem.

Yes, it was my mistake not to consider this when I created the iSCSI storage for the backup. The point is that the NAS is now almost full, and reorganizing it is not trivial because I would have to change all storage assignments.
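To make the "remove the oldest version first" suggestion concrete, here is a minimal sketch. It assumes each VM has its own directory containing one subdirectory per backup, and that those subdirectory names sort chronologically; both are assumptions for illustration, and `prune_before_backup` is a hypothetical name rather than existing script code.

```python
import os
import shutil


def prune_before_backup(vm_dir, max_backups):
    """Delete the oldest backups so a new one fits within max_backups.

    Assumes backup subdirectory names sort in chronological order.
    Returns the list of removed directory names.
    """
    backups = sorted(
        name for name in os.listdir(vm_dir)
        if os.path.isdir(os.path.join(vm_dir, name))
    )
    # Leave room for the backup that is about to be written.
    excess = len(backups) - (max_backups - 1)
    removed = []
    for name in backups[:max(excess, 0)]:
        shutil.rmtree(os.path.join(vm_dir, name))
        removed.append(name)
    return removed
```

The trade-off is that pruning before the export means a failed export leaves one fewer good backup than before, and with `max_backups=1` it would leave none.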

NAUbackup commented 6 years ago

We could maybe do this: look for the largest existing backup for a VM and check that at least that much space (+10% or so) is still available before starting a backup series for that VM. This would have to handle both NFS and CIFS storage.
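A sketch of that check might look like the following. It assumes one subdirectory per backup under a per-VM directory, and that the NFS or CIFS share is already mounted so that free space can be queried through the normal filesystem call; `dir_size`, `required_bytes`, and `has_room` are hypothetical names, and `shutil.disk_usage` requires Python 3.3+.

```python
import os
import shutil


def dir_size(path):
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total


def required_bytes(vm_dir, margin=0.10):
    """Size of the largest existing backup under vm_dir, plus a margin."""
    sizes = [
        dir_size(os.path.join(vm_dir, name))
        for name in os.listdir(vm_dir)
        if os.path.isdir(os.path.join(vm_dir, name))
    ]
    largest = max(sizes) if sizes else 0
    return int(largest * (1 + margin))


def has_room(vm_dir, margin=0.10):
    """True if free space on vm_dir's filesystem covers required_bytes()."""
    free = shutil.disk_usage(vm_dir).free
    return free >= required_bytes(vm_dir, margin)
```

With no previous backup the requirement is zero, so a first backup of a VM would always pass this check; a separate estimate would be needed for that case.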