JTok / unraid.vmbackup

a plugin for backing up VMs in unRAID including vdisks and configuration files.

Plugin causing unraid unable to shutdown, startup errors #18

Open KptnKMan opened 4 years ago

KptnKMan commented 4 years ago

This issue is to officially track an issue encountered and reported in the official thread for this plugin, in the official unraid forums.

I've been using this plugin to back up my VMs for a couple of weeks now, but unfortunately I've found that it is the cause of my server being unable to shut down.

My unRAID was unable to shut down and would freeze, forcing me to hard-kill the system, which causes a parity check every time. I do not want to do this to an otherwise stable system.

Rolling back from 6.9.0-b1 to 6.8.3 didn't solve it. Running in safe mode showed that everything worked, but I couldn't start my VMs due to the Unassigned Devices plugin. Uninstalling the VM Backup plugin solved the issue and removed the error at startup. Something in the VM Backup plugin is breaking the hypervisor and messing with my bonded network connection.

From what I can tell, first there's the "failed to connect to the hypervisor" error: "error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory" image
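For anyone who wants to check the same condition from a terminal, here is a minimal sketch. It assumes the standard libvirt socket path `/var/run/libvirt/libvirt-sock`, which is what the error message appears to reference. The helper name `socket_present` is made up for illustration and is not part of the plugin or of unRAID.

```shell
#!/bin/bash
# Hypothetical helper (not part of the plugin): succeeds only if the
# given path exists and is a Unix socket.
socket_present() {
  local sock_path="$1"
  [ -S "$sock_path" ]
}

# Assumed path, based on the error message; adjust if your system differs.
LIBVIRT_SOCK="/var/run/libvirt/libvirt-sock"

if socket_present "$LIBVIRT_SOCK"; then
  echo "libvirt socket found at $LIBVIRT_SOCK"
else
  echo "libvirt socket missing at $LIBVIRT_SOCK"
fi
```

If the socket is missing, the libvirt daemon most likely is not running (or has not started yet), which would be consistent with the startup error.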

After this, I seemed to be getting some kind of trace error when shutting down: image

More shutdown: image

And finally it stalls here forever (I've waited a day for this, and it didn't shut down 😞): image

Uninstalling the VM Backup plugin fixed the issue, and I can now shut down/reboot without stalling, crashing, or triggering a parity check. The errors are gone too.

It is a real shame because I use this plugin daily (nightly). I've gone back to using the bash script version, intended for use with the CA User Scripts plugin. This works using all the same settings.

Does anyone know why this is happening? I'd like to use this plugin.

JTok commented 4 years ago

I'm having a hard time tracking down how the plugin could be causing this issue (not saying it isn't, just saying I haven't found the connection yet). On the back-end it just runs the same script, and the front-end is just using the standard unraid web interface.

One thing that comes to mind is that maybe the backup script isn't completing entirely, and that is causing things to hang.

I assume your backups were finishing correctly?

KptnKMan commented 4 years ago

The backups finished correctly, as far as I can tell. I haven't tested a restore yet, but plan to at some point.

However, your plugin must be running a service somewhere, as I understand it has to schedule itself.

Either way, there are multiple threads in the official forum talking about this, and I'm not the only one for whom uninstalling the plugin fixes the issues.

JTok commented 4 years ago

The plugin actually just uses cron to schedule itself, so there is no service running...
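As a rough way to confirm that a schedule exists without any service running, the sketch below filters cron lines by keyword. It assumes the plugin's entry would show up in `crontab -l` output and would mention `vmbackup` (taken from the repository name); neither has been verified for this plugin. The function name `find_cron_entries` and the sample script paths are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read cron lines on stdin and print the
# non-comment lines that mention the given keyword (case-insensitive).
find_cron_entries() {
  local keyword="$1"
  grep -v '^[[:space:]]*#' | grep -i -- "$keyword"
}

# Demo on made-up sample input (the script paths are illustrative only).
printf '%s\n' \
  '# vmbackup schedule (comment line, ignored)' \
  '0 2 * * * /hypothetical/vmbackup.sh' \
  '15 3 * * * /other/job.sh' \
  | find_cron_entries vmbackup

# On a live system one could try (assumption, not verified for unRAID):
#   crontab -l 2>/dev/null | find_cron_entries vmbackup
```

A matching line combined with no long-running plugin process would be consistent with the plugin being purely cron-driven.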

🤔 Unfortunately I haven't noticed the issue myself, but I think I may have an idea of what could be causing it. Any chance you would be willing to try something for me to test it? I would just need you to install the plugin and set the default profile to back up something on a schedule, but DO NOT run it. Just configure it (you don't need to uninstall the script or anything, as we will just need the plugin for one reboot). Then just try to reboot and see if it gives you the error.

I think it is possible that one of a couple of different things isn't closing out properly. If it is the script itself, then in theory, if you install the plugin and don't run anything, it should reboot and work like normal.

JTok commented 4 years ago

Also, were you running any pre or post scripts?

KptnKMan commented 4 years ago

Not running any pre or post scripts. I have the User Scripts plugin installed, but that's to be assumed I think.

I can install the plugin again, as I'd really like to help get this fixed, but (for the record) I'm not keen on hard-resetting my unraid system.

I'll test this tonight, per your instructions and post back.

JTok commented 4 years ago

Yeah, the user scripts plugin is a given with unraid as far as I'm concerned. I use it for a lot of other things.

That's completely fair. I appreciate your willingness to help test in spite of that. Thanks!

KptnKMan commented 4 years ago

@JTok sorry for the delay in getting back to you. I had some family shenanigans, so I've only just now had time to sit down and test this out without interruption.

With that being said, I ran through what you asked...

So I rebooted a few times (5) and noted the results. Note that everything is running unRAID v6.8.3, still reverted from 6.9.0-beta1 since I first encountered the issues weeks ago. I observed no issues in the time since removing the plugin and using the manual script via the User Scripts plugin.

First (1) reboot as control (without plugin), everything worked fine, no VMs running. Before reboot I verified everything was working and updated all my other plugins. Turned off AUTOSTART for all VMs. No startup or shutdown issues observed this reboot. No startup or shutdown issues observed on previous reboots, with VMs running at shutdown time.

Second (2) reboot (plugin installed), no VMs running. Before reboot, made sure to install and configure plugin with schedule, but NOT RUN IT. Upon reboot, I noted the startup error reappeared: "error: failed to connect to the hypervisor". image

Third (3) reboot (plugin installed), no VMs running. Took some time to shut down, longer than usual. While shutting down, noted that the system wait time of 90 seconds was exceeded and a forced shutdown was initiated. Also, libvirt is reported as not running. image Everything else appeared to unmount and shut down without issue. image Same startup error was noted after reboot.

Fourth (4) reboot (plugin still installed), VMs running. Took some time to shut down, longer than usual, assumed to be the 90-second wait followed by a forced shutdown. Could not observe the screen because the desktop VM was running (testing the normal operational scenario). Same issues noted as on the previous reboot (3).

Fifth (5) reboot (plugin UNINSTALLED), no VMs running. While shutting down, noted that the 90-second wait time was not exceeded and a forced shutdown was not initiated. Libvirt still reported as not running. image image

Sixth (6) reboot (plugin still uninstalled), no VMs running. Currently testing, will update this with an edit and final screenshot. The expected result is that no startup or forced-shutdown issues are observed.

KptnKMan commented 4 years ago

Update: Sixth (6) reboot (plugin uninstalled), no VMs running. Noted that the 90-second wait time was reached and a forced reboot was initiated. I don't know why this happened, so I'm testing further reboots to check whether it is consistent. Noted libvirt is still reported as not running.

Seventh (7) reboot (plugin uninstalled), no VMs running. No issues noted. Noted libvirt is still reported as not running.

Eighth (8) and Ninth (9) reboot (plugin uninstalled), no VMs running. No issues noted. Noted libvirt is still reported as not running.

Subsequent reboots with VMs running seem fine. I have not (today) tested any reboots after running the plugin, since the results of that are already evident when I reported this issue.

Lastly: In hindsight, the reporting of "libvirt is not running..." might be a normal info message, as it seems consistent. I need to investigate my logs to verify this.

KptnKMan commented 4 years ago

@JTok have you had a chance to look at any of this? I'm trying to find out if there's anything I can do.

JTok commented 3 years ago

My apologies for the extended delay. I never intended to abandon this project for so long, but life got in the way.

I have released an updated version of the plugin that is designed to be more aggressive when killing scripts. I'm hoping that is at least one part of the problem, though I'm doubtful.
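To illustrate what "more aggressive" killing usually means, here is a minimal sketch of a common pattern: send SIGTERM, wait a short grace period, then fall back to SIGKILL. This is not the plugin's actual code; the function name `stop_with_escalation` and the default 5-second grace period are assumptions chosen for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not the plugin's implementation): ask a process
# to stop with SIGTERM, wait up to "grace" seconds, then send SIGKILL.
stop_with_escalation() {
  local pid="$1" grace="${2:-5}" waited=0
  # If the signal cannot be delivered, the process is already gone.
  kill -TERM "$pid" 2>/dev/null || return 0
  # kill -0 only tests whether the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$grace" ]; then
      kill -KILL "$pid" 2>/dev/null || true
      break
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage example: stop a background job, allowing it 2 seconds to exit.
sleep 30 &
stop_with_escalation "$!" 2
```

The grace period gives a script the chance to clean up; the SIGKILL fallback keeps a stuck script from blocking shutdown indefinitely.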

I am hoping to get back into this and start fielding some of the issues people are having.