[BUG ?] Controller on drbd not working and crashes proxmox entirely after update

frenzymind commented 4 years ago

Hello. I updated Proxmox and LINSTOR to the latest versions.

On host: pve-manager/6.1-7/13e58d5e (running kernel: 5.3.18-1-pve)
linstor-client 1.0.12-1
linstor-common 1.4.2-1
linstor-proxmox 4.1.2-1
linstor-satellite 1.4.2-1
python-linstor 1.0.11-1

On controller:
linstor-client 1.0.12-1
linstor-common 1.4.2-1
linstor-controller 1.4.2-1
python-linstor 1.0.11-1

After the reboot:

If I click on any storage, I get "communication failure (0)" after watching the loading spinner for a while.

I can't start any container: "Job for pve-container@109.service failed because a timeout was exceeded." This happens even with containers on local storage.

The Proxmox GUI freezes on every menu item.

If I comment out the drbd section in storage.cfg and reboot the host, everything works well. I then restored the controller from backup, but onto local storage for now, uncommented the drbd storage and rebooted again. Everything works well. After some experimenting: if the controller is located on drbd, the trouble appears and nothing works. If the controller is on local storage, everything works perfectly. What is wrong?

rck commented 4 years ago

So the controller is in a VM on DRBD storage? Is the system configured according to the latest version of the documentation on how to run such systems? (not some outdated blog posts, the latest version of the LINSTOR guide).

In general you can avoid many many many many (...) problems by not having the controller (basically we are talking about the controller's DB) on DRBD itself. Using an etcd cluster for the DB works better. Or, and that would be my answer to 99% of all users: just do regular backups of the DB. Most systems are not that dynamic at all, especially ones dealing with real VMs like Proxmox. And if you lose the controller node, start a controller on another node with the backup DB. Easy and simple.
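A minimal sketch of the "regular backups of the DB" idea, in Python. It assumes the controller uses its file-based database and that the database directory is /var/lib/linstor; both are assumptions, so check your own setup. The function name `backup_db`, the archive naming, and the `keep` and `stamp` parameters are made up for illustration and are not part of LINSTOR. A copy taken while the controller is writing may be inconsistent, so it is probably safer to run this while the controller is stopped or idle.

```python
import shutil
import time
from pathlib import Path


def backup_db(db_dir, backup_dir, keep=7, stamp=None):
    """Archive db_dir into backup_dir and keep only the newest `keep` archives.

    db_dir is assumed to hold the controller database files (for example
    /var/lib/linstor, which is an assumption about the installation).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    db_dir = Path(db_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # A sortable timestamp, so that sorting by name is sorting by age.
    if stamp is None:
        stamp = time.strftime("%Y%m%d-%H%M%S")

    base = backup_dir / f"linstor-db-{stamp}"
    # make_archive appends ".tar.gz" and returns the full archive path.
    archive = Path(shutil.make_archive(str(base), "gztar", root_dir=str(db_dir)))

    # Drop everything except the newest `keep` archives.
    archives = sorted(backup_dir.glob("linstor-db-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()

    return archive
```

Something like `backup_db("/var/lib/linstor", "/mnt/backup/linstor")` could then be run from cron on the controller node, with the backup directory on storage that does not depend on the controller (so not on the DRBD pool it manages). To recover, unpack the newest archive into the DB directory of a controller on another node before starting it.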

frenzymind commented 4 years ago

The controller was in a CT on DRBD. Now it is in a CT, but on local storage. I am not sure about the configuration. Can you give me a link to the current guide? The controller sits on DRBD so that it is automatically recreated on another node if the current one fails. Doing it manually is not really an HA solution, and it takes time. Yes, I do back up the controller DB. And when I tested it, after creating a new controller container and restoring the DB, I had to reboot that node to make the controller work. Rebooting a working node is not a good thing.

rck commented 4 years ago

My guess is you have the same chicken-and-egg problem as if the controller were in a VM: how would you start the controller on DRBD if the controller is needed to bring up the DRBD resource it is on? For VMs this can be solved like this: https://docs.linbit.com/docs/linstor-guide/#s-proxmox-ls-HA
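To make the chicken-and-egg problem concrete, here is a small Python sketch that models "X needs Y before it can start" as a graph and looks for a cycle. The node names and the two dependency maps are a simplified illustration of the situation described in this thread, not output from LINSTOR or Proxmox, and `find_cycle` is a hypothetical helper.

```python
def find_cycle(deps):
    """Return a list of nodes forming a dependency cycle, or None.

    deps maps each item to the items it needs before it can start.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, in progress, finished
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for needed in deps.get(node, ()):
            state = color.get(needed, WHITE)
            if state == GREY:
                # We came back to something still in progress: a cycle.
                return stack[stack.index(needed):] + [needed]
            if state == WHITE:
                found = visit(needed)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Controller container stored on the DRBD storage it manages (illustrative).
controller_on_drbd = {
    "controller container": ["controller volume on DRBD"],
    "controller volume on DRBD": ["LINSTOR controller reachable"],
    "LINSTOR controller reachable": ["controller container"],
}

# Controller container stored on local storage (illustrative).
controller_on_local = {
    "controller container": ["local storage"],
    "LINSTOR controller reachable": ["controller container"],
}

print(find_cycle(controller_on_drbd))
print(find_cycle(controller_on_local))
```

With the first map the search finds a loop from the container through its volume and the controller back to the container; with the second map there is no loop, which matches the observation that everything works when the controller is on local storage.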

A LINSTOR controller in a container is completely untested; you are on your own. A solution similar to the VM case should work. This is not a bug in the plugin; you are using it in an unintended way, so I'm closing this.

frenzymind commented 4 years ago

The controller resource is added to the satellite exceptions, so when the controller is down, that resource is still reachable. OK, I will try the guide from that link. Thanks!

LINBIT / linstor-proxmox

[BUG ?] Controller on drbd not working and crashes proxmox entirely after update #31