LINBIT / linstor-proxmox

Integration plugin bridging LINSTOR to Proxmox VE

auto-add-quorum-tie-breaker not honoured for containers (only) #45

Open acidrop opened 3 years ago

acidrop commented 3 years ago

I'm doing some tests regarding the quorum auto tie-breaker and I noticed that even though it's honoured for QEMU VMs, it is not for LXC containers. Not sure if this is the right place to post this, but since this occurs on Proxmox, I thought it might be more related to its plugin.

Here's the sequence of commands...

root@pve1:~# linstor rg l -p
+---------------------------------------------------------------------+
| ResourceGroup | SelectFilter                 | VlmNrs | Description |
|=====================================================================|
| DfltRscGrp    | PlaceCount: 2                |        |             |
|---------------------------------------------------------------------|
| drbd-rg01     | PlaceCount: 2                | 0      |             |
|               | StoragePool(s): drbdpool     |        |             |
|               | DisklessOnRemaining: False   |        |             |
|               | ProviderList: ['ZFS_THIN']   |        |             |
|---------------------------------------------------------------------|
| drbd-rg02     | PlaceCount: 2                | 0      | ThinLVM     |
|               | StoragePool(s): thinpool01   |        |             |
|               | DisklessOnRemaining: False   |        |             |
|               | ProviderList: ['LVM_THIN']   |        |             |
+---------------------------------------------------------------------+

root@pve1:~# linstor c lp|grep quorum
| DrbdOptions/Resource/quorum            | majority |
| DrbdOptions/auto-add-quorum-tiebreaker | true     |
| DrbdOptions/auto-quorum                | io-error |

root@pve1:~# linstor rd lp vm-109-disk-1 -p
+-----------------------------------------------------------+
| Key                                    | Value            |
|===========================================================|
| DrbdOptions/Net/allow-two-primaries    | yes              |
| DrbdOptions/Resource/quorum            | off              |
| DrbdOptions/auto-add-quorum-tiebreaker | False            |
| DrbdOptions/auto-verify-alg            | crct10dif-pclmul |
| DrbdPrimarySetOn                       | PVE3             |
+-----------------------------------------------------------+

Shouldn't the property be inherited from the controller-set properties? ...anyway, I add it manually...

root@pve1:~# linstor rd sp vm-109-disk-1 DrbdOptions/auto-add-quorum-tiebreaker True
SUCCESS:
    Successfully set property key(s): DrbdOptions/auto-add-quorum-tiebreaker
SUCCESS:
Description:
    Resource definition 'vm-109-disk-1' modified.
Details:
    Resource definition 'vm-109-disk-1' UUID is: e9847acd-efb8-42bc-9bf4-7f7becf126d5
SUCCESS:
    (pve2) Resource 'vm-109-disk-1' [DRBD] adjusted.
SUCCESS:
    (pve3) Resource 'vm-109-disk-1' [DRBD] adjusted.

root@pve1:~# linstor rd lp vm-109-disk-1 -p
+-----------------------------------------------------------+
| Key                                    | Value            |
|===========================================================|
| DrbdOptions/Net/allow-two-primaries    | yes              |
| DrbdOptions/Resource/quorum            | off              |
| DrbdOptions/auto-add-quorum-tiebreaker | true             |
| DrbdOptions/auto-verify-alg            | crct10dif-pclmul |
| DrbdPrimarySetOn                       | PVE3             |
+-----------------------------------------------------------+

root@pve1:~# linstor r l|grep 109
| vm-109-disk-1 | pve2 | 7005 | InUse  | Ok | UpToDate | 2021-07-21 09:24:58 |
| vm-109-disk-1 | pve3 | 7005 | Unused | Ok | UpToDate | 2021-07-21 09:24:56 |

root@pve1:~# ssh pve2 'pct migrate 109 pve4 --restart'
2021-07-21 09:34:57 shutdown CT 109
2021-07-21 09:35:01 use dedicated network address for sending migration traffic (10.10.13.4)
2021-07-21 09:35:02 starting migration of CT 109 to node 'pve4' (10.10.13.4)
2021-07-21 09:35:02 volume 'linstor-thinlvm:vm-109-disk-1' is on shared storage 'linstor-thinlvm'
2021-07-21 09:35:02 start final cleanup
2021-07-21 09:35:03 start container on target node
2021-07-21 09:35:03 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve4' root@10.10.13.4 pct start 109
2021-07-21 09:35:08 migration finished successfully (duration 00:00:11)

root@pve1:~# linstor r l|grep 109
| vm-109-disk-1 | pve2 | 7005 | Unused | Ok | UpToDate | 2021-07-21 09:24:58 |
| vm-109-disk-1 | pve3 | 7005 | Unused | Ok | UpToDate | 2021-07-21 09:24:56 |
| vm-109-disk-1 | pve4 | 7005 | InUse  | Ok | Diskless | 2021-07-21 09:35:05 |

root@pve1:~# ssh pve4 'pct migrate 109 pve2 --restart'
2021-07-21 09:36:23 shutdown CT 109
2021-07-21 09:36:26 use dedicated network address for sending migration traffic (10.10.13.2)
2021-07-21 09:36:27 starting migration of CT 109 to node 'pve2' (10.10.13.2)
2021-07-21 09:36:27 volume 'linstor-thinlvm:vm-109-disk-1' is on shared storage 'linstor-thinlvm'

NOTICE
  Intentionally removing diskless assignment (vm-109-disk-1) on (pve4).
  It will be re-created when the resource is actually used on this node.
2021-07-21 09:36:28 start final cleanup
2021-07-21 09:36:30 start container on target node
2021-07-21 09:36:30 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@10.10.13.2 pct start 109
2021-07-21 09:36:37 migration finished successfully (duration 00:00:14)

root@pve1:~# linstor r l|grep 109
| vm-109-disk-1 | pve2 | 7005 | InUse  | Ok | UpToDate | 2021-07-21 09:24:58 |
| vm-109-disk-1 | pve3 | 7005 | Unused | Ok | UpToDate | 2021-07-21 09:24:56 |

root@pve1:~# linstor advise r -p|grep 109
| vm-109-disk-1 | Resource has 2 replicas but no tie-breaker, could lead to split brain. | linstor rd ap -d --place-count 1 vm-109-disk-1 |

root@pve1:~# linstor rd ap -d --place-count 1 vm-109-disk-1
usage: linstor [-h] [--version] [--no-color] [--no-utf8] [--warn-as-error]
               [--curl] [--controllers CONTROLLERS] [-m]
               [--output-version {v0,v1}] [--verbose] [-t TIMEOUT]
               [--disable-config] [--user USER] [--password PASSWORD]
               [--certfile CERTFILE] [--keyfile KEYFILE] [--cafile CAFILE]
               [--allow-insecure-auth]
               {advise, controller, drbd-proxy, encryption, error-reports, exos, help, interactive, list-commands, node, physical-storage, resource, resource-connection, resource-definition, resource-group, snapshot, sos-report, space-reporting, storage-pool, volume, volume-definition, volume-group}
               ...
linstor: error: unrecognized arguments: -d

Not sure if the above advise output is correct in the first place?

rck commented 3 years ago

The "unrecognized argument" looks like a client bug (@rp- ). The rest is "something else", there is no difference between VMs and containers in the plugin. @ghernadi might see what happens here quicker, he knows the quorum/tiebreaker rules ways better. Might be even as intended, I don't know. Other guess: lvm vs. zfs, but probably unlikely.

rp- commented 3 years ago

Well, the advise command seems just wrong here; I guess it should be --drbd-diskless instead of -d. @WanzenBug?

ghernadi commented 3 years ago

Shouldn't the property be inherited from the controller-set properties? ...anyway, I add it manually...

Well, yes, unless the ResourceDefinition overrules the otherwise inherited controller-property. The same property can be set on multiple levels (Controller, ResourceGroup, ResourceDefinition, etc...). As a rule of thumb: The closer the property is to the actual volume (LVM / ZFS / ...) the higher its priority. In this case the False from the ResourceDefinition had a higher priority than the True from the Controller.
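
For illustration, a minimal sketch of how the same property can be set on the three levels with the linstor client (property key and resource/group names taken from this report; the values are just examples):

# Controller level - lowest priority, inherited by everything below it
linstor controller set-property DrbdOptions/auto-add-quorum-tiebreaker true

# ResourceGroup level - overrides the controller for resources spawned from this group
linstor resource-group set-property drbd-rg02 DrbdOptions/auto-add-quorum-tiebreaker true

# ResourceDefinition level - closest to the volume, wins over both of the above
linstor resource-definition set-property vm-109-disk-1 DrbdOptions/auto-add-quorum-tiebreaker False

Note that list-properties (lp) only shows what is set on that particular object, not what it effectively inherits from the levels above.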

Regarding the rest: please update the linstor-client. We recently changed it so that linstor r l now shows all resources, including the tie-breaker resources that were "hidden by default" in previous versions and needed a linstor r l -a to be shown.

That means I assume you indeed had your tie-breaker resource deployed as expected, but it was simply hidden by the mentioned client behavior. I suspect this because of these logs:

root@pve1:~# linstor r l|grep 109
| vm-109-disk-1 | pve2 | 7005 | InUse | Ok | UpToDate | 2021-07-21 09:24:58 |
| vm-109-disk-1 | pve3 | 7005 | Unused | Ok | UpToDate | 2021-07-21 09:24:56 |

only 2 resources shown here

root@pve1:~# ssh pve2 'pct migrate 109 pve4 --restart'
...
root@pve1:~# linstor r l|grep 109
| vm-109-disk-1 | pve2 | 7005 | Unused | Ok | UpToDate | 2021-07-21 09:24:58 |
| vm-109-disk-1 | pve3 | 7005 | Unused | Ok | UpToDate | 2021-07-21 09:24:56 |
| vm-109-disk-1 | pve4 | 7005 | InUse | Ok | Diskless | 2021-07-21 09:35:05 |

Suddenly you have 3 resources. A tie-breaking resource that gets promoted (DRBD Primary) immediately loses its TIE_BREAKER flag and gets "degraded" from a Linstor-managed TIE_BREAKER to a user-managed DISKLESS resource. The only difference between a tiebreaker and a diskless resource is that Linstor is brave enough to automatically remove the tiebreaker in case it is no longer needed. A diskless resource is never touched - it is only taken over in case you try to delete the diskless resource but Linstor is configured to keep it as a tiebreaker. You can conveniently overrule this "takeover" logic by deleting the now-tiebreaker resource. However, in that case Linstor assumes that you do not want a tiebreaker on this ResourceDefinition, which means that Linstor also sets the DrbdOptions/auto-add-quorum-tiebreaker = False property - which might explain why it was set in your setup.
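
In other words (a small sketch, reusing the resource name from this report), if the property ended up on the ResourceDefinition this way, you can check for it and explicitly re-enable it like this:

# check what is set on the resource definition itself
linstor rd lp vm-109-disk-1 -p | grep auto-add-quorum-tiebreaker

# explicitly re-enable the automatic tiebreaker for this resource definition
linstor rd sp vm-109-disk-1 DrbdOptions/auto-add-quorum-tiebreaker True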

acidrop commented 3 years ago

In this case the RD properties are set automatically during the VM/CT creation process via the LINSTOR Proxmox plugin. The only difference I can see between a VM and a CT is that the first is live migrated while the second needs to be restarted (shutdown/start). Could that play a role in this?

There are no properties set at the RG level:

root@pve1:~# linstor rg lp drbd-rg01 -p
+-------------+
| Key | Value |
|=============|
+-------------+
root@pve1:~# linstor rg lp drbd-rg02 -p
+-------------+
| Key | Value |
|=============|
+-------------+

Below are some more tests, creating both a QEMU VM and an LXC container... As you will notice, the tiebreaker resource is created correctly for the VM (vm-109-disk-1) but not for the CT (vm-110-disk-1). I also added the "-a" parameter to the "linstor r l" command, but the tiebreaker is not visible even with that.

QEMU VM = vm-109-disk-1, LXC CT = vm-110-disk-1

# Resource definition properties as created by Proxmox during the VM/CT creation process…

root@pve1:~# linstor rd lp vm-109-disk-1 -p
+--------------------------------------------------------+
| Key                                 | Value            |
|========================================================|
| DrbdOptions/Net/allow-two-primaries | yes              |
| DrbdOptions/Resource/quorum         | off              |
| DrbdOptions/auto-verify-alg         | crct10dif-pclmul |
| DrbdPrimarySetOn                    | PVE3             |
+--------------------------------------------------------+

root@pve1:~# linstor rd lp vm-110-disk-1 -p
+--------------------------------------------------------+
| Key                                 | Value            |
|========================================================|
| DrbdOptions/Net/allow-two-primaries | yes              |
| DrbdOptions/Resource/quorum         | off              |
| DrbdOptions/auto-verify-alg         | crct10dif-pclmul |
| DrbdPrimarySetOn                    | PVE2             |
+--------------------------------------------------------+

root@pve1:~# linstor r l -a|grep vm-109
| vm-109-disk-1 | pve2 | 7005 | Unused | Ok    |   UpToDate | 2021-07-21 11:01:10 |
| vm-109-disk-1 | pve3 | 7005 | InUse  | Ok    |   UpToDate | 2021-07-21 11:01:07 |

root@pve1:~# ssh pve3 'qm migrate 109 pve4 --online'
2021-07-21 11:06:45 use dedicated network address for sending migration traffic (10.10.13.4)
2021-07-21 11:06:45 starting migration of VM 109 to node 'pve4' (10.10.13.4)
2021-07-21 11:06:45 starting VM 109 on remote node 'pve4'
2021-07-21 11:06:50 start remote tunnel
2021-07-21 11:06:51 ssh tunnel ver 1
2021-07-21 11:06:51 starting online/live migration on tcp:10.10.13.4:60000
2021-07-21 11:06:51 set migration capabilities
2021-07-21 11:06:52 migration downtime limit: 100 ms
2021-07-21 11:06:52 migration cachesize: 256.0 MiB
2021-07-21 11:06:52 set migration parameters
2021-07-21 11:06:52 start migrate command to tcp:10.10.13.4:60000
2021-07-21 11:06:53 average migration speed: 2.0 GiB/s - downtime 16 ms
2021-07-21 11:06:53 migration status: completed
2021-07-21 11:06:56 migration finished successfully (duration 00:00:11)

root@pve1:~# linstor r l -a|grep vm-109
| vm-109-disk-1 | pve2 | 7005 | Unused | Ok    |   UpToDate | 2021-07-21 11:01:10 |
| vm-109-disk-1 | pve3 | 7005 | Unused | Ok    |   UpToDate | 2021-07-21 11:01:07 |
| vm-109-disk-1 | pve4 | 7005 | InUse  | Ok    |   Diskless | 2021-07-21 11:06:47 |

root@pve1:~# ssh pve4 'qm migrate 109 pve3 --online'
2021-07-21 11:07:51 use dedicated network address for sending migration traffic (10.10.13.3)
2021-07-21 11:07:51 starting migration of VM 109 to node 'pve3' (10.10.13.3)
2021-07-21 11:07:51 starting VM 109 on remote node 'pve3'
2021-07-21 11:07:55 start remote tunnel
2021-07-21 11:07:56 ssh tunnel ver 1
2021-07-21 11:07:56 starting online/live migration on tcp:10.10.13.3:60000
2021-07-21 11:07:56 set migration capabilities
2021-07-21 11:07:56 migration downtime limit: 100 ms
2021-07-21 11:07:56 migration cachesize: 256.0 MiB
2021-07-21 11:07:56 set migration parameters
2021-07-21 11:07:56 start migrate command to tcp:10.10.13.3:60000
2021-07-21 11:07:57 average migration speed: 2.0 GiB/s - downtime 6 ms
2021-07-21 11:07:57 migration status: completed

NOTICE
  Intentionally removing diskless assignment (vm-109-disk-1) on (pve4).
  It will be re-created when the resource is actually used on this node.
2021-07-21 11:08:01 migration finished successfully (duration 00:00:10)

root@pve1:~# linstor r l -a|grep vm-109
| vm-109-disk-1 | pve2 | 7005 | Unused | Ok    |   UpToDate | 2021-07-21 11:01:10 |
| vm-109-disk-1 | pve3 | 7005 | InUse  | Ok    |   UpToDate | 2021-07-21 11:01:07 |
| vm-109-disk-1 | pve4 | 7005 | Unused | Ok    | TieBreaker | 2021-07-21 11:06:47 |

root@pve1:~# linstor rd lp vm-109-disk-1 -p
+--------------------------------------------------------+
| Key                                 | Value            |
|========================================================|
| DrbdOptions/Net/allow-two-primaries | yes              |
| DrbdOptions/Resource/on-no-quorum   | io-error         |
| DrbdOptions/Resource/quorum         | majority         |
| DrbdOptions/auto-verify-alg         | crct10dif-pclmul |
| DrbdPrimarySetOn                    | PVE3             |
+--------------------------------------------------------+

root@pve1:~# linstor r l -a|grep vm-110
| vm-110-disk-1 | pve2 | 7010 | InUse  | Ok    |   UpToDate | 2021-07-21 11:04:55 |
| vm-110-disk-1 | pve3 | 7010 | Unused | Ok    |   UpToDate | 2021-07-21 11:04:57 |

root@pve1:~# ssh pve2 'pct migrate 110 pve4 --restart'
2021-07-21 11:16:53 shutdown CT 110
2021-07-21 11:16:57 use dedicated network address for sending migration traffic (10.10.13.4)
2021-07-21 11:16:57 starting migration of CT 110 to node 'pve4' (10.10.13.4)
2021-07-21 11:16:57 volume 'linstor-thinlvm:vm-110-disk-1' is on shared storage 'linstor-thinlvm'
2021-07-21 11:16:58 start final cleanup
2021-07-21 11:16:59 start container on target node
2021-07-21 11:16:59 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve4' root@10.10.13.4 pct start 110
2021-07-21 11:17:04 migration finished successfully (duration 00:00:11)

root@pve1:~# linstor r l -a|grep vm-110
| vm-110-disk-1 | pve2 | 7010 | Unused | Ok    |   UpToDate | 2021-07-21 11:04:55 |
| vm-110-disk-1 | pve3 | 7010 | Unused | Ok    |   UpToDate | 2021-07-21 11:04:57 |
| vm-110-disk-1 | pve4 | 7010 | InUse  | Ok    |   Diskless | 2021-07-21 11:17:01 |

root@pve1:~# linstor rd lp vm-110-disk-1 -p
+-----------------------------------------------------------+
| Key                                    | Value            |
|===========================================================|
| DrbdOptions/Net/allow-two-primaries    | yes              |
| DrbdOptions/Resource/on-no-quorum      | io-error         |
| DrbdOptions/Resource/quorum            | majority         |
| DrbdOptions/auto-add-quorum-tiebreaker | False            |
| DrbdOptions/auto-verify-alg            | crct10dif-pclmul |
| DrbdPrimarySetOn                       | PVE2             |
+-----------------------------------------------------------+

root@pve1:~# ssh pve4 'pct migrate 110 pve2 --restart'
2021-07-21 11:20:17 shutdown CT 110
2021-07-21 11:20:21 use dedicated network address for sending migration traffic (10.10.13.2)
2021-07-21 11:20:21 starting migration of CT 110 to node 'pve2' (10.10.13.2)
2021-07-21 11:20:21 volume 'linstor-thinlvm:vm-110-disk-1' is on shared storage 'linstor-thinlvm'
2021-07-21 11:20:21 start final cleanup
2021-07-21 11:20:23 start container on target node
2021-07-21 11:20:23 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@10.10.13.2 pct start 110
2021-07-21 11:20:30 migration finished successfully (duration 00:00:13)

root@pve1:~# linstor rd lp vm-110-disk-1 -p
+-----------------------------------------------------------+
| Key                                    | Value            |
|===========================================================|
| DrbdOptions/Net/allow-two-primaries    | yes              |
| DrbdOptions/Resource/quorum            | off              |
| DrbdOptions/auto-add-quorum-tiebreaker | False            |
| DrbdOptions/auto-verify-alg            | crct10dif-pclmul |
| DrbdPrimarySetOn                       | PVE2             |
+-----------------------------------------------------------+

root@pve1:~# linstor r l -a|grep vm-110
| vm-110-disk-1 | pve2 | 7010 | InUse  | Ok    |   UpToDate | 2021-07-21 11:04:55 |
| vm-110-disk-1 | pve3 | 7010 | Unused | Ok    |   UpToDate | 2021-07-21 11:04:57 |

acidrop commented 3 years ago

Ok, after some further testing it looks like this is not a LINSTOR issue: when I create the RD/VD/Resource directly via the linstor command line, the tie-breaker is automatically created once I delete the Diskless resource from the respective node. The property is correctly inherited from the Controller as expected in this case.

The "issue" seems to be related to Proxmox/LINSTOR plugin and how it handles "live migration" and "shutdown/start" actions no matter if that's a VM or a CT. So, when executing a Live Migrate action on a VM to a Diskless node and then after migrate it back to a Diskful node, Linstor correctly marks the Diskless resource as a quorum tie breaker (i.e it does not delete it). When executing a shutdown action on a VM or a CT which is located on a Diskless node, then it deletes its Diskless resource from that node (which makes sense). All in all this looks like an expected rather than a strange behaviour.

So to summarise, in order for the auto-tie-breaker resource to be created in Proxmox, there are 2 options:

  1. For QEMU VMs: live migrate the VM from a Diskful node to a Diskless node and then migrate it back to the Diskful node.

  2. For LXC CTs: manually create a Diskless resource on a node (e.g. linstor r c -d pve4 vm-110-disk-1) and then delete it (e.g. linstor r d pve4 vm-110-disk-1). In this way "The given resource will not be deleted but will be taken over as a linstor managed tiebreaker resource." (see the sketch below)
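
For completeness, a sketch of option 2 with a verification step at the end (node and resource names as in this report):

# create a diskless assignment on a third node ...
linstor r c -d pve4 vm-110-disk-1

# ... and delete it again; with auto-add-quorum-tiebreaker enabled it is
# taken over as a linstor-managed tiebreaker instead of being removed
linstor r d pve4 vm-110-disk-1

# verify: the resource should now show up with the state "TieBreaker"
linstor r l -a | grep vm-110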

rck commented 3 years ago

Hm, yes, without thinking it through completely, such things could happen. The "when to create a diskless resource and when to remove it" logic is currently in the plugin: if there is no assignment whatsoever, create a diskless one; if the guest moved away and the resource is diskless, just delete it. That, in combination with the auto-tiebreaker, might have funny consequences.
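
Roughly, the per-node decision the plugin makes could be sketched like this (shell pseudocode of the logic described above, not the actual Perl code of the plugin; NODE and RES are placeholders):

NODE=pve4
RES=vm-110-disk-1

# guest is started on $NODE: if there is no assignment of $RES there at all,
# create a diskless one so the guest can run on that node
if ! linstor r l -a | grep "$RES" | grep -q "$NODE"; then
    linstor r c -d "$NODE" "$RES"
fi

# guest moved away from $NODE: if the assignment there is diskless, delete it -
# which, combined with the auto-tiebreaker logic, is where the reported behaviour comes from
if linstor r l -a | grep "$RES" | grep "$NODE" | grep -q Diskless; then
    linstor r d "$NODE" "$RES"
fi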

LINSTOR can now handle that on its own: there is a "make available" API that does the right thing and handles more complicated storage situations. The plugin has not switched to that API yet, so let's keep this open as a tracking issue.
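
For reference, newer linstor clients expose that API as resource make-available; a hedged sketch of what the plugin could eventually call instead of the create/delete logic above (exact syntax and flags may differ between client versions):

# let LINSTOR decide how to make the resource usable on this node,
# reusing an existing tiebreaker instead of deleting and re-creating it
linstor resource make-available pve4 vm-110-disk-1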