Could not create multiple DRBD replicas on top of shared LUN

kvaps commented 1 year ago

Hi, I have created shared LVM storage-pool by following the documentation on two nodes:

# linstor sp l -s shared-lun
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| StoragePool | Node       | Driver | PoolName   | FreeCapacity | TotalCapacity | CanSnapshots | State | SharedName                             |
|===============================================================================================================================================|
| shared-lun  | hf-virt-02 | LVM    | shared-lun |     6.99 GiB |     10.00 GiB | False        | Ok    | Q8lSH2-axOB-mF5p-xGaL-zNOm-pkY8-GSCqY3 |
| shared-lun  | hf-virt-03 | LVM    | shared-lun |     6.99 GiB |     10.00 GiB | False        | Ok    | Q8lSH2-axOB-mF5p-xGaL-zNOm-pkY8-GSCqY3 |
+-----------------------------------------------------------------------------------------------------------------------------------------------+

But I can't create more than one drbd diskful device on it:

# linstor rd c abcd
# linstor vd c abcd 1G
# linstor r c hf-virt-03 abcd -s shared-lun
# linstor r l -r abcd
+------------------------------------------------------------------------------------+
| ResourceName | Node       | Port | Usage  | Conns |    State | CreatedOn           |
|====================================================================================|
| abcd         | hf-virt-03 | 7030 | Unused | Ok    | UpToDate | 2023-02-27 13:28:46 |
+------------------------------------------------------------------------------------+
# linstor r c hf-virt-02 abcd -s shared-lun
SUCCESS:
    Successfully set property key(s): StorPoolName
SUCCESS:
    Successfully set property key(s): StorPoolName
INFO:
    Tie breaker resource 'abcd' created on DfltDisklessStorPool
INFO:
    Resource-definition property 'DrbdOptions/Resource/quorum' updated from 'off' to 'majority' by auto-quorum
INFO:
    Resource-definition property 'DrbdOptions/Resource/on-no-quorum' updated from 'off' to 'io-error' by auto-quorum
SUCCESS:
Description:
    New resource 'abcd' on node 'hf-virt-02' registered.
Details:
    Resource 'abcd' on node 'hf-virt-02' UUID is: 262e4235-59e8-4c8f-b81d-6e73db738daf
SUCCESS:
Description:
    Volume with number '0' on resource 'abcd' on node 'hf-virt-02' successfully registered
Details:
    Volume UUID is: c7522e61-1f58-433d-ad08-94bb8f17ef48
SUCCESS:
    Added peer(s) 'hf-virt-02' to resource 'abcd' on 'hf-virt-01'
SUCCESS:
    Added peer(s) 'hf-virt-02' to resource 'abcd' on 'hf-virt-03'
ERROR:
    (Node: 'hf-virt-02') Failed to adjust DRBD resource abcd
Show reports:
    linstor error-reports show 63FCA19A-57D8E-000001
# linstor r l -r abcd
+------------------------------------------------------------------------------------------------------------------+
| ResourceName | Node       | Port | Usage  | Conns                             |      State | CreatedOn           |
|==================================================================================================================|
| abcd         | hf-virt-01 | 7030 | Unused | Connecting(hf-virt-02)            | TieBreaker | 2023-02-27 13:28:56 |
| abcd         | hf-virt-02 | 7030 | Unused | StandAlone(hf-virt-03,hf-virt-01) |   Diskless |                     |
| abcd         | hf-virt-03 | 7030 | Unused | Connecting(hf-virt-02)            |   UpToDate | 2023-02-27 13:28:46 |
+------------------------------------------------------------------------------------------------------------------+
# linstor error-reports show 63FCA19A-57D8E-000001
ERROR REPORT 63FCA19A-57D8E-000001

============================================================

Application:                        LINBIT�� LINSTOR
Module:                             Satellite
Version:                            1.20.3
Build ID:                           8d19a891df018f6e3d40538d809904f024bfe361
Build time:                         2023-01-27T11:19:21+00:00
Error time:                         2023-02-27 13:28:57
Node:                               hf-virt-02

============================================================

Reported error:
===============

Description:
    Failed to adjust DRBD resource abcd

Category:                           LinStorException
Class name:                         ResourceException
Class canonical name:               com.linbit.linstor.core.devmgr.exceptions.ResourceException
Generated at:                       Method 'adjustDrbd', Source file 'DrbdLayer.java', Line #834

Error message:                      Failed to adjust DRBD resource abcd

Error context:
    An error occurred while processing resource 'Node: 'hf-virt-02', Rsc: 'abcd''

Call backtrace:

    Method                                   Native Class:Line number
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:834
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:901
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:359
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:169
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:322
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1152
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:750
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:644
    run                                      N      java.lang.Thread:829

Caused by:
==========

Description:
    Execution of the external command 'drbdadm' failed.
Cause:
    The external command exited with error code 1.
Correction:
    - Check whether the external program is operating properly.
    - Check whether the command line is correct.
      Contact a system administrator or a developer if the command line is no longer valid
      for the installed version of the external program.
Additional information:
    The full command line executed was:
    drbdadm -vvv adjust abcd

    The external command sent the following output data:
    drbdsetup new-resource abcd 1 --on-no-quorum=io-error --quorum=majority
    drbdsetup new-minor abcd 1023 0
    drbdsetup new-peer abcd 2 --_name=hf-virt-01 --verify-alg=crct10dif-pclmul --shared-secret=kRLtYYrxX/usF3jZMBJg --cram-hmac-alg=sha1
    drbdsetup new-peer abcd 0 --_name=hf-virt-03 --verify-alg=crct10dif-pclmul --shared-secret=kRLtYYrxX/usF3jZMBJg --cram-hmac-alg=sha1
    drbdsetup new-path abcd 2 ipv4:95.217.77.33:7030 ipv4:95.217.77.109:7030
    drbdsetup new-path abcd 0 ipv4:95.217.77.33:7030 ipv4:95.217.77.30:7030
    drbdsetup peer-device-options abcd 2 0 --set-defaults --bitmap=no
    drbdmeta 1023 v09 /dev/shared-lun/abcd_00000 internal apply-al
    drbdsetup attach 1023 /dev/shared-lun/abcd_00000 /dev/shared-lun/abcd_00000 internal --discard-zeroes-if-aligned=no --rs-discard-granularity=8192

    The external command sent the following error information:
    New resource abcd
    New minor 1023 (vol:0)
    1023: Failure: (165) Unclean meta-data found.
    You need to 'drbdadm apply-al res'

    additional info from kernel:
    Found unclean meta data. Did you "drbdadm apply-al"?

    Command 'drbdsetup attach 1023 /dev/shared-lun/abcd_00000 /dev/shared-lun/abcd_00000 internal --discard-zeroes-if-aligned=no --rs-discard-granularity=8192' terminated with exit code 10

Category:                           LinStorException
Class name:                         ExtCmdFailedException
Class canonical name:               com.linbit.extproc.ExtCmdFailedException
Generated at:                       Method 'execute', Source file 'DrbdAdm.java', Line #593

Error message:                      The external command 'drbdadm' exited with error code 1

Call backtrace:

    Method                                   Native Class:Line number
    execute                                  N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:593
    adjust                                   N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:90
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:752
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:901
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:359
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:169
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:322
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1152
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:750
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:644
    run                                      N      java.lang.Thread:829

END OF ERROR REPORT.

I guess that drbd can't even work over shared LUN. I think we shouldn't allow creating the second diskful replica this case.

The proposed solution:

If: create resource -s storage-pool is requested on storage-pool with shared space: then:
- check if this storage-pool contains other diskful resource
- force to create diskless replica and show warning about that.
if: toggle-disk -s storage-pool is requested on diskless resource on storage-pool with shared space then:
- check if this storage-pool contains other diskful resource
- turn this diskful replica to diskless without removing backing LV
- turn diskless replica to diskful on requested node
if: one of diskless resources become primary then:
- freeze it
- automatically turn it to diskful, by the executing procedure above
- unfreeze it

ghernadi commented 1 year ago

Thanks for the notice.

I think we shouldn't allow creating the second diskful replica this case.

At least not an active one. The bug that I see here is that Linstor should have created the second diskful resource automatically with --inactive. That would lead to a situation where the second diskful node will NOT have the DRBD device at all (not even as diskless), but the user can linstor r deactivate the first diskful resource and linstor r activate the second resource to "move" the DRBD device to the second node.

Although I do like the diskless-dancing idea, at least your third point will most likely not be implemented, since that would require some "if this resource gets primary" hook within Linstor, which is not something we intended to do.

The other two points sound like good suggestions but might have some well hidden problems in the details (i.e. correctly managing the node-ids). However, we will think about them.

PS: now that I thought a bit longer about this issue, I think we also have to prohibit having 2 shared resources with different internal/external metadata settings, as that will cause definitely a lot of trouble.

kvaps commented 1 year ago

your third point will most likely not be implemented, since that would require some "if this resource gets primary" hook within Linstor, which is not something we intended to do.

Good point, I think we can implement this additional call for toggle-disk in csi-driver. This will allow us to handle the cases of recreating pods and live-migration of virtual machines the more smart way.

kvaps commented 1 year ago

@ghernadi is it possible to perform this procedure by request if one of resouce is already InUse? Eg, by invoking drbdadm disconnect with --force, then toggling it into diskful?

if: toggle-disk -s storage-pool is requested on diskless resource on storage-pool with shared space then:

check if this storage-pool contains other diskful resource

turn this diskful replica to diskless without removing backing LV

turn diskless replica to diskful on requested node

ghernadi commented 1 year ago

@ghernadi is it possible to perform this procedure by request if one of resouce is already InUse? Eg, by invoking drbdadm disconnect with --force, then toggling it into diskful?

Honestly, I would like to avoid using disconnect --force as that could quite easily lead to unintended behavior.

kvaps commented 1 year ago

So sad, I don't see any option to perform the live-migration for VMs in the shared-pool and keep data locality for them :(

kvaps commented 1 year ago

Okay, I made a bit investigation and found that resources with external metadata could have more than one diskful drbd replica on shared-lun. I'm not sure how this configuration is dangerous, since all the data will be written on same device twice but it works with no problems. In theory protocol C should make it safe.

Considering the fact that this is the only possible way to live-migrate the vm on shared-lun and keep data-locality, I would suggest the following changes:

do not allow using internal metadata on shared pool at all (this also covers the case when user requested various layer stack for two resources on same shared pool)
do not block creation of more than one diskful drbd replica on shared pool, but show the warning: "Having two diskful replicas will double the number of write requests on backing block device"

Ideally I would love to have the opportunity from DRBD to work with shared meta-disk, or at least suppress write requests between the nodes with shared data-disk. This way all the replicas on shared-pool could be diskful (as they really are)

kvaps commented 1 year ago

Or some option to explain DRBD that data between two peers is always Consistent to not perform actual synchronization between them

LINBIT / linstor-server

Could not create multiple DRBD replicas on top of shared LUN #340