LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

Short timeout for drbdadm operations #323

Open dimm0 opened 2 years ago

dimm0 commented 2 years ago

I’m seeing too short a timeout for drbdadm (volume provisioning) and mkfs operations, which makes it impossible to create or format large volumes. It keeps retrying, but leaves a broken volume behind every time.

I first tried it on a 1PB ZFS node and provisioned several volumes from 1PB to 100TB; all of them failed. Then, on an mdraid node, it was unable to run mkfs.xfs on a 50TB volume, which takes more than a minute to complete. Smaller (50GB) volumes work fine.
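To illustrate the failure mode described above, here is a minimal Python sketch. It is not LINSTOR's code: `run_with_timeout` is a hypothetical helper, and a sleeping child process stands in for a slow `mkfs.xfs` or `drbdadm` call. It only shows why a fixed timeout shorter than the command's runtime fails on every retry, while a quick command (the small-volume case) passes.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return True if it finished in time, False if it timed out.

    Hypothetical helper for illustration only. subprocess.run() kills the
    child when the timeout expires, so the command's work is left unfinished.
    """
    try:
        subprocess.run(cmd, check=True, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        return False


# Stand-ins (assumptions): a child that sleeps plays the role of formatting
# a large volume, and a child that exits at once plays the small volume.
slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
fast_cmd = [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    print("fast command ok:", run_with_timeout(fast_cmd, 10))
    # Retrying with the same fixed timeout cannot succeed, because the
    # command needs more time than the timeout allows on every attempt.
    for attempt in range(3):
        print("slow command attempt", attempt, "ok:",
              run_with_timeout(slow_cmd, 0.5))
```

If this matches what the satellite does, the timeout would need to be configurable or scaled with volume size rather than retried unchanged.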

Here’s the error:

ERROR REPORT 6361BD8C-BCA58-000000

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.20.0-rc.1
Build ID:                           095b9bef67d46f217ee394e0262c4e96baef0c45
Build time:                         2022-09-20T12:44:59+00:00
Error time:                         2022-11-02 01:07:59
Node:                               hcc-nrp-shor-c6005.unl.edu

============================================================

Reported error:
===============

Description:
    Failed to mfks /dev/drbd1000
Cause:
    External command timed out
Additional information:
    External command: mkfs.xfs -q /dev/drbd1000

Category:                           LinStorException
Class name:                         StorageException
Class canonical name:               com.linbit.linstor.storage.StorageException
Generated at:                       Method 'genericExecutor', Source file 'Commands.java', Line #118

Error message:                      Failed to mfks /dev/drbd1000

Error context:
    An error occurred while processing resource 'Node: 'hcc-nrp-shor-c6005.unl.edu', Rsc: 'pvc-1914fe33-bb42-4626-a878-6ac2e5d7882b''

Call backtrace:

    Method                                   Native Class:Line number
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:118
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:61
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:49
    makeFs                                   N      com.linbit.linstor.layer.storage.utils.MkfsUtils:93
    makeXfs                                  N      com.linbit.linstor.layer.storage.utils.MkfsUtils:115
    makeFileSystemOnMarked                   N      com.linbit.linstor.layer.storage.utils.MkfsUtils:202
    condInitialOrSkipSync                    N      com.linbit.linstor.layer.drbd.DrbdLayer:1653
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:806
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:394
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

Caused by:
==========

Category:                           Exception
Class name:                         ChildProcessTimeoutException
Class canonical name:               com.linbit.ChildProcessTimeoutException
Generated at:                       Method 'waitFor', Source file 'ChildProcessHandler.java', Line #133

Call backtrace:

    Method                                   Native Class:Line number
    waitFor                                  N      com.linbit.extproc.ChildProcessHandler:133
    syncProcess                              N      com.linbit.extproc.ExtCmd:156
    exec                                     N      com.linbit.extproc.ExtCmd:90
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:77
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:61
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:49
    makeFs                                   N      com.linbit.linstor.layer.storage.utils.MkfsUtils:93
    makeXfs                                  N      com.linbit.linstor.layer.storage.utils.MkfsUtils:115
    makeFileSystemOnMarked                   N      com.linbit.linstor.layer.storage.utils.MkfsUtils:202
    condInitialOrSkipSync                    N      com.linbit.linstor.layer.drbd.DrbdLayer:1653
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:806
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:394
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

END OF ERROR REPORT.

dimm0 commented 2 years ago

Resize also fails:

ERROR REPORT 6361DB5E-BCA58-000001

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.20.0
Build ID:                           9c6f7fad48521899f7a99c564b1d33aeacfdbfa8
Build time:                         2022-10-18T07:19:30+00:00
Error time:                         2022-11-02 03:07:23
Node:                               hcc-nrp-shor-c6005.unl.edu

============================================================

Reported error:
===============

Description:
    Failed to adjust DRBD resource pvc-3d40ff9a-ec83-4985-9d77-bc073256ad15

Category:                           LinStorException
Class name:                         ResourceException
Class canonical name:               com.linbit.linstor.core.devmgr.exceptions.ResourceException
Generated at:                       Method 'adjustDrbd', Source file 'DrbdLayer.java', Line #819

Error message:                      Failed to adjust DRBD resource pvc-3d40ff9a-ec83-4985-9d77-bc073256ad15

Error context:
    An error occurred while processing resource 'Node: 'hcc-nrp-shor-c6005.unl.edu', Rsc: 'pvc-3d40ff9a-ec83-4985-9d77-bc073256ad15''

Call backtrace:

    Method                                   Native Class:Line number
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:819
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

Caused by:
==========

Description:
    Execution of the external command 'drbdadm' failed.
Cause:
    The external command did not complete within the timeout.
    Possible causes include:
    - The system load may be too high to ensure completion of external commands in a timely manner.
    - The program implementing the external command may not be operating properly.
    - The operating system may have entered an erroneous state.
Correction:
    Check whether the external program and the operating system are still operating properly.
    Check whether the system's load is within normal parameters.
Additional information:
    The full command line executed was:
    drbdadm -vvv resize pvc-3d40ff9a-ec83-4985-9d77-bc073256ad15/0

Category:                           LinStorException
Class name:                         ExtCmdFailedException
Class canonical name:               com.linbit.extproc.ExtCmdFailedException
Generated at:                       Method 'execute', Source file 'DrbdAdm.java', Line #598

Error message:                      The external command 'drbdadm' did not complete within the timeout

Call backtrace:

    Method                                   Native Class:Line number
    execute                                  N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:598
    resize                                   N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:122
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:644
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

Caused by:
==========

Category:                           Exception
Class name:                         ChildProcessTimeoutException
Class canonical name:               com.linbit.ChildProcessTimeoutException
Generated at:                       Method 'waitFor', Source file 'ChildProcessHandler.java', Line #133

Call backtrace:

    Method                                   Native Class:Line number
    waitFor                                  N      com.linbit.extproc.ChildProcessHandler:133
    syncProcess                              N      com.linbit.extproc.ExtCmd:156
    pipeExec                                 N      com.linbit.extproc.ExtCmd:104
    execute                                  N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:590
    resize                                   N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:122
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:644
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

END OF ERROR REPORT.

dimm0 commented 2 years ago

Anybody?

biozit commented 2 years ago

Up