LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

Endless pvdisplay causes linstor-satellite malfunctioning #333

Open kvaps opened 1 year ago

kvaps commented 1 year ago

Hi, I just ran into a problem where creating a device got stuck. Using strace, I found out that this is due to a stuck pvdisplay command. It holds the /run/lock/lvm/V_data lock, so any other run gets stuck forever.

None of the commands, such as linstor resource create or linstor storage-pool list, work.

Steps to reproduce:

flock -x /run/lock/lvm/V_data sleep infinity
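
The effect of this reproduction step can also be observed without hanging a shell. The following is a minimal Python sketch, assuming a Linux host and a scratch file standing in for /run/lock/lvm/V_data; the helper names try_lock and release are made up for illustration and are not part of LVM or LINSTOR. It takes an exclusive flock on one descriptor and shows that a second, non-blocking attempt fails while the first lock is held.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive flock without blocking.

    Returns the open file descriptor on success, or None if another
    open file description already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release(fd):
    """Drop the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


if __name__ == "__main__":
    # Scratch file used instead of the real LVM lock file (assumption).
    lock_path = os.path.join(tempfile.mkdtemp(), "V_data")
    holder = try_lock(lock_path)          # plays the role of "flock -x ... sleep infinity"
    print("second attempt got lock:", try_lock(lock_path) is not None)
    release(holder)
```

A blocking flock call (without LOCK_NB) in place of the second attempt would wait until the holder releases the lock, which matches the hang described above.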

Possible workaround: use --nolocking for harmless commands
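
To make the workaround concrete, here is a small Python sketch of how a caller might append --nolocking only to commands it considers read-only. The set of command names and the helper with_nolocking are assumptions for illustration, not LINSTOR code, and whether skipping the lock is really safe for each of these commands would still need to be verified.

```python
# Commands treated as read-only in this sketch; the set is an assumption.
READ_ONLY_COMMANDS = {"pvdisplay", "vgs", "lvs"}


def with_nolocking(argv):
    """Return a copy of argv with --nolocking appended for read-only commands.

    Commands outside READ_ONLY_COMMANDS are returned unchanged, and the
    flag is never added twice.
    """
    if argv and argv[0] in READ_ONLY_COMMANDS and "--nolocking" not in argv:
        return list(argv) + ["--nolocking"]
    return list(argv)


if __name__ == "__main__":
    cmd = ["pvdisplay", "--columns", "-o", "pv_name",
           "-S", "vg_name=data", "--noheadings", "--nosuffix"]
    print(" ".join(with_nolocking(cmd)))
```

Keeping the allowlist explicit means a command that modifies metadata never loses its lock by accident.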

Proposed solution: add a timeout for command execution

https://asciinema.org/a/C3nEJltt4ln0dskCbtjIXjKWj

ERROR REPORT 639C8098-D7095-000019

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.20.0
Build ID:                           9c6f7fad48521899f7a99c564b1d33aeacfdbfa8
Build time:                         2022-11-07T16:37:38+00:00
Error time:                         2022-12-22 12:21:28
Node:                               gpnvkc-w2

============================================================

Reported error:
===============

Description:
    Failed to get physical devices for volume group: data
Cause:
    External command timed out
Additional information:
    External command: pvdisplay --columns -o pv_name -S vg_name=data --noheadings --nosuffix

Category:                           LinStorException
Class name:                         StorageException
Class canonical name:               com.linbit.linstor.storage.StorageException
Generated at:                       Method 'genericExecutor', Source file 'Commands.java', Line #118

Error message:                      Failed to get physical devices for volume group: data

Error context:
    An error occurred while processing resource 'Node: 'gpnvkc-w2', Rsc: 'test''

Call backtrace:

    Method                                   Native Class:Line number
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:118
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:61
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:49
    listPhysicalVolumes                      N      com.linbit.linstor.layer.storage.lvm.utils.LvmCommands:619
    getPhysicalVolumes                       N      com.linbit.linstor.layer.storage.lvm.utils.LvmUtils:471
    getLvmConfig                             N      com.linbit.linstor.layer.storage.lvm.utils.LvmUtils:113
    recacheLvmConfig                         N      com.linbit.linstor.layer.storage.lvm.utils.LvmUtils:163
    execWithRetry                            N      com.linbit.linstor.layer.storage.lvm.utils.LvmUtils:499
    createLvImpl                             N      com.linbit.linstor.layer.storage.lvm.LvmThinProvider:141
    createLvImpl                             N      com.linbit.linstor.layer.storage.lvm.LvmThinProvider:48
    createVolumes                            N      com.linbit.linstor.layer.storage.AbsStorageProvider:629
    process                                  N      com.linbit.linstor.layer.storage.AbsStorageProvider:399
    process                                  N      com.linbit.linstor.layer.storage.StorageLayer:311
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processChild                             N      com.linbit.linstor.layer.drbd.DrbdLayer:459
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:580
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

Caused by:
==========

Category:                           Exception
Class name:                         ChildProcessTimeoutException
Class canonical name:               com.linbit.ChildProcessTimeoutException
Generated at:                       Method 'waitFor', Source file 'ChildProcessHandler.java', Line #141

Call backtrace:

    Method                                   Native Class:Line number
    waitFor                                  N      com.linbit.extproc.ChildProcessHandler:141
    syncProcess                              N      com.linbit.extproc.ExtCmd:156
    exec                                     N      com.linbit.extproc.ExtCmd:90
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:77
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:61
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:49
    listPhysicalVolumes                      N      com.linbit.linstor.layer.storage.lvm.utils.LvmCommands:619
    getPhysicalVolumes                       N      com.linbit.linstor.layer.storage.lvm.utils.LvmUtils:471
    getLvmConfig                             N      com.linbit.linstor.layer.storage.lvm.utils.LvmUtils:113
    recacheLvmConfig                         N      com.linbit.linstor.layer.storage.lvm.utils.LvmUtils:163
    execWithRetry                            N      com.linbit.linstor.layer.storage.lvm.utils.LvmUtils:499
    createLvImpl                             N      com.linbit.linstor.layer.storage.lvm.LvmThinProvider:141
    createLvImpl                             N      com.linbit.linstor.layer.storage.lvm.LvmThinProvider:48
    createVolumes                            N      com.linbit.linstor.layer.storage.AbsStorageProvider:629
    process                                  N      com.linbit.linstor.layer.storage.AbsStorageProvider:399
    process                                  N      com.linbit.linstor.layer.storage.StorageLayer:311
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processChild                             N      com.linbit.linstor.layer.drbd.DrbdLayer:459
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:580
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

END OF ERROR REPORT.

ghernadi commented 1 year ago

Hello,

thanks for the suggestions. We will consider the --nolocking option, which does sound good, but we still need to make sure that it is indeed harmless for pvdisplay, vgs, lvs and similar "read-only" operations.

Regarding the timeouts: Linstor already has a timeout, otherwise you would not be able to show an ErrorReport with a ChildProcessTimeoutException :)
By default, Linstor waits 45 seconds for the child process, then another 15 seconds while it tries to kill the process, and finally another 5 seconds while it tries to kill the process forcibly if the previous attempt did not work. In total, Linstor waits at most 65 seconds (45 + 15 + 5) if a command neither terminates nor lets itself be killed. In the demo you showed, I guess you simply did not wait long enough?
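
The three phases described above can be sketched as follows. This is a Python illustration only, not Linstor's Java implementation: the function name run_with_escalation, the constant names, the returned labels and the use of SIGTERM followed by SIGKILL are assumptions. Only the 45, 15 and 5 second defaults are taken from the description.

```python
import subprocess

# Defaults taken from the description above; the names are hypothetical.
WAIT_S = 45.0   # normal wait for the child process
TERM_S = 15.0   # grace period after asking the process to terminate
KILL_S = 5.0    # grace period after forcibly killing the process
MAX_WAIT_S = WAIT_S + TERM_S + KILL_S  # 65 seconds in the worst case


def run_with_escalation(cmd, wait_s=WAIT_S, term_s=TERM_S, kill_s=KILL_S):
    """Run cmd and escalate if it does not finish in time.

    Returns "exited", "terminated", "killed" or "unkillable" depending on
    which phase ended the wait.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        proc.wait(timeout=wait_s)
        return "exited"
    except subprocess.TimeoutExpired:
        pass

    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=term_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        pass

    proc.kill()  # forcible kill (SIGKILL on POSIX)
    try:
        proc.wait(timeout=kill_s)
        return "killed"
    except subprocess.TimeoutExpired:
        # The process could not be reaped even after the forcible kill.
        return "unkillable"


if __name__ == "__main__":
    # Short timeouts so the demo finishes quickly.
    print(run_with_escalation(["sleep", "30"], wait_s=0.5, term_s=2.0, kill_s=1.0))
```

With the defaults, a caller is blocked for at most MAX_WAIT_S seconds per command, which is why a hung command surfaces as a timeout error only after roughly a minute.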