LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

DRBD resources on diskful nodes are diskless after resize if hosts are hard power-cycled #423

Closed. modcritical closed this issue 1 month ago.

modcritical commented 1 month ago

I encountered an issue with a Proxmox cluster running Linstor where, after a power outage and full cluster reboot, VMs that had disks resized failed to start. I narrowed the issue down to a problem with DRBD replicas in Secondary state when one replica is Primary and Linstor is used to resize the volume.

If a Linstor volume is resized while it is Primary on one host, and a DRBD Secondary for that volume is then powered off, the Secondary's DRBD resource will not come back online after the host is up again. If the Primary is power-cycled, its DRBD resource will sometimes fail to come back online as well. If multiple nodes are power-cycled, all Secondaries, and sometimes the Primary, are affected.

The issue can be identified by running drbdadm adjust on power-cycled replicas when they are back up, which fails with an error saying there is no activity log. This results in an inability to re-establish quorum when multiple hosts power-cycle at once, such as during a power outage.

This can be fixed by running drbdadm create-md pm-nnnnnnnn on the nodes that have backing volumes but are showing as diskless, then, if necessary, forcing one node to Primary to start a resync.
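For reference, the recovery sequence above looks roughly like this on an affected node (a sketch; pm-nnnnnnnn is a placeholder resource name, and whether the forced Primary step is needed depends on the cluster state):

```shell
# Hypothetical recovery on a node whose DRBD resource shows Diskless despite
# having a backing volume. CAUTION: create-md overwrites the existing DRBD
# metadata, so only run it on the broken node, never on a healthy peer.

RES=pm-nnnnnnnn   # placeholder resource name

# Re-create the DRBD metadata on the backing device (prompts twice for 'yes')
drbdadm create-md "$RES"

# Re-attach the disk and reconnect to the peers
drbdadm adjust "$RES"

# If no sync starts on its own, force a node that still has good data to
# Primary so it becomes the sync source (run on that node, not this one):
# drbdadm primary --force "$RES"
```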

This issue occurs no matter how long ago the disk was resized. The first time I encountered it was during a power outage more than two months after resizing a VM disk, and that DRBD resource did not come back online.

The issue does not occur if the host is shut down gracefully (shutdown -h now); if the host is power-cycled after a graceful shutdown or reboot, the issue will not occur. The issue also does not occur if the volume is resized outside of Linstor (lvextend or zfs set volsize, then drbdadm resize, with or without --assume-clean). I am not sure what Linstor is doing differently in the resize process.
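The manual path mentioned above, following the documented DRBD resize process, is roughly this (a sketch; the resource name, volume group, and new size are placeholders for my test setup):

```shell
RES=pm-nnnnnnnn                 # placeholder resource name
LV="cssd1/${RES}_00000"         # placeholder backing LV (volume group cssd1)

# 1. Grow the backing device on every diskful node
lvextend -L 22G "/dev/$LV"
# (or, for a ZFS-backed pool: zfs set volsize=22G "pool/${RES}_00000")

# 2. On one node, tell DRBD to pick up the new size; this resyncs the
#    grown area unless --assume-clean is passed
drbdadm resize "$RES"
# drbdadm -- --assume-clean resize "$RES"   # variant: skip resync of new area
```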

My full notes on troubleshooting this issue are attached. Reproduction attempts 3 through 7 are the most interesting; attempts 1 and 2 contain a number of actions and thoughts that turned out to be unrelated.
Linstor Single Replica after Resize then Cluster Restart.md

Debugging Summary

Oct 16 07:14:52 pve2 Satellite[1256]: 2024-10-16 07:14:52.314 [DeviceManager] ERROR LINSTOR/Satellite/38ea23 SYSTEM - Failed to adjust DRBD resource pm-e4e56ebe [Report number 670FAE0D-F21CA-000000]
Oct 16 07:15:12 pve2 Satellite[1256]: 2024-10-16 07:15:12.787 [DeviceManager] ERROR LINSTOR/Satellite/1dd90e SYSTEM - Failed to adjust DRBD resource pm-e4e56ebe [Report number 670FAE0D-F21CA-000001]

Linstor controller and satellite are version 1.29.1.

DRBD version info:


DRBDADM_BUILDTAG=GIT-hash:\ ba2ce9037989b6141222c7901d1219cf852949f1\ build\ by\ @buildsystem\,\ 2024-05-06\ 07:39:11
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x09020b
DRBD_KERNEL_VERSION=9.2.11
DRBDADM_VERSION_CODE=0x091c00
DRBDADM_VERSION=9.28.0
rp- commented 1 month ago

Thanks for the very nice bug report. Any chance you have any RAID controller with caching in your setup or any other type of volatile cache in your backend storage?

modcritical commented 1 month ago

This issue was produced with a single backing disk on each node, attached to a SATA controller in AHCI mode, no RAID. I reproduced the issue with both enterprise and consumer SSD backing disks (Intel S4620 and Crucial MX500). The issue also occurs if I run sync ; sync on each node before power-cycling, which flushes any dirty pages to disk before the power-cycle.
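For anyone trying to reproduce this: the "hard power-cycle" in my tests is simulated with the kernel's sysrq trigger, flushing first so it is not simply an unwritten-page-cache problem:

```shell
# Flush dirty pages so the test is not about unwritten page cache
sync ; sync

# Immediately reboot without syncing or unmounting, emulating a power loss
# (requires sysrq enabled, e.g. kernel.sysrq=1; this reboots the machine!)
echo b > /proc/sysrq-trigger
```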

I know the log of stuff I tried is very long, so let me call out this part here:

~~The issue does not occur if the volume is resized outside of Linstor - lvextend or zfs set volsize then drbdadm resize, with or without --assume-clean~~

~~It appears there is something different being done when Linstor resizes the DRBD resource vs. the DRBD documented resize process. I think that is the closest I got to the root of the issue.~~

EDIT: The above appears incorrect - I tried again and was not able to manually perform a resize that didn't exhibit this issue after a power-cycle of a host where the volume was in Secondary state.

(Also, I see now that I had check-resize in the drbdadm command in my original post. That was incorrect; I was using the resize command. I verified this in my bash_history and have edited the original post accordingly.)

ghernadi commented 1 month ago

My first guess was also some caching issue... I tried but failed to reproduce this, although I am only running DRBD and LINSTOR in VMs.

It appears there is something different being done when Linstor resizes the DRBD resource vs. the DRBD documented resize process. I think that is the closest I got to the root of the issue.

You might want to enable trace logging on the satellites, using linstor c set-log-level TRACE --global. With this LINSTOR will log every external command it executes, including LVM, ZFS and DRBD related commands. Right now, it is quite difficult for us to investigate since we can only guess without the ability to reproduce or test anything.
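For example (I am assuming here that INFO is the default level to restore afterwards):

```shell
# Enable TRACE logging on all satellites; LINSTOR then logs every external
# command it executes (lvm, zfs, drbdadm, drbdsetup) to the satellite log
linstor controller set-log-level TRACE --global

# ... reproduce the issue, then collect syslog/journal from each satellite ...

# Restore the default level afterwards (assumed to be INFO)
linstor controller set-log-level INFO --global
```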

jesulo commented 1 month ago

Hi, is it possible to shrink (reduce the size of) ZFS-backed storage with Linstor? Cheers!

modcritical commented 1 month ago

@ghernadi - I have repeated the process with trace logging enabled.

Notably, when I attempted to re-create the resize sequence that had not failed earlier, both tests reproduced the problem on the Secondary. I said in my initial report that manually using lvextend and drbdadm resize does not cause the problem, but that appears to be incorrect.

For the normal process (resizing with Linstor), here are all the syslog entries from around the time of the resize:
linstor-resize-bug-trace-log-pve1-20241017.txt
linstor-resize-bug-trace-log-pve2-20241017.txt
linstor-resize-bug-trace-log-pve3-20241017.txt

And here are the same logs with all 3 nodes interleaved and sorted by time, in case the order of events between nodes matters:
linstor-resize-bug-trace-log-allnodes-20241017.txt

Here are my notes from the various scenarios I tried today:

Resize with Linstor

root@pve3:~# date ; linstor vd set-size pm-1f408657 0 22G ; date
Thu Oct 17 10:27:23 AM CDT 2024
SUCCESS:
Description:
    Volume definition with number '0' of resource definition 'pm-1f408657' modified.
Details:
    Volume definition with number '0' of resource definition 'pm-1f408657' UUID is: 57322b05-295d-44a8-b224-87e4e951a119
SUCCESS:
Description:
    (pve2) Resource 'pm-1f408657' [DRBD] not adjusted
Cause:
    This happened most likely because the layer below did not provide a device to work with.
SUCCESS:
    Updated volume 0 of 'pm-1f408657' on 'pve2'
SUCCESS:
    (pve1) Volume number 0 of resource 'pm-1f408657' [LVM-Thin] resized
SUCCESS:
Description:
    (pve1) Resource 'pm-1f408657' [DRBD] not adjusted
Cause:
    This happened most likely because the layer below did not provide a device to work with.
SUCCESS:
    Updated volume 0 of 'pm-1f408657' on 'pve1'
SUCCESS:
    (pve3) Volume number 0 of resource 'pm-1f408657' [LVM-Thin] resized
SUCCESS:
Description:
    (pve3) Resource 'pm-1f408657' [DRBD] not adjusted
Cause:
    This happened most likely because the layer below did not provide a device to work with.
SUCCESS:
    Updated volume 0 of 'pm-1f408657' on 'pve3'
SUCCESS:
Description:
    (pve2) Resource 'pm-1f408657' [DRBD] not adjusted
Cause:
    This happened most likely because the layer below did not provide a device to work with.
SUCCESS:
    Resized resource 'pm-1f408657' on 'pve2'
SUCCESS:
Description:
    (pve1) Resource 'pm-1f408657' [DRBD] not adjusted
Cause:
    This happened most likely because the layer below did not provide a device to work with.
SUCCESS:
    Resized resource 'pm-1f408657' on 'pve1'
SUCCESS:
Description:
    (pve3) Resource 'pm-1f408657' [DRBD] not adjusted
Cause:
    This happened most likely because the layer below did not provide a device to work with.
SUCCESS:
    Resized resource 'pm-1f408657' on 'pve3'
SUCCESS:
Description:
    (pve2) Resource 'pm-1f408657' [DRBD] not adjusted
Cause:
    This happened most likely because the layer below did not provide a device to work with.
SUCCESS:
Description:
    (pve3) Resource 'pm-1f408657' [DRBD] not adjusted
Cause:
    This happened most likely because the layer below did not provide a device to work with.
SUCCESS:
    (pve1) Resource 'pm-1f408657' [DRBD] adjusted.
SUCCESS:
    Resized DRBD resource 'pm-1f408657' on 'pve1'
SUCCESS:
    (pve2) Resource 'pm-1f408657' [DRBD] adjusted.
SUCCESS:
    Updated volume 0 of 'pm-1f408657' on 'pve2'
SUCCESS:
    (pve1) Resource 'pm-1f408657' [DRBD] adjusted.
SUCCESS:
    Updated volume 0 of 'pm-1f408657' on 'pve1'
SUCCESS:
    (pve3) Resource 'pm-1f408657' [DRBD] adjusted.
SUCCESS:
    Updated volume 0 of 'pm-1f408657' on 'pve3'
Thu Oct 17 10:27:24 AM CDT 2024

root@pve1:~# drbdadm status pm-1f408657
pm-1f408657 role:Secondary
  disk:UpToDate
  pve2 role:Secondary
    peer-disk:Diskless
  pve3 role:Primary
    peer-disk:UpToDate

root@pve1:~# sync ; sync
root@pve1:~# echo b > /proc/sysrq-trigger

--- HARD REBOOT pve1 ---

root@pve1:~# drbdadm status pm-1f408657
pm-1f408657 role:Secondary
  disk:Diskless quorum:no
  pve2 connection:StandAlone
  pve3 connection:StandAlone

root@pve1:~# drbdadm create-md pm-1f408657
You want me to create a v09 style flexible-size internal meta data block.
There appears to be a v09 flexible-size internal meta data block
already in place on /dev/cssd1/pm-1f408657_00000 at byte offset 23630704640

Do you really want to overwrite the existing meta-data?
[need to type 'yes' to confirm] yes

md_offset 23630704640
al_offset 23630671872
bm_offset 23629946880

Found some data

 ==> This might destroy existing data! <==

Do you want to proceed?
[need to type 'yes' to confirm] yes

initializing activity log
initializing bitmap (708 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.
root@pve1:~# drbdadm adjust pm-1f408657
root@pve1:~# drbdadm status pm-1f408657
pm-1f408657 role:Secondary
  disk:Inconsistent
  pve2 role:Secondary
    peer-disk:Diskless resync-suspended:dependency
  pve3 role:Primary
    replication:SyncTarget peer-disk:UpToDate done:0.04

root@pve1:~# drbdadm status pm-1f408657
pm-1f408657 role:Secondary
  disk:Inconsistent
  pve2 role:Secondary
    peer-disk:Diskless resync-suspended:dependency
  pve3 role:Primary
    replication:SyncTarget peer-disk:UpToDate done:0.57

^^^^ PROBLEM REPRODUCED ^^^^

Repeat, But run all commands manually and in the same order as Linstor

root@pve3:~# lvresize --config 'devices { filter=['"'"'a|/dev/sdb|'"'"','"'"'r|.*|'"'"'] }' --size 23076864k cssd1/pm-0726223e_00000 -f
  Size of logical volume cssd1/pm-0726223e_00000 changed from <20.01 GiB (5122 extents) to <22.01 GiB (5634 extents).
  Logical volume cssd1/pm-0726223e_00000 successfully resized.

root@pve1:~# lvresize --config 'devices { filter=['"'"'a|/dev/sdc|'"'"','"'"'r|.*|'"'"'] }' --size 23076864k cssd1/pm-0726223e_00000 -f
  Size of logical volume cssd1/pm-0726223e_00000 changed from <20.01 GiB (5122 extents) to <22.01 GiB (5634 extents).
  Logical volume cssd1/pm-0726223e_00000 successfully resized.

root@pve1:~# drbdsetup wait-connect-resource --wait-after-sb=yes --wfc-timeout=10 pm-0726223e
degr-wfc-timeout has to be shorter than wfc-timeout
degr-wfc-timeout implicitly set to wfc-timeout (10s)
outdated-wfc-timeout has to be shorter than degr-wfc-timeout
outdated-wfc-timeout implicitly set to degr-wfc-timeout (10s)

root@pve1:~# drbdadm -vvv -- --assume-clean resize pm-0726223e/0
drbdsetup resize 1018 --assume-clean
drbdsetup check-resize 1018

root@pve1:~# drbdadm -vvv adjust pm-0726223e

root@pve2:~# drbdadm -vvv adjust pm-0726223e

root@pve3:~# drbdadm -vvv adjust pm-0726223e

root@pve1:~# drbdadm -vvv adjust pm-0726223e

root@pve1:~# sync ; sync
root@pve1:~# echo b > /proc/sysrq-trigger

---- HARD REBOOT OF pve1 ----

root@pve1:~# drbdadm status pm-0726223e
pm-0726223e role:Secondary
  disk:Diskless quorum:no
  pve2 connection:StandAlone
  pve3 connection:StandAlone

root@pve1:~# drbdadm create-md pm-0726223e
You want me to create a v09 style flexible-size internal meta data block.
There appears to be a v09 flexible-size internal meta data block
already in place on /dev/cssd1/pm-0726223e_00000 at byte offset 23630704640

Do you really want to overwrite the existing meta-data?
[need to type 'yes' to confirm] yes

md_offset 23630704640
al_offset 23630671872
bm_offset 23629946880

Found some data

 ==> This might destroy existing data! <==

Do you want to proceed?
[need to type 'yes' to confirm] yes

initializing activity log
initializing bitmap (708 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.
root@pve1:~# drbdadm adjust pm-0726223e
root@pve1:~# drbdadm status pm-0726223e
pm-0726223e role:Secondary
  disk:Inconsistent
  pve2 role:Secondary
    peer-disk:Diskless resync-suspended:dependency
  pve3 role:Primary
    replication:SyncTarget peer-disk:UpToDate done:2.85

^^^^ PROBLEM REPRODUCED ^^^^

Repeat, but with drbdadm resize on pve3 instead of pve1

root@pve3:~# lvresize --config 'devices { filter=['"'"'a|/dev/sdb|'"'"','"'"'r|.*|'"'"'] }' --size 23076864k cssd1/pm-68f10490_00000 -f
  Size of logical volume cssd1/pm-68f10490_00000 changed from <20.01 GiB (5122 extents) to <22.01 GiB (5634 extents).
  Logical volume cssd1/pm-68f10490_00000 successfully resized.

root@pve1:~# lvresize --config 'devices { filter=['"'"'a|/dev/sdc|'"'"','"'"'r|.*|'"'"'] }' --size 23076864k cssd1/pm-68f10490_00000 -f
  Size of logical volume cssd1/pm-68f10490_00000 changed from <20.01 GiB (5122 extents) to <22.01 GiB (5634 extents).
  Logical volume cssd1/pm-68f10490_00000 successfully resized.

root@pve3:~# drbdsetup wait-connect-resource --wait-after-sb=yes --wfc-timeout=10 pm-68f10490
degr-wfc-timeout has to be shorter than wfc-timeout
degr-wfc-timeout implicitly set to wfc-timeout (10s)
outdated-wfc-timeout has to be shorter than degr-wfc-timeout
outdated-wfc-timeout implicitly set to degr-wfc-timeout (10s)

root@pve3:~# drbdadm -vvv -- --assume-clean resize pm-68f10490/0
drbdsetup resize 1018 --assume-clean
drbdsetup check-resize 1018

root@pve3:~# drbdadm -vvv adjust pm-68f10490/0

root@pve2:~# drbdadm -vvv adjust pm-68f10490/0

root@pve1:~# drbdadm -vvv adjust pm-68f10490/0

root@pve3:~# drbdadm -vvv adjust pm-68f10490/0

root@pve1:~# sync ; sync
root@pve1:~# echo b > /proc/sysrq-trigger

root@pve1:~# uptime
 13:50:14 up 2 min,  2 users,  load average: 0.80, 0.41, 0.16
root@pve1:~# drbdadm status pm-68f10490
pm-68f10490 role:Secondary
  disk:Diskless quorum:no
  pve2 connection:StandAlone
  pve3 connection:StandAlone

^^^^ PROBLEM REPRODUCED ^^^^

Repeat, with vanilla drbdadm resize command

root@pve3:~# drbdadm status pm-b6bf9394
pm-b6bf9394 role:Primary
  disk:UpToDate
  pve1 role:Secondary
    peer-disk:UpToDate
  pve2 role:Secondary
    peer-disk:Diskless

root@pve3:~# linstor vd list -r pm-b6bf9394
╭───────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ VolumeNr ┊ VolumeMinor ┊ Size      ┊ Gross ┊ State ┊
╞═══════════════════════════════════════════════════════════════════╡
┊ pm-b6bf9394  ┊ 0        ┊ 1018        ┊ 20.00 GiB ┊       ┊ ok    ┊
╰───────────────────────────────────────────────────────────────────╯
root@pve3:~# linstor v list -r pm-b6bf9394
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource    ┊ StoragePool          ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊      State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pve1 ┊ pm-b6bf9394 ┊ cssd1                ┊     0 ┊    1018 ┊ /dev/drbd1018 ┊ 20.01 GiB ┊ Unused ┊   UpToDate ┊
┊ pve2 ┊ pm-b6bf9394 ┊ DfltDisklessStorPool ┊     0 ┊    1018 ┊ /dev/drbd1018 ┊           ┊ Unused ┊ TieBreaker ┊
┊ pve3 ┊ pm-b6bf9394 ┊ cssd1                ┊     0 ┊    1018 ┊ /dev/drbd1018 ┊ 20.01 GiB ┊ InUse  ┊   UpToDate ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

root@pve3:~# lvresize --config 'devices { filter=['"'"'a|/dev/sdb|'"'"','"'"'r|.*|'"'"'] }' --size 23076864k cssd1/pm-b6bf9394_00000 -f
  Size of logical volume cssd1/pm-b6bf9394_00000 changed from <20.01 GiB (5122 extents) to <22.01 GiB (5634 extents).
  Logical volume cssd1/pm-b6bf9394_00000 successfully resized.

root@pve1:~# lvresize --config 'devices { filter=['"'"'a|/dev/sdc|'"'"','"'"'r|.*|'"'"'] }' --size 23076864k cssd1/pm-b6bf9394_00000 -f
  Size of logical volume cssd1/pm-b6bf9394_00000 changed from <20.01 GiB (5122 extents) to <22.01 GiB (5634 extents).
  Logical volume cssd1/pm-b6bf9394_00000 successfully resized.

root@pve3:~# drbdadm resize pm-b6bf9394
root@pve3:~# drbdadm status pm-b6bf9394
pm-b6bf9394 role:Primary
  disk:UpToDate
  pve1 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:12.33
  pve2 role:Secondary
    peer-disk:Diskless

root@pve3:~# drbdadm status pm-b6bf9394
pm-b6bf9394 role:Primary
  disk:UpToDate
  pve1 role:Secondary
    peer-disk:UpToDate
  pve2 role:Secondary
    peer-disk:Diskless

root@pve1:~# sync ; sync
root@pve1:~# echo b > /proc/sysrq-trigger

---- HARD REBOOT OF pve1 ----

root@pve1:~# uptime
 14:31:23 up 5 min,  2 users,  load average: 0.29, 0.44, 0.25
root@pve1:~# drbdadm status pm-68f10490
'pm-68f10490' not defined in your config (for this host).
root@pve1:~# drbdadm status pm-b6bf9394
pm-b6bf9394 role:Secondary
  disk:Diskless quorum:no
  pve2 connection:StandAlone
  pve3 connection:StandAlone

^^^^ PROBLEM REPRODUCED ^^^^

Repeat, with the process I thought did not result in the issue earlier

root@pve3:~# linstor vd list -r pm-bbc87056
╭───────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ VolumeNr ┊ VolumeMinor ┊ Size      ┊ Gross ┊ State ┊
╞═══════════════════════════════════════════════════════════════════╡
┊ pm-bbc87056  ┊ 0        ┊ 1018        ┊ 20.00 GiB ┊       ┊ ok    ┊
╰───────────────────────────────────────────────────────────────────╯
root@pve3:~# linstor v list -r pm-bbc87056
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource    ┊ StoragePool          ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊      State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pve1 ┊ pm-bbc87056 ┊ cssd1                ┊     0 ┊    1018 ┊ /dev/drbd1018 ┊ 20.01 GiB ┊ Unused ┊   UpToDate ┊
┊ pve2 ┊ pm-bbc87056 ┊ DfltDisklessStorPool ┊     0 ┊    1018 ┊ /dev/drbd1018 ┊           ┊ Unused ┊ TieBreaker ┊
┊ pve3 ┊ pm-bbc87056 ┊ cssd1                ┊     0 ┊    1018 ┊ /dev/drbd1018 ┊ 20.01 GiB ┊ InUse  ┊   UpToDate ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

root@pve3:~# lvextend -L22G /dev/cssd1/pm-bbc87056_00000
  Size of logical volume cssd1/pm-bbc87056_00000 changed from <20.01 GiB (5122 extents) to 22.00 GiB (5632 extents).
  Logical volume cssd1/pm-bbc87056_00000 successfully resized.
root@pve3:~# drbdadm resize pm-bbc87056

root@pve1:~# drbdadm status pm-bbc87056
pm-bbc87056 role:Secondary
  disk:Inconsistent
  pve2 role:Secondary
    peer-disk:Diskless resync-suspended:dependency
  pve3 role:Primary
    replication:SyncTarget peer-disk:UpToDate done:18.65

root@pve1:~# drbdadm status pm-bbc87056
pm-bbc87056 role:Secondary
  disk:UpToDate
  pve2 role:Secondary
    peer-disk:Diskless
  pve3 role:Primary
    peer-disk:UpToDate

root@pve1:~# sync ; sync
root@pve1:~# echo b > /proc/sysrq-trigger

---- HARD REBOOT OF pve1 ----

root@pve1:~# drbdadm status pm-bbc87056
pm-bbc87056 role:Secondary
  disk:Diskless quorum:no
  pve2 connection:StandAlone
  pve3 connection:StandAlone

Same problem, so this might be at the DRBD layer rather than in Linstor?

modcritical commented 1 month ago

I was able to re-create this with just DRBD, in VMs, so this is not actually a Linstor bug. This can probably be closed; I will open a bug report on the drbd repo.