LINBIT / drbd

LINBIT DRBD kernel module
https://docs.linbit.com/docs/users-guide-9.0/
GNU General Public License v2.0

DRBD resource stuck in SyncTarget state and unable to proceed with Syncing #40

Open · Dipanshu-Sehjal opened this issue 2 years ago

Dipanshu-Sehjal commented 2 years ago

Hi, after a serial/rolling upgrade of the k8s cluster, one of the DRBD resources was found stuck in the SyncTarget state.


Linstor version -
[root@flex-103 ~]# k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor --version
linstor 1.13.0; GIT-hash: 840cf57c75c166659509e22447b2c0ca6377ee6d

DRBD version -
[root@flex-103 ~]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-jrhwv -c linstor-satellite -- drbdadm --version
DRBDADM_BUILDTAG=GIT-hash:\ 087ee6b4961ca154d76e4211223b03149373bed8\ build\ by\ @buildsystem\,\ 2022-01-28\ 12:19:33
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090106
DRBD_KERNEL_VERSION=9.1.6
DRBDADM_VERSION_CODE=0x091402
DRBDADM_VERSION=9.20.2

Piraeus version - 1.8.0


Setup details -

The K8s cluster is a 3-node setup - 2 disk nodes and 1 diskless node, with Protocol C replication.

Disk node with the InUse resource stuck in SyncTarget: flex-106 (shortened to 106)
Disk node with the Unused, UpToDate resource: flex-107 (shortened to 107)
Diskless node with the Unused resource: flex-108 (shortened to 108)
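A minimal sketch for confirming this placement from the controller pod (using the same kubectl alias k, namespace, and controller deployment name as in the version command above):

k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor node list          # registered nodes
k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor resource list      # per-node placement of each resource, including the Diskless replica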

Some relevant info -

[root@flex-103 ~]# k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor r l | grep pvc-353143c5-e55d-4c75-98be-5248124dd160
| pvc-353143c5-e55d-4c75-98be-5248124dd160 | flex-106.dr.avaya.com | 7009 | InUse  | Ok    | SyncTarget(64.11%) | 2022-05-25 22:59:58 |
| pvc-353143c5-e55d-4c75-98be-5248124dd160 | flex-107.dr.avaya.com | 7009 | Unused | Ok    |           UpToDate | 2022-05-25 23:00:00 |
| pvc-353143c5-e55d-4c75-98be-5248124dd160 | flex-108.dr.avaya.com | 7009 | Unused | Ok    |           Diskless | 2022-05-25 22:59:59 |

DRBD logs from the disk node (flex-106) that has this PVC resource stuck in SyncTarget -

[root@flex-106 ccmuser]# grep pvc-353143c5-e55d-4c75-98be-5248124dd160  /var/log/messages 
May 25 17:51:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
May 25 17:51:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009: quorum( no -> yes )
May 25 17:51:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009: Would lose quorum, but using tiebreaker logic to keep
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 3657581006 (0->1 496/16)
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 3657581006: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFF9
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 3657581006 (3ms)
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( Connected -> Disconnecting ) peer( Secondary -> Unknown )
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: ack_receiver terminated
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Terminating ack_recv thread
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Terminating sender thread
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Starting sender thread (from drbd_r_pvc-3531 [7369])
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Connection closed
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( Disconnecting -> StandAlone )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Terminating receiver thread
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( StandAlone -> Unconnected )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Starting receiver thread (from drbd_w_pvc-3531 [7349])
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( Unconnected -> Connecting )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Handshake to peer 1 successful: Agreed network protocol version 121
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Peer authenticated using 20 bytes HMAC
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Starting ack_recv thread (from drbd_r_pvc-3531 [83213])
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 484053172 (0->1 499/146)
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: drbd_sync_handshake:
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: self 93E3EF437B3C9C06:0000000000000000:CB4D79960DC2245A:43030D49A0AAB1B8 bits:0 flags:100
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: peer 2744834DBA493B50:0000000000000000:93E3EF437B3C9C06:CB4D79960DC2245A bits:2589 flags:1100
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: uuid_compare()=target-set-bitmap by rule=history-peer
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: Setting and writing one bitmap slot, after drbd_sync_handshake
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 484053172: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFF8
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 484053172 (22ms)
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009: disk( Consistent -> Outdated )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009: disk( Outdated -> Inconsistent )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-108.dr.avaya.com: resync-susp( no -> connection dependency )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: repl( WFBitMapT -> SyncTarget )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: Began resync as SyncTarget (will sync 2100760 KB [525190 bits set]).
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 236192581 (0->2 8176/3088)
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 236192581: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFF9
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 236192581 (2ms)
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: conn( Connected -> Disconnecting ) peer( Primary -> Unknown )
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-108.dr.avaya.com: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: ack_receiver terminated
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Terminating ack_recv thread
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Terminating sender thread
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Starting sender thread (from drbd_r_pvc-3531 [7370])
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Connection closed
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: conn( Disconnecting -> StandAlone )
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Terminating receiver thread
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: conn( StandAlone -> Unconnected )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Starting receiver thread (from drbd_w_pvc-3531 [7349])
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: conn( Unconnected -> Connecting )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Handshake to peer 2 successful: Agreed network protocol version 121
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Peer authenticated using 20 bytes HMAC
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Starting ack_recv thread (from drbd_r_pvc-3531 [83820])
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 3661494316 (0->2 499/146)
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 3661494316: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFF8
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 3661494316 (6ms)
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: conn( Connecting -> Connected ) peer( Unknown -> Primary )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-108.dr.avaya.com: pdsk( DUnknown -> Diskless ) repl( Off -> Established )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: strategy = target-set-bitmap
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: drbd_sync_handshake:
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: self 93E3EF437B3C9C06:0000000000000000:CB4D79960DC2245A:43030D49A0AAB1B8 bits:333702 flags:104
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: peer 2744834DBA493B50:0000000000000000:93E3EF437B3C9C06:CB4D79960DC2245A bits:2589 flags:1100
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: uuid_compare()=target-set-bitmap by rule=history-peer
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: Setting and writing one bitmap slot, after drbd_sync_handshake
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009: ASSERTION current != device->resource->worker.task FAILED in drbd_bitmap_io
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: Becoming WFBitMapT because primary is diskless
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change failed: Can not start OV/resync since it is already active
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-108.dr.avaya.com: Failed: resync-susp( connection dependency -> no )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: Failed: repl( SyncTarget -> WFBitMapT )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: ...postponing this until current resync finished
May 25 19:14:22 flex-106 kubelet[2255]: I0525 19:14:22.787201    2255 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:22 flex-106 kubelet[2255]: E0525 19:14:22.787266    2255 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160 podName: nodeName:}" failed. No retries permitted until 2022-05-25 19:14:23.28724099 -0600 MDT m=+5071.728082294 (durationBeforeRetry 500ms). Error: Volume has not been added to the list of VolumesInUse in the node's volume status for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3")
May 25 19:14:22 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: BAD! enr=0 rs_left=-1 rs_failed=0 count=1 cstate=Connected SyncTarget
May 25 19:14:22 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: peer( Primary -> Secondary )
May 25 19:14:23 flex-106 kubelet[2255]: I0525 19:14:23.293106    2255 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:23 flex-106 kubelet[2255]: E0525 19:14:23.293189    2255 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160 podName: nodeName:}" failed. No retries permitted until 2022-05-25 19:14:24.293174982 -0600 MDT m=+5072.734016279 (durationBeforeRetry 1s). Error: Volume has not been added to the list of VolumesInUse in the node's volume status for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3")
May 25 19:14:24 flex-106 kubelet[2255]: I0525 19:14:24.301762    2255 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:24 flex-106 kubelet[2255]: E0525 19:14:24.301836    2255 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160 podName: nodeName:}" failed. No retries permitted until 2022-05-25 19:14:26.301822172 -0600 MDT m=+5074.742663467 (durationBeforeRetry 2s). Error: Volume has not been added to the list of VolumesInUse in the node's volume status for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3")
May 25 19:14:26 flex-106 kubelet[2255]: I0525 19:14:26.319722    2255 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:26 flex-106 kubelet[2255]: E0525 19:14:26.324195    2255 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160 podName: nodeName:}" failed. No retries permitted until 2022-05-25 19:14:30.324175949 -0600 MDT m=+5078.765017244 (durationBeforeRetry 4s). Error: Volume not attached according to node status for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3")
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.347525    2255 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.354760    2255 operation_generator.go:1523] Controller attach succeeded for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3") device path: ""
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.448399    2255 reconciler.go:269] "operationExecutor.MountVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.448643    2255 operation_generator.go:587] MountVolume.WaitForAttach entering for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3") DevicePath ""
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.453864    2255 operation_generator.go:597] MountVolume.WaitForAttach succeeded for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3") DevicePath "csi-a9e5b60d7e750fc54fdefe9a0b376ceb20ade3d5650f9c5533faaf827c6ac10e"
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.457200    2255 operation_generator.go:630] MountVolume.MountDevice succeeded for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3") device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-353143c5-e55d-4c75-98be-5248124dd160/globalmount"
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 266334151 (0->-1 3/1)
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 266334151: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFF8
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 266334151 (1ms)
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: role( Secondary -> Primary )
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: role( Primary -> Secondary )
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 195577500 (0->-1 3/1)
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 195577500: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFF8
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 195577500 (1ms)
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: role( Secondary -> Primary )
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.682158    2255 operation_generator.go:712] MountVolume.SetUp succeeded for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3")
May 25 20:00:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: BAD! enr=1 rs_left=-55 rs_failed=0 count=55 cstate=Connected SyncTarget
[root@flex-103 cust]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdadm dstate pvc-353143c5-e55d-4c75-98be-5248124dd160
Inconsistent/Diskless/UpToDate
[root@flex-103 cust]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdadm cstate pvc-353143c5-e55d-4c75-98be-5248124dd160
Connected
Connected
[root@flex-103 ~]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdadm status pvc-353143c5-e55d-4c75-98be-5248124dd160 --verbose
drbdsetup status pvc-353143c5-e55d-4c75-98be-5248124dd160
pvc-353143c5-e55d-4c75-98be-5248124dd160 role:Primary
  disk:Inconsistent
  flex-107.dr.avaya.com role:Secondary
    replication:SyncTarget peer-disk:UpToDate done:64.12
  flex-108.dr.avaya.com role:Secondary
    peer-disk:Diskless peer-client:yes resync-suspended:dependency

[root@flex-103 ~]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdsetup status pvc-353143c5-e55d-4c75-98be-5248124dd160
pvc-353143c5-e55d-4c75-98be-5248124dd160 role:Primary
  disk:Inconsistent
  flex-107.dr.avaya.com role:Secondary
    replication:SyncTarget peer-disk:UpToDate done:64.12
  flex-108.dr.avaya.com role:Secondary
    peer-disk:Diskless peer-client:yes resync-suspended:dependency

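A sketch for checking whether the resync counters are moving at all, assuming the same satellite pod and resource name as in the status commands above (drbdsetup's --statistics flag adds the received/sent/out-of-sync counters to the status output):

k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdsetup status pvc-353143c5-e55d-4c75-98be-5248124dd160 --verbose --statistics
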
The pod using this PVC is deployed on the disk node (flex-106) that has the resource stuck in SyncTarget -

[root@flex-103 ~]# k get pods -o wide | grep alarming-db-service-pgo-repo-host-0
alarming-db-service-pgo-repo-host-0                               1/1     Running     0               6h8m    10.200.56.55     flex-106.dr.avaya.com   <none>           <none>
[root@flex-103 cust]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdsetup show pvc-353143c5-e55d-4c75-98be-5248124dd160 
resource "pvc-353143c5-e55d-4c75-98be-5248124dd160" {
    options {
        quorum              majority;
        on-no-quorum        io-error;
    }
    _this_host {
        node-id         0;
        volume 0 {
            device          minor 1009;
            disk            "/dev/vg_sds/pvc-353143c5-e55d-4c75-98be-5248124dd160_00000";
            meta-disk           internal;
            disk {
                rs-discard-granularity  262144; # bytes
            }
        }
    }
    connection {
        _peer_node_id 1;
        path {
            _this_host ipv4 10.129.185.106:7009;
            _remote_host ipv4 10.129.185.107:7009;
        }
        net {
            max-epoch-size      10000;
            cram-hmac-alg       "sha1";
            shared-secret       "HK+1vTym1caM9w4aR3uh";
            verify-alg          "crct10dif-pclmul";
            max-buffers         10000;
            _name               "flex-107.dr.avaya.com";
        }
    }
    connection {
        _peer_node_id 2;
        path {
            _this_host ipv4 10.129.185.106:7009;
            _remote_host ipv4 10.129.185.108:7009;
        }
        net {
            max-epoch-size      10000;
            cram-hmac-alg       "sha1";
            shared-secret       "HK+1vTym1caM9w4aR3uh";
            verify-alg          "crct10dif-pclmul";
            max-buffers         10000;
            _name               "flex-108.dr.avaya.com";
        }
        volume 0 {
            disk {
                bitmap              no;
            }
        }
    }
}
[root@flex-103 ~]# k get pods -n piraeus -o wide | grep ns
piraeus-op-piraeus-operator-ns-node-4vmfm                     2/2     Running   4 (7h35m ago)   9h      10.129.185.106   flex-106.dr.avaya.com   <none>           <none>
piraeus-op-piraeus-operator-ns-node-jrhwv                     2/2     Running   2 (168m ago)    9h      10.129.185.108   flex-108.dr.avaya.com   <none>           <none>
piraeus-op-piraeus-operator-ns-node-nhrxp                     2/2     Running   2 (171m ago)    9h      10.129.185.107   flex-107.dr.avaya.com   <none>           <none>

To avoid clutter, I have attached all logs related to this PVC here -

Attached are the DRBD state outputs, DRBD kernel logs, StorageClass definition, and linstor r l output files:

drbadm-status-verbose-on-replica-disk-node-107.log
drbdadm-cstate-disk-node-106.log
drbdadm-cstate-diskless.log
drbdadm-cstate-replica-disk-node-107.log
drbdadm-dstate-disk-node-106.log
drbdadm-dstate-diskless.log
drbdadm-dstate-replica-disk-node-107.log
drbdadm-show-resource-on-InUse.log
linstor-resource-list.log
node-associated-with-pod-using-this-pvc.log
pvc-description.log
sc-info.yaml.log
diskless-node-stuck-target-108.log
disk-node-stuck-target-primary-106.log
disk-node-stuck-target-secondary-107.log
drbadm-status-verbose-on-diskless-node-108.log
drbadm-status-verbose-on-InUse-node.log

bitemarcz commented 4 months ago

How do you get the resource name for a DRBD resource? It seems we are using RES: *, but every name we try is reported as not a valid resource name.

We also tried drbdadm down all, but that too returns no resource defined.
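A sketch of ways to list the resource names, assuming a Piraeus/LINSTOR setup like the one above, where the generated DRBD configuration lives inside the linstor-satellite container rather than in /etc/drbd.d on the host (which would also explain drbdadm on the host reporting that no resources are defined):

drbdsetup status                 # queries the kernel module directly and prints every configured resource by name
ls /var/lib/linstor.d/           # LINSTOR-generated *.res files named after the resources (path is an assumption; run inside the satellite container)
linstor resource list            # resource names as known to LINSTOR, run from the controller pod as shown earlier in this issue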