Open Dipanshu-Sehjal opened 2 years ago
What is the output of drbdsetup status
on the Satellite pod for the node with the Consistent resource in this situation? It may be that linstor has missed the state change, even though it is fine at the DRBD level.
Sorry, I don't have the setup now anymore. I will add logs if I see it again. Is there anything else that I should collect next time?
From time to time my set of HA testing on DRBD has shown that some DRBD resources go into the "Consistent" state when a node carrying them is powered off and powered on after a prolonged period of time.
Setup 1 -
This happens 7/10 times on the following setup -
3 nodes with 1 disk node and 2 diskless nodes. Replication and auto-eviction are turned off.
Test case procedures -
linstor resource list
shows some resources inConsistent
state while theirDiskless
counterpart is Usage =InUse
There are only 2 replicas associated with a single resource - Diskless and Disk.
As a workaround,
kubectl exec -n piraeus <ns_pod-name> -c linstor-satellite -- drbdsetup disconnect <Consistent-state-resource-name> <node-id of diskless peer>
kubectl exec -n piraeus <ns_pod-name> -c linstor-satellite -- drbdsetup connect <Consistent-state-resource-name> <node-id of diskless peer>
This brings back the resource into UpToDate state. BUT sometimes this workaround puts the resource into
Outdated
and then it becomes a totally different problem to solve which I don't know how to recover from when this is the only physical replica available on a cluster and the Diskless resource is connected to it.Setup 2 -
This issue happens almost 2/10 times on a 3 node cluster with 2 disk nodes and 1 diskless node. Replication is turned on and auto-eviction is turned off.
Test case procedures -
linstor resource list
shows some resources inConsistent
state. This one is easy to get by because replication is turned on so the other replica becomes Primary and starts to serve data. So, I can usedrbdsetup disconnect and connect
and even delete this resource when it goes into anOutdated
state.However, it is not straightforward if the application replicates data by itself and does not use DRBD for replication. In other words, DRBD replication is turned off for this resource which goes back to a situation similar to Setup 1
For instance on a node cluster with 1 disk and 2 diskless nodes -
linstor resource list Please see below for a complete set of resources. pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac is marked as Consistent state
k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor r l +--------------------------------------------------------------------------------------------------------------------------------+ | ResourceName | Node | Port | Usage | Conns | State | CreatedOn | |================================================================================================================================| | pvc-5bac2d82-9e42-4e0d-b828-3c598f2d8795 | flex188-126.dr.avaya.com | 7009 | Unused | Ok | UpToDate | 2022-03-22 04:58:52 | | pvc-5bac2d82-9e42-4e0d-b828-3c598f2d8795 | flex188-128.dr.avaya.com | 7009 | InUse | Ok | Diskless | 2022-03-22 04:58:53 | | pvc-16c0b34b-bed3-4219-8b9f-415e8d1734fb | flex188-126.dr.avaya.com | 7005 | InUse | Ok | UpToDate | 2022-03-22 04:56:35 | | pvc-19dc5cea-733a-41b3-bd83-a2c4ea5012da | flex188-126.dr.avaya.com | 7004 | InUse | Ok | UpToDate | 2022-03-22 04:56:31 | | pvc-32e7a7bf-f0c2-4bca-941b-102780fcf7bd | flex188-126.dr.avaya.com | 7003 | Unused | Ok | UpToDate | 2022-03-22 05:49:52 | | pvc-32e7a7bf-f0c2-4bca-941b-102780fcf7bd | flex188-128.dr.avaya.com | 7003 | InUse | Ok | Diskless | 2022-03-22 05:49:54 | | pvc-367dca54-39c8-415e-9633-0295730bbd44 | flex188-126.dr.avaya.com | 7002 | InUse | Ok | UpToDate | 2022-03-21 04:47:47 | | pvc-69090137-263d-4cca-b402-02fd5f377041 | flex188-126.dr.avaya.com | 7008 | Unused | Ok | UpToDate | 2022-03-22 04:56:48 | | pvc-69090137-263d-4cca-b402-02fd5f377041 | flex188-128.dr.avaya.com | 7008 | InUse | Ok | Diskless | 2022-03-22 04:56:52 | | pvc-a36f02aa-da1a-4eaf-b797-b035dfcd5a22 | flex188-126.dr.avaya.com | 7006 | Unused | Ok | UpToDate | 2022-03-22 04:56:36 | | pvc-a36f02aa-da1a-4eaf-b797-b035dfcd5a22 | flex188-127.dr.avaya.com | 7006 | InUse | Ok | Diskless | 2022-03-22 04:56:38 | | pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac | flex188-126.dr.avaya.com | 7012 | Unused | Ok | Consistent | 2022-03-22 06:22:05 | | pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac | flex188-128.dr.avaya.com | 7012 | InUse | Ok | Diskless | 2022-03-22 16:20:24 | | pvc-c806c751-1efd-405b-bf63-22c7ffc53ede | flex188-126.dr.avaya.com | 7007 | InUse | Ok | UpToDate | 2022-03-22 04:56:48 | | pvc-cf52d305-926d-40b3-95b8-72c07b623d19 | flex188-126.dr.avaya.com | 7001 | InUse | Ok | UpToDate | 2022-03-21 04:47:46 | | pvc-d39187c2-d560-42db-8dc3-c6e57505ae72 | flex188-126.dr.avaya.com | 7000 | Unused | Ok | UpToDate | 2022-03-21 04:07:50 | | pvc-d39187c2-d560-42db-8dc3-c6e57505ae72 | flex188-127.dr.avaya.com | 7000 | InUse | Ok | Diskless | 2022-03-21 04:07:53 | +--------------------------------------------------------------------------------------------------------------------------------+
k exec -n piraeus piraeus-op-piraeus-operator-ns-node-cv9qs -c linstor-satellite -- drbdadm dstate pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac
Consistent/Diskless
k exec -n piraeus piraeus-op-piraeus-operator-ns-node-cv9qs -c linstor-satellite -- drbdadm cstate pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac
Connected
k exec -n piraeus piraeus-op-piraeus-operator-ns-node-cv9qs -c linstor-satellite -- drbdadm dump pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac
Software version -
k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor --version
linstor 1.13.0; GIT-hash: 840cf57c75c166659509e22447b2c0ca6377ee6dk exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- drbdadm -V
DRBDADM_BUILDTAG=GIT-hash:\ 087ee6b4961ca154d76e4211223b03149373bed8\ build\ by\ @buildsystem\,\ 2022-01-28\ 12:19:33 DRBDADM_API_VERSION=2 DRBD_KERNEL_VERSION_CODE=0x090106 DRBD_KERNEL_VERSION=9.1.6 DRBDADM_VERSION_CODE=0x091402 DRBDADM_VERSION=9.20.2piraeus-operator-1.8.0
uname -a
4.18.0-348.20.1.el8_5.x86_64 #1 SMP Tue Mar 8 12:56:54 EST 2022 x86_64 x86_64 x86_64 GNU/Linux