kvaps opened this issue 2 years ago
To reproduce:
#!/bin/sh
kubectl delete sc piraeus-ssd
for INSTANCE in $(seq 1 100); do
kubectl create -f- <<EOT
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc${INSTANCE}
  labels:
    app: "test"
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Block
  storageClassName: piraeus-ssd
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: my-pod${INSTANCE}
  labels:
    app: "test"
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - test
          topologyKey: "kubernetes.io/hostname"
  containers:
  - name: my-container
    image: alpine:3.14
    imagePullPolicy: IfNotPresent
    command:
    - sleep
    - infinity
    volumeDevices:
    - devicePath: /dev/xvda
      name: my-volume
  volumes:
  - name: my-volume
    persistentVolumeClaim:
      claimName: my-pvc${INSTANCE}
  terminationGracePeriodSeconds: 0
EOT
done
kubectl create -f- <<EOT
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: piraeus-ssd
parameters:
  autoPlace: "2"
  storagePool: lvm
provisioner: linstor.csi.linbit.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
EOT
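While the loop runs, provisioning progress can be followed with something like:
# watch PVC binding and the corresponding LINSTOR resources (label and names as in the manifests above)
kubectl get pvc -l app=test
linstor resource list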
Or even:
#!/bin/sh
kubectl delete sc piraeus-ssd
for INSTANCE in $(seq 1 100); do
kubectl create -f- <<EOT
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc${INSTANCE}
  labels:
    app: "test"
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Block
  storageClassName: piraeus-ssd
  resources:
    requests:
      storage: 10Gi
EOT
done
kubectl create -f- <<EOT
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: piraeus-ssd
parameters:
  autoPlace: "2"
  storagePool: lvm
provisioner: linstor.csi.linbit.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOT
This should be enough to reproduce it.
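Afterwards, the test objects can be cleaned up by their label, e.g.:
# remove the test pods, PVCs and the storage class created by the scripts above
kubectl delete pod,pvc -l app=test
kubectl delete sc piraeus-ssd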
Another problem with the different device:
linstor r l -r pvc-fbdd98f5-492a-4971-a72f-998bbe95d027
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-fbdd98f5-492a-4971-a72f-998bbe95d027 ┊ hf-kubevirt-01 ┊ 7013 ┊ ┊ ┊ Unknown ┊ ┊
┊ pvc-fbdd98f5-492a-4971-a72f-998bbe95d027 ┊ hf-kubevirt-02 ┊ 7013 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
┊ pvc-fbdd98f5-492a-4971-a72f-998bbe95d027 ┊ hf-kubevirt-03 ┊ 7013 ┊ Unused ┊ ┊ Unknown ┊ 2022-01-27 13:34:27 ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
13:29:58.763 [grizzly-http-server-1] INFO LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' created.
13:31:59.226 [grizzly-http-server-1] INFO LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' created.
13:32:25.466 [MainWorkerPool-1] ERROR LINSTOR/Controller - SYSTEM - Resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' on node 'hf-kubevirt-02' not found. [Report number 61F298C8-00000-000905]
13:32:49.266 [MainWorkerPool-1] WARN LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' of node 'hf-kubevirt-02' added for retry.
13:34:15.677 [TaskScheduleService] ERROR LINSTOR/Controller - SYSTEM - The resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' was already deployed on 3 nodes: 'hf-kubevirt-01', 'hf-kubevirt-02', 'hf-kubevirt-03'. The resource would have to be deleted from nodes to reach the placement count. [Report number 61F298C8-00000-001356]
13:34:26.769 [TaskScheduleService] ERROR LINSTOR/Controller - SYSTEM - The resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' was already deployed on 3 nodes: 'hf-kubevirt-01', 'hf-kubevirt-02', 'hf-kubevirt-03'. The resource would have to be deleted from nodes to reach the placement count. [Report number 61F298C8-00000-001399]
13:34:40.911 [TaskScheduleService] ERROR LINSTOR/Controller - SYSTEM - The resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' was already deployed on 3 nodes: 'hf-kubevirt-01', 'hf-kubevirt-02', 'hf-kubevirt-03'. The resource would have to be deleted from nodes to reach the placement count. [Report number 61F298C8-00000-001464]
13:34:43.725 [MainWorkerPool-1] WARN LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' of node 'hf-kubevirt-01' added for retry.
13:34:58.850 [TaskScheduleService] ERROR LINSTOR/Controller - SYSTEM - The resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' was already deployed on 3 nodes: 'hf-kubevirt-01', 'hf-kubevirt-02', 'hf-kubevirt-03'. The resource would have to be deleted from nodes to reach the placement count. [Report number 61F298C8-00000-001527]
13:35:30.208 [TaskScheduleService] ERROR LINSTOR/Controller - SYSTEM - The resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' was already deployed on 3 nodes: 'hf-kubevirt-01', 'hf-kubevirt-02', 'hf-kubevirt-03'. The resource would have to be deleted from nodes to reach the placement count. [Report number 61F298C8-00000-001592]
13:35:38.179 [TaskScheduleService] ERROR LINSTOR/Controller - SYSTEM - The resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' was already deployed on 3 nodes: 'hf-kubevirt-01', 'hf-kubevirt-02', 'hf-kubevirt-03'. The resource would have to be deleted from nodes to reach the placement count. [Report number 61F298C8-00000-001646]
13:35:41.782 [MainWorkerPool-1] ERROR LINSTOR/Controller - SYSTEM - The resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' was already deployed on 3 nodes: 'hf-kubevirt-01', 'hf-kubevirt-02', 'hf-kubevirt-03'. The resource would have to be deleted from nodes to reach the placement count. [Report number 61F298C8-00000-001712]
13:35:55.992 [TaskScheduleService] ERROR LINSTOR/Controller - SYSTEM - The resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' was already deployed on 3 nodes: 'hf-kubevirt-01', 'hf-kubevirt-02', 'hf-kubevirt-03'. The resource would have to be deleted from nodes to reach the placement count. [Report number 61F298C8-00000-001830]
13:36:42.806 [TaskScheduleService] ERROR LINSTOR/Controller - SYSTEM - The resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' was already deployed on 3 nodes: 'hf-kubevirt-01', 'hf-kubevirt-02', 'hf-kubevirt-03'. The resource would have to be deleted from nodes to reach the placement count. [Report number 61F298C8-00000-002025]
13:36:49.593 [MainWorkerPool-1] ERROR LINSTOR/Controller - SYSTEM - The resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' was already deployed on 3 nodes: 'hf-kubevirt-01', 'hf-kubevirt-02', 'hf-kubevirt-03'. The resource would have to be deleted from nodes to reach the placement count. [Report number 61F298C8-00000-002052]
13:37:19.282 [MainWorkerPool-1] ERROR LINSTOR/Controller - SYSTEM - (Node: 'hf-kubevirt-02') Generated resource file for resource 'pvc-fbdd98f5-492a-4971-a72f-998bbe95d027' is invalid. [Report number 61F298C8-00000-002118]
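The full stack trace behind each of these report numbers can be pulled from the controller, for example:
# show the full error report for one of the report numbers logged above
linstor error-reports show 61F298C8-00000-002118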
# diff /var/lib/linstor.d/pvc-fbdd98f5-492a-4971-a72f-998bbe95d027.res /var/lib/linstor.d/pvc-fbdd98f5-492a-4971-a72f-998bbe95d027.res_tmp
10c10,11
< quorum off;
---
> on-no-quorum io-error;
> quorum majority;
16c17
< shared-secret "n8fEwi3XXZRtMKzhsoWn";
---
> shared-secret "9EgIXgOal126vPZmJCFT";
29c30
< device minor 1020;
---
> device minor 1026;
31a33,74
> }
>
> on hf-kubevirt-01
> {
> volume 0
> {
> disk /dev/drbd/this/is/not/used;
> disk
> {
> discard-zeroes-if-aligned yes;
> }
> meta-disk internal;
> device minor 1026;
> }
> node-id 1;
> }
>
> on hf-kubevirt-03
> {
> volume 0
> {
> disk none;
> disk
> {
> discard-zeroes-if-aligned yes;
> }
> meta-disk internal;
> device minor 1026;
> }
> node-id 2;
> }
>
> connection
> {
> host hf-kubevirt-02 address ipv4 192.168.242.38:7013;
> host hf-kubevirt-01 address ipv4 192.168.242.35:7013;
> }
>
> connection
> {
> host hf-kubevirt-02 address ipv4 192.168.242.38:7013;
> host hf-kubevirt-03 address ipv4 192.168.242.37:7013;
WAIDW? (What am I doing wrong?)
Just reproduced this bug on a clean cluster:
https://asciinema.org/a/RKgx4fV1BdVTkcAYJXZ7GU0AX?t=80
No errors on the controller or the satellite, only in dmesg:
[Thu Dec 22 13:04:02 2022] drbd pvc-8e7c653f-7458-4d0a-a373-aec594215561: State change failed: Can not start OV/resync since it is already active
[Thu Dec 22 13:04:02 2022] drbd pvc-8e7c653f-7458-4d0a-a373-aec594215561/0 drbd1000 gpnvkc-w3: Failed: resync-susp( connection dependency -> no )
[Thu Dec 22 13:04:02 2022] drbd pvc-8e7c653f-7458-4d0a-a373-aec594215561/0 drbd1000 gpnvkc-w1: Failed: repl( SyncTarget -> WFBitMapT )
[Thu Dec 22 13:04:02 2022] drbd pvc-8e7c653f-7458-4d0a-a373-aec594215561/0 drbd1000 gpnvkc-s2: Failed: resync-susp( connection dependency -> no )
linstor 1.20.0; drbd 9.2.0
The problems in this issue look like they have different underlying causes to me.
https://github.com/LINBIT/linstor-server/issues/268#issue-1115464857 (initial issue) - looks like LINSTOR is failing to promote. That's from an old version; it may be fixed already.
https://github.com/LINBIT/linstor-server/issues/268#issuecomment-1023226704 ("Another problem with the different device") - looks like something at the LINSTOR level too. It may also be fixed already.
https://github.com/LINBIT/linstor-server/issues/268#issuecomment-1362824963 ("Just reproduced this bug on a clean cluster") - a stuck resync at the DRBD level. The part you quoted is a recoverable problem: a state change fails and is postponed ("...postponing this until current resync finished"). The reason your device is stuck in the Inconsistent state is that gpnvkc-w2 is a SyncTarget towards gpnvkc-w1 and isn't making any progress. Not sure why. Try DRBD 9.1.12. If you can reproduce this reliably then we might be able to fix it.
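Whether the SyncTarget is actually making progress can be checked on the affected node with the DRBD tools, for example (resource name taken from the dmesg output above):
# per-peer replication state and resync progress of the stuck resource
drbdadm status pvc-8e7c653f-7458-4d0a-a373-aec594215561
# same, but with detailed counters (done percentage, out-of-sync blocks)
drbdsetup status --verbose --statistics pvc-8e7c653f-7458-4d0a-a373-aec594215561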
Hi, I just faced this bug and I think I know how to reproduce it.
I have LINSTOR on three nodes, with 2x NVMe in each of them.
OS: Ubuntu 20.04.3 LTS
Kernel: 5.13.0-27-generic
DRBD version: 9.1.4 (api:2/proto:110-121)
LINSTOR version: 1.17.0
LINSTOR was installed using piraeus-operator.
I created three LVM pools:
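(Registering an LVM volume group as a LINSTOR storage pool generally looks something like the following; the node and volume-group names here are only placeholders.)
# register an existing volume group as a storage pool named "lvm" (the name used by storagePool: lvm above)
linstor storage-pool create lvm <node-name> lvm <vg-name>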
And 51 pods with 10 GB volumes on each of them:
48 volumes were provisioned without problems, but three of them are stuck in the Inconsistent state:
I fixed these devices by running:
The linstor-controller log is full of these messages, even for other successfully created resources:
linstor err show 61F188C3-00000-000008
csi-controller log:
csi-provisioner
linstor-csi-plugin
linstor-satellites log:
hf-kubevirt-01
hf-kubevirt-02
hf-kubevirt-03
dmesg from nodes:
hf-kubevirt-01
hf-kubevirt-02
hf-kubevirt-03