LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0
984 stars 76 forks source link

Can't remove Unknown device due to unknown exception #353

Open kvaps opened 1 year ago

kvaps commented 1 year ago

I have weird device which I can't remove:

root@kube-dev-hv-3:/# linstor r l -r pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node          ┊ Port ┊ Usage  ┊ Conns                     ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 ┊ kube-dev-hv-0 ┊ 7015 ┊ InUse  ┊ Connecting(kube-dev-hv-3) ┊ Diskless ┊ 2023-04-24 09:12:09 ┊
┊ pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 ┊ kube-dev-hv-1 ┊ 7015 ┊ Unused ┊ Ok                        ┊ UpToDate ┊ 2023-04-24 08:58:16 ┊
┊ pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 ┊ kube-dev-hv-3 ┊ 7015 ┊        ┊                           ┊  Unknown ┊ 2023-04-24 08:58:20 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
root@kube-dev-hv-3:/# linstor r d kube-dev-hv-3 pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
INFO:
    Resource-definition property 'DrbdOptions/Resource/quorum' was removed as there are not enough resources for quorum
INFO:
    Resource-definition property 'DrbdOptions/Resource/on-no-quorum' was removed as there are not enough resources for quorum
SUCCESS:
Description:
    Node: kube-dev-hv-3, Resource: pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 preparing for deletion.
Details:
    Node: kube-dev-hv-3, Resource: pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 UUID is: 2629b9df-1d12-4299-9e5c-dcafe2d01fdb
SUCCESS:
    Preparing deletion of resource on 'kube-dev-hv-0'
ERROR:
Description:
    (Node: 'kube-dev-hv-1') Shutdown of the DRBD resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 failed
Cause:
    The external command for stopping the DRBD resource failed
Correction:
    - Check whether the required software is installed
    - Check whether the application's search path includes the location
      of the external software
    - Check whether the application has execute permission for the external command
Show reports:
    linstor error-reports show 64478C81-22600-000184
SUCCESS:
    Preparing deletion of resource on 'kube-dev-hv-3'
ERROR:
Description:
    Deletion of resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2' on node 'kube-dev-hv-3' failed due to an unknown exception.
Details:
    Node: kube-dev-hv-3, Resource: pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
Show reports:
    linstor error-reports show 64478CA0-00000-000019

Device is missing in LV

root@kube-dev-hv-3:/# linstor error-reports show 64478CA0-00000-000019
ERROR REPORT 64478CA0-00000-000019

============================================================

Application:                        LINBIT�� LINSTOR
Module:                             Controller
Version:                            1.21.0
Build ID:                           b44bb8d41f264ac1089d9a0a1c540d3cc703d7e8
Build time:                         2023-04-04T10:11:03+00:00
Error time:                         2023-04-25 09:48:33
Node:                               linstor-controller-79cd4f4b5d-n4h2c
Peer:                               RestClient(192.168.100.13; 'PythonLinstor/1.17.0 (API1.0.4): Client 1.17.0')

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         DelayedApiRcException
Class canonical name:               com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils.DelayedApiRcException
Generated at:                       Method 'lambda$mergeExtractingApiRcExceptions$4', Source file 'CtrlResponseUtils.java', Line #126

Error message:                      Exceptions have been converted to responses

Error context:
    Deletion of resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2' on node 'kube-dev-hv-3' failed due to an unknown exception.

Asynchronous stage backtrace:
    (Node: 'kube-dev-hv-1') Shutdown of the DRBD resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 failed

    Error has been observed at the following site(s):
        |_ checkpoint ? Prepare resource delete
        |_ checkpoint ? Activating resource if necessary before deletion
    Stack trace:

Call backtrace:

    Method                                   Native Class:Line number
    lambda$mergeExtractingApiRcExceptions$4  N      com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils:126

Suppressed exception 1 of 2:
===============
Category:                           RuntimeException
Class name:                         ApiRcException
Class canonical name:               com.linbit.linstor.core.apicallhandler.response.ApiRcException
Generated at:                       Method 'handleAnswer', Source file 'CommonMessageProcessor.java', Line #337

Error message:                      (Node: 'kube-dev-hv-1') Shutdown of the DRBD resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 failed

Error context:
    Deletion of resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2' on node 'kube-dev-hv-3' failed due to an unknown exception.

ApiRcException entries:
Nr: 1
  Message: (Node: 'kube-dev-hv-1') Shutdown of the DRBD resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 failed
  Cause: The external command for stopping the DRBD resource failed
  Correction: - Check whether the required software is installed
- Check whether the application's search path includes the location
  of the external software
- Check whether the application has execute permission for the external command

Call backtrace:

    Method                                   Native Class:Line number
    handleAnswer                             N      com.linbit.linstor.proto.CommonMessageProcessor:337
    handleDataMessage                        N      com.linbit.linstor.proto.CommonMessageProcessor:284
    doProcessInOrderMessage                  N      com.linbit.linstor.proto.CommonMessageProcessor:235
    lambda$doProcessMessage$3                N      com.linbit.linstor.proto.CommonMessageProcessor:220
    subscribe                                N      reactor.core.publisher.FluxDefer:46
    subscribe                                N      reactor.core.publisher.Flux:8357
    onNext                                   N      reactor.core.publisher.FluxFlatMap$FlatMapMain:418
    drainAsync                               N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:414
    drain                                    N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:679
    onNext                                   N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:243
    drainFused                               N      reactor.core.publisher.UnicastProcessor:286
    drain                                    N      reactor.core.publisher.UnicastProcessor:329
    onNext                                   N      reactor.core.publisher.UnicastProcessor:408
    next                                     N      reactor.core.publisher.FluxCreate$IgnoreSink:618
    drainLoop                                N      reactor.core.publisher.FluxCreate$SerializedSink:248
    next                                     N      reactor.core.publisher.FluxCreate$SerializedSink:168
    processInOrder                           N      com.linbit.linstor.netcom.TcpConnectorPeer:388
    doProcessMessage                         N      com.linbit.linstor.proto.CommonMessageProcessor:218
    lambda$processMessage$2                  N      com.linbit.linstor.proto.CommonMessageProcessor:164
    onNext                                   N      reactor.core.publisher.FluxPeek$PeekSubscriber:177
    runAsync                                 N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:439
    run                                      N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:526
    call                                     N      reactor.core.scheduler.WorkerTask:84
    call                                     N      reactor.core.scheduler.WorkerTask:37
    run                                      N      java.util.concurrent.FutureTask:264
    run                                      N      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
    runWorker                                N      java.util.concurrent.ThreadPoolExecutor:1128
    run                                      N      java.util.concurrent.ThreadPoolExecutor$Worker:628
    run                                      N      java.lang.Thread:829

Suppressed exception 2 of 2:
===============
Category:                           RuntimeException
Class name:                         OnAssemblyException
Class canonical name:               reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at:                       Method 'lambda$mergeExtractingApiRcExceptions$4', Source file 'CtrlResponseUtils.java', Line #126

Error message:
Error has been observed at the following site(s):
    |_ checkpoint ��� Prepare resource delete
    |_ checkpoint ��� Activating resource if necessary before deletion
Stack trace:

Error context:
    Deletion of resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2' on node 'kube-dev-hv-3' failed due to an unknown exception.

Call backtrace:

    Method                                   Native Class:Line number
    lambda$mergeExtractingApiRcExceptions$4  N      com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils:126
    subscribe                                N      reactor.core.publisher.FluxDefer:46
    subscribe                                N      reactor.core.publisher.Flux:8357
    onComplete                               N      reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber:207
    onComplete                               N      reactor.core.publisher.FluxMap$MapSubscriber:136
    checkTerminated                          N      reactor.core.publisher.FluxFlatMap$FlatMapMain:838
    drainLoop                                N      reactor.core.publisher.FluxFlatMap$FlatMapMain:600
    innerComplete                            N      reactor.core.publisher.FluxFlatMap$FlatMapMain:909
    onComplete                               N      reactor.core.publisher.FluxFlatMap$FlatMapInner:1013
    onComplete                               N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2016
    onComplete                               N      reactor.core.publisher.FluxMap$MapSubscriber:136
    onComplete                               N      reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber:191
    onComplete                               N      reactor.core.publisher.MonoIgnoreElements$IgnoreElementsSubscriber:81
    onComplete                               N      reactor.core.publisher.FluxPeek$PeekSubscriber:252
    onComplete                               N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2016
    onComplete                               N      reactor.core.publisher.FluxMap$MapSubscriber:136
    onComplete                               N      reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber:78
    complete                                 N      reactor.core.publisher.FluxCreate$BaseSink:438
    drain                                    N      reactor.core.publisher.FluxCreate$BufferAsyncSink:784
    complete                                 N      reactor.core.publisher.FluxCreate$BufferAsyncSink:732
    drainLoop                                N      reactor.core.publisher.FluxCreate$SerializedSink:239
    drain                                    N      reactor.core.publisher.FluxCreate$SerializedSink:205
    complete                                 N      reactor.core.publisher.FluxCreate$SerializedSink:196
    apiCallComplete                          N      com.linbit.linstor.netcom.TcpConnectorPeer:470
    handleComplete                           N      com.linbit.linstor.proto.CommonMessageProcessor:363
    handleDataMessage                        N      com.linbit.linstor.proto.CommonMessageProcessor:287
    doProcessInOrderMessage                  N      com.linbit.linstor.proto.CommonMessageProcessor:235
    lambda$doProcessMessage$3                N      com.linbit.linstor.proto.CommonMessageProcessor:220
    subscribe                                N      reactor.core.publisher.FluxDefer:46
    subscribe                                N      reactor.core.publisher.Flux:8357
    onNext                                   N      reactor.core.publisher.FluxFlatMap$FlatMapMain:418
    drainAsync                               N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:414
    drain                                    N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:679
    onNext                                   N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:243
    drainFused                               N      reactor.core.publisher.UnicastProcessor:286
    drain                                    N      reactor.core.publisher.UnicastProcessor:329
    onNext                                   N      reactor.core.publisher.UnicastProcessor:408
    next                                     N      reactor.core.publisher.FluxCreate$IgnoreSink:618
    drainLoop                                N      reactor.core.publisher.FluxCreate$SerializedSink:248
    next                                     N      reactor.core.publisher.FluxCreate$SerializedSink:168
    processInOrder                           N      com.linbit.linstor.netcom.TcpConnectorPeer:388
    doProcessMessage                         N      com.linbit.linstor.proto.CommonMessageProcessor:218
    lambda$processMessage$2                  N      com.linbit.linstor.proto.CommonMessageProcessor:164
    onNext                                   N      reactor.core.publisher.FluxPeek$PeekSubscriber:177
    runAsync                                 N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:439
    run                                      N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:526
    call                                     N      reactor.core.scheduler.WorkerTask:84
    call                                     N      reactor.core.scheduler.WorkerTask:37
    run                                      N      java.util.concurrent.FutureTask:264
    run                                      N      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
    runWorker                                N      java.util.concurrent.ThreadPoolExecutor:1128
    run                                      N      java.util.concurrent.ThreadPoolExecutor$Worker:628
    run                                      N      java.lang.Thread:829

END OF ERROR REPORT.

The wierd thing is that the device on kube-dev-hv-0 is in DfltDisklessStorPool but shown as unintentional diskless:

root@kube-dev-hv-0:/# linstor v l -r pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Node          | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    |  Allocated | InUse  |    State |
|====================================================================================================================================================|
| kube-dev-hv-0 | pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | DfltDisklessStorPool |     0 |    1022 | /dev/drbd1022 |            | InUse  | Diskless |
| kube-dev-hv-1 | pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | thindata             |     0 |    1022 | None          | 208.04 GiB | Unused | UpToDate |
| kube-dev-hv-3 | pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | DfltDisklessStorPool |     0 |    1022 | None          |            |        |  Unknown |
+----------------------------------------------------------------------------------------------------------------------------------------------------+

It is really missing in lvs output:

root@kube-dev-hv-0:/# lvs | grep pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
kvaps commented 1 year ago

I fixed this issue by manually editing db, using solution from https://github.com/LINBIT/linstor-server/issues/348#issuecomment-1507646983

I made backup:

kubectl get crds | grep -o ".*.internal.linstor.linbit.com" |   xargs kubectl get crds -ojson > crds.json
kubectl get crds | grep -o ".*.internal.linstor.linbit.com" | xargs -i{} sh -xc "kubectl get {} -ojson > {}.json

Collected all devices with weird flags:

cat resources.internal.linstor.linbit.com.json  | jq '.items[] | select(.spec.resource_flags!=0 and .spec.resource_flags!=260 and .spec.resource_flags!=388) | "\(.spec.resource_name) \(.spec.node_name) \(.spec.resource_flags)"' -r > list.txt

the output was:

PVC-9B46A955-EDD9-4E08-9C3A-4D2849F9AFE2 KUBE-DEV-HV-1 262144
PVC-9B46A955-EDD9-4E08-9C3A-4D2849F9AFE2 KUBE-DEV-HV-3 264548

Then reseted them to 260 diskless+drbd_diskless

while read res node flags; do cat resources.internal.linstor.linbit.com.json  | jq '.items[] | select(.spec.resource_name==$res and .spec.node_name==$node) | .spec.resource_flags=260' --arg res $res --arg node $node -r ; done < list.txt > fix.json

The updated resources:

{
  "apiVersion": "internal.linstor.linbit.com/v1-15-0",
  "kind": "Resources",
  "metadata": {
    "creationTimestamp": "2023-04-24T08:58:14Z",
    "generation": 3,
    "name": "b819a9c8efb44e21117511f96625296df9e633c2147e86b43c022ee60e984164",
    "resourceVersion": "138483524",
    "uid": "cfc0af48-59e8-4300-a30e-0b84b0667fe3"
  },
  "spec": {
    "create_timestamp": 1682326696079,
    "node_name": "KUBE-DEV-HV-1",
    "resource_flags": 260,
    "resource_name": "PVC-9B46A955-EDD9-4E08-9C3A-4D2849F9AFE2",
    "snapshot_name": "",
    "uuid": "e2a2b3fe-d2e4-43c2-8ac8-12bfcd2d85f2"
  }
}
{
  "apiVersion": "internal.linstor.linbit.com/v1-15-0",
  "kind": "Resources",
  "metadata": {
    "creationTimestamp": "2023-04-24T08:58:18Z",
    "generation": 7,
    "name": "d08533f74aa6c3d02f647feebdef0d3b691ae99385da62ce704425a14fd376e6",
    "resourceVersion": "138574025",
    "uid": "2febb78b-24e6-410e-84c6-db9eedfa92c7"
  },
  "spec": {
    "create_timestamp": 1682326700754,
    "node_name": "KUBE-DEV-HV-3",
    "resource_flags": 260,
    "resource_name": "PVC-9B46A955-EDD9-4E08-9C3A-4D2849F9AFE2",
    "snapshot_name": "",
    "uuid": "2629b9df-1d12-4299-9e5c-dcafe2d01fdb"
  }
}

After applying fix.json and starting linstor-controller they turned into green:

# linstor r l -r pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
+-------------------------------------------------------------------------------------------------------------------+
| ResourceName                             | Node          | Port | Usage  | Conns |    State | CreatedOn           |
|===================================================================================================================|
| pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | kube-dev-hv-0 | 7015 | InUse  | Ok    | Diskless | 2023-04-24 09:12:09 |
| pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | kube-dev-hv-1 | 7015 | Unused | Ok    | UpToDate | 2023-04-24 08:58:16 |
| pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | kube-dev-hv-3 | 7015 | Unused | Ok    | Diskless | 2023-04-24 08:58:20 |
+-------------------------------------------------------------------------------------------------------------------+

cc @ghernadi

ghernadi commented 1 year ago

Good that everything is green again (although direct DB manipulations are usually rather scary...).

However, I'd be interested in the error report 64478C81-22600-000184 to figure out why DRBD was not able to down the resource properly. Anything else suspicious in dmesg or other logs?

The preferred way of fixing such an issue is obviously without manipulating the database directly. That means to help Linstor to get rid of the resource properly. Since the DRBD_DELETE flag was set in both resources, I assume something went wrong there.

Of course the fact that one of the resources was "in the middle of a toggle disk" does not help in properly cleaning up such a state. The interesting question is (as always) how we got into such a state.

kvaps commented 1 year ago

Unfortunately I have no error report anymore. Some logs were collected from the node:

drbd.log

I see some NetworkFailure's, I guess this can be reason