LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

Unknown exception #243

Closed zheliazkov closed 2 years ago

zheliazkov commented 3 years ago

Hello,

I'm running a 3-node cluster with LINSTOR and wanted to remove one of the nodes, then add the same node back with new disks in a brand-new LVM volume group.

These are the steps I took:

  1. Removed the node
  2. Added new disks and recreated the volume group on the node
  3. Added the node again by running linstor n create THE-SAME-NODE-NAME
  4. Mapped the same LINSTOR storage pool to the LVM volume group by running linstor sp c lvm NODE-NAME-AS-BEFORE THE_SAME_OLD_LINSTOR_SP_NAME LVM_VG_NAME

Currently I can see all the DRBD resources on the node, but none of the linstor client commands shows that they exist as diskless.

When I try any of the following, I have no success (unknown exceptions are thrown, see below):

  1. Creating one of the resources on the node - linstor r c NODE-NAME-AS-BEFORE RESOURCE-NAME
  2. Creating one of the resources on the node as diskless - linstor r c NODE-NAME-AS-BEFORE RESOURCE-NAME --drbd-diskless
  3. Toggling the DRBD resource to disk-backed - linstor resource toggle-disk NODE-NAME-AS-BEFORE RESOURCE-NAME --storage-pool STORAGE-POOL-NAME-AS-BEFORE

The exception is as follows (masked some values as ###):

ERROR REPORT 60E6DA88-00000-000014

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.11.1
Build ID:                           fe95a94d86c66c6c9846a3cf579a1a776f95d3f4
Build time:                         2021-01-13T09:48:24+00:00
Error time:                         2021-07-14 16:56:29
Node:                               ###
Peer:                               RestClient(192.168.25.3; 'PythonLinstor/1.6.0 (API1.0.4)')

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         StringIndexOutOfBoundsException
Class canonical name:               java.lang.StringIndexOutOfBoundsException
Generated at:                       Method 'setLength', Source file 'AbstractStringBuilder.java', Line #207

Error message:                      String index out of range: -2

Error context:
    Registration of resource '###' on node(s) '###' failed due to an unknown exception.

Asynchronous stage backtrace:

    Error has been observed at the following site(s):
        |_ checkpoint ⇢ Create resource
    Stack trace:

Call backtrace:

    Method                                   Native Class:Line number
    setLength                                N      java.lang.AbstractStringBuilder:207

Suppressed exception 1 of 1:
===============
Category:                           RuntimeException
Class name:                         OnAssemblyException
Class canonical name:               reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at:                       Method 'setLength', Source file 'AbstractStringBuilder.java', Line #207

Error message:                      
Error has been observed at the following site(s):
        |_ checkpoint ⇢ Create resource
Stack trace:

Error context:
    Registration of resource '###' on node(s) '###' failed due to an unknown exception.

Call backtrace:

    Method                                   Native Class:Line number
    setLength                                N      java.lang.AbstractStringBuilder:207
    setLength                                N      java.lang.StringBuilder:76
    select                                   N      com.linbit.linstor.core.apicallhandler.controller.autoplacer.Selector:185
    autoPlace                                N      com.linbit.linstor.core.apicallhandler.controller.autoplacer.Autoplacer:106
    manage                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlRscAutoRePlaceRscHelper:216
    manage                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlRscAutoHelper:99
    manage                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlRscAutoHelper:89
    createResourceInTransaction              N      com.linbit.linstor.core.apicallhandler.controller.CtrlRscCrtApiCallHandler:192
    lambda$null$2                            N      com.linbit.linstor.core.apicallhandler.controller.CtrlRscCrtApiCallHandler:143
    doInScope                                N      com.linbit.linstor.core.apicallhandler.ScopeRunner:147
    lambda$null$0                            N      com.linbit.linstor.core.apicallhandler.ScopeRunner:75
    call                                     N      reactor.core.publisher.MonoCallable:91
    trySubscribeScalarMap                    N      reactor.core.publisher.FluxFlatMap:126
    subscribeOrReturn                        N      reactor.core.publisher.MonoFlatMapMany:49
    subscribe                                N      reactor.core.publisher.Flux:8343
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
    request                                  N      reactor.core.publisher.Operators$ScalarSubscription:2344
    onSubscribe                              N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
    subscribe                                N      reactor.core.publisher.MonoCurrentContext:35
    subscribe                                N      reactor.core.publisher.Flux:8357
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
    request                                  N      reactor.core.publisher.Operators$ScalarSubscription:2344
    onSubscribe                              N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
    subscribe                                N      reactor.core.publisher.MonoCurrentContext:35
    subscribe                                N      reactor.core.publisher.Flux:8357
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
    onNext                                   N      reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber:121
    complete                                 N      reactor.core.publisher.Operators$MonoSubscriber:1782
    onComplete                               N      reactor.core.publisher.MonoCollect$CollectSubscriber:152
    onComplete                               N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner:252
    checkTerminated                          N      reactor.core.publisher.FluxFlatMap$FlatMapMain:838
    drainLoop                                N      reactor.core.publisher.FluxFlatMap$FlatMapMain:600
    drain                                    N      reactor.core.publisher.FluxFlatMap$FlatMapMain:580
    onComplete                               N      reactor.core.publisher.FluxFlatMap$FlatMapMain:457
    checkTerminated                          N      reactor.core.publisher.FluxFlatMap$FlatMapMain:838
    drainLoop                                N      reactor.core.publisher.FluxFlatMap$FlatMapMain:600
    innerComplete                            N      reactor.core.publisher.FluxFlatMap$FlatMapMain:909
    onComplete                               N      reactor.core.publisher.FluxFlatMap$FlatMapInner:1013
    onComplete                               N      reactor.core.publisher.FluxMap$MapSubscriber:136
    onComplete                               N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2016
    onComplete                               N      reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber:78
    complete                                 N      reactor.core.publisher.FluxCreate$BaseSink:438
    drain                                    N      reactor.core.publisher.FluxCreate$BufferAsyncSink:784
    complete                                 N      reactor.core.publisher.FluxCreate$BufferAsyncSink:732
    drainLoop                                N      reactor.core.publisher.FluxCreate$SerializedSink:239
    drain                                    N      reactor.core.publisher.FluxCreate$SerializedSink:205
    complete                                 N      reactor.core.publisher.FluxCreate$SerializedSink:196
    apiCallComplete                          N      com.linbit.linstor.netcom.TcpConnectorPeer:455
    handleComplete                           N      com.linbit.linstor.proto.CommonMessageProcessor:363
    handleDataMessage                        N      com.linbit.linstor.proto.CommonMessageProcessor:287
    doProcessInOrderMessage                  N      com.linbit.linstor.proto.CommonMessageProcessor:235
    lambda$doProcessMessage$3                N      com.linbit.linstor.proto.CommonMessageProcessor:220
    subscribe                                N      reactor.core.publisher.FluxDefer:46
    subscribe                                N      reactor.core.publisher.Flux:8357
    onNext                                   N      reactor.core.publisher.FluxFlatMap$FlatMapMain:418
    drainAsync                               N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:414
    drain                                    N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:679
    onNext                                   N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:243
    drainFused                               N      reactor.core.publisher.UnicastProcessor:286
    drain                                    N      reactor.core.publisher.UnicastProcessor:329
    onNext                                   N      reactor.core.publisher.UnicastProcessor:408
    next                                     N      reactor.core.publisher.FluxCreate$IgnoreSink:618
    drainLoop                                N      reactor.core.publisher.FluxCreate$SerializedSink:248
    next                                     N      reactor.core.publisher.FluxCreate$SerializedSink:168
    processInOrder                           N      com.linbit.linstor.netcom.TcpConnectorPeer:373
    doProcessMessage                         N      com.linbit.linstor.proto.CommonMessageProcessor:218
    lambda$processMessage$2                  N      com.linbit.linstor.proto.CommonMessageProcessor:164
    onNext                                   N      reactor.core.publisher.FluxPeek$PeekSubscriber:177
    runAsync                                 N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:439
    run                                      N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:526
    call                                     N      reactor.core.scheduler.WorkerTask:84
    call                                     N      reactor.core.scheduler.WorkerTask:37
    run                                      N      java.util.concurrent.FutureTask:266
    access$201                               N      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:180
    run                                      N      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:293
    runWorker                                N      java.util.concurrent.ThreadPoolExecutor:1149
    run                                      N      java.util.concurrent.ThreadPoolExecutor$Worker:624
    run                                      N      java.lang.Thread:748

END OF ERROR REPORT.

I also tried to set a storage pool on DfltRscGrp, the resource group that contains the resource I'm trying to replicate/create, by running linstor rg m DfltRscGrp --storage-pool POOL-NAME, but it made no difference.

On a related note: linstor rg m --help says the command should be linstor rg m --storage-pool POOL-NAME RESOURCE-GROUP, but in practice it only works as linstor rg m RESOURCE-GROUP --storage-pool POOL-NAME

The questions in my head are:

  1. Did I do something wrong, and if so, what?
  2. Is there a way to automatically force the "recreation" of the resources to meet the place-count values?

I can give more info if needed.

Thanks in advance.

BR, Plamen

raltnoeder commented 3 years ago

@ghernadi, fixed in

647cc87eef9013654baf4fd226cdc3012b2e3ebc, alternate fix ee39452e23363fe662b97e1378dda55d651f2f82,

correction: 647cc87eef9013654baf4fd226cdc3012b2e3ebc, alternate fix 32b9c9508faecab08c62eb15a37029eb24563daf, pick whichever you like better.

Not sure how it ended up there, though: the bug is triggered by currentSelection being an empty set, and the code path is only entered if currentSelection.size() == replicaCount, which implies replicaCount == 0. That doesn't make much sense; something may have gone wrong earlier in the code that defines replicaCount and selects the storage pools. I recommend double-checking those code sections too.
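
For readers following along, here is a minimal sketch of how an empty selection could produce "String index out of range: -2" in StringBuilder.setLength. It assumes the failing code builds a comma-separated string and then trims the trailing ", " separator; that pattern is a guess based on the error message and the stack trace, not a copy of the actual Selector code. The class and method names (SelectionLogSketch, joinUnsafe, joinSafe) are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

public class SelectionLogSketch {
    // Hypothetical reconstruction of the failing pattern (NOT the real LINSTOR code):
    // append "name, " for each entry, then cut off the trailing ", ".
    static String joinUnsafe(Collection<String> currentSelection) {
        StringBuilder sb = new StringBuilder();
        for (String name : currentSelection) {
            sb.append(name).append(", ");
        }
        // For an empty collection sb.length() is 0, so this calls setLength(-2),
        // which throws StringIndexOutOfBoundsException.
        sb.setLength(sb.length() - 2);
        return sb.toString();
    }

    // One possible guard: String.join handles the empty case and returns "".
    static String joinSafe(Collection<String> currentSelection) {
        return String.join(", ", currentSelection);
    }

    public static void main(String[] args) {
        // Non-empty selection: both variants produce the same joined string.
        System.out.println(joinUnsafe(Arrays.asList("pool-a", "pool-b")));
        System.out.println(joinSafe(Arrays.asList("pool-a", "pool-b")));

        // Empty selection: the unsafe variant throws, the safe one does not.
        try {
            joinUnsafe(Collections.<String>emptyList());
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println("safe result is empty: "
                + joinSafe(Collections.<String>emptyList()).isEmpty());
    }
}
```

A guard like this would only avoid the crash in the string handling; it would not explain why the selection was empty in the first place, which is the open question above.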