LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0
955 stars 76 forks source link

Target decrypted buffer is too small! #219

Open kvaps opened 3 years ago

kvaps commented 3 years ago
# linstor c v
linstor controller 1.11.0; GIT-hash: 3367e32d0fa92515efe61f6963767700a8701d98
# linstor err list
ERROR:
    While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.
Show reports:
    linstor error-reports show 60140BD5-00000-000619
# linstor error-reports show 60140BD5-00000-000619
ERROR REPORT 60140BD5-00000-000619

============================================================

Application:                        LINBIT�� LINSTOR
Module:                             Controller
Version:                            1.11.0
Build ID:                           3367e32d0fa92515efe61f6963767700a8701d98
Build time:                         2020-12-22T17:52:40+00:00
Error time:                         2021-02-02 14:36:48
Node:                               linstor-controller-5488d7d845-v8557

============================================================

Reported error:
===============

Category:                           Exception
Class name:                         InvalidProtocolBufferException
Class canonical name:               com.google.protobuf.InvalidProtocolBufferException
Generated at:                       Method 'truncatedMessage', Source file 'InvalidProtocolBufferException.java', Line #84

Error message:                      While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.

Asynchronous stage backtrace:

    Error has been observed at the following site(s):
        |_ checkpoint ? Assemble error report list
    Stack trace:

Call backtrace:

    Method                                   Native Class:Line number
    truncatedMessage                         N      com.google.protobuf.InvalidProtocolBufferException:84

Suppressed exception 1 of 1:
===============
Category:                           RuntimeException
Class name:                         OnAssemblyException
Class canonical name:               reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at:                       Method 'truncatedMessage', Source file 'InvalidProtocolBufferException.java', Line #84

Error message:                      
Error has been observed at the following site(s):
    |_ checkpoint ��� Assemble error report list
Stack trace:

Call backtrace:

    Method                                   Native Class:Line number
    truncatedMessage                         N      com.google.protobuf.InvalidProtocolBufferException:84
    readRawBytesSlowPathOneChunk             N      com.google.protobuf.CodedInputStream$StreamDecoder:2923
    readBytesSlowPath                        N      com.google.protobuf.CodedInputStream$StreamDecoder:2974
    readBytes                                N      com.google.protobuf.CodedInputStream$StreamDecoder:2386
    <init>                                   N      com.linbit.linstor.proto.responses.MsgErrorReportOuterClass$MsgErrorReport:346
    <init>                                   N      com.linbit.linstor.proto.responses.MsgErrorReportOuterClass$MsgErrorReport:247
    parsePartialFrom                         N      com.linbit.linstor.proto.responses.MsgErrorReportOuterClass$MsgErrorReport$1:2516
    parsePartialFrom                         N      com.linbit.linstor.proto.responses.MsgErrorReportOuterClass$MsgErrorReport$1:2510
    parsePartialFrom                         N      com.google.protobuf.AbstractParser:215
    parsePartialDelimitedFrom                N      com.google.protobuf.AbstractParser:255
    parseDelimitedFrom                       N      com.google.protobuf.AbstractParser:267
    parseDelimitedFrom                       N      com.google.protobuf.AbstractParser:272
    parseDelimitedFrom                       N      com.google.protobuf.AbstractParser:48
    parseDelimitedWithIOException            N      com.google.protobuf.GeneratedMessageV3:375
    parseDelimitedFrom                       N      com.linbit.linstor.proto.responses.MsgErrorReportOuterClass$MsgErrorReport:1219
    deserializeErrorReports                  N      com.linbit.linstor.core.apicallhandler.controller.CtrlErrorListApiCallHandler:310
    assembleList                             N      com.linbit.linstor.core.apicallhandler.controller.CtrlErrorListApiCallHandler:297
    lambda$listErrorReports$11               N      com.linbit.linstor.core.apicallhandler.controller.CtrlErrorListApiCallHandler:205
    doInScope                                N      com.linbit.linstor.core.apicallhandler.ScopeRunner:147
    lambda$fluxInScope$0                     N      com.linbit.linstor.core.apicallhandler.ScopeRunner:75
    call                                     N      reactor.core.publisher.MonoCallable:91
    trySubscribeScalarMap                    N      reactor.core.publisher.FluxFlatMap:126
    subscribeOrReturn                        N      reactor.core.publisher.MonoFlatMapMany:49
    subscribe                                N      reactor.core.publisher.Flux:8343
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
    request                                  N      reactor.core.publisher.Operators$ScalarSubscription:2344
    onSubscribe                              N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
    subscribe                                N      reactor.core.publisher.MonoCurrentContext:35
    subscribe                                N      reactor.core.publisher.Flux:8357
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
    complete                                 N      reactor.core.publisher.Operators$MonoSubscriber:1782
    onComplete                               N      reactor.core.publisher.MonoCollectList$MonoCollectListSubscriber:121
    onComplete                               N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner:252
    checkTerminated                          N      reactor.core.publisher.FluxFlatMap$FlatMapMain:838
    drainLoop                                N      reactor.core.publisher.FluxFlatMap$FlatMapMain:600
    innerComplete                            N      reactor.core.publisher.FluxFlatMap$FlatMapMain:909
    onComplete                               N      reactor.core.publisher.FluxFlatMap$FlatMapInner:1013
    onComplete                               N      reactor.core.publisher.FluxMap$MapSubscriber:136
    onComplete                               N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2016
    complete                                 N      reactor.core.publisher.Operators:135
    subscribe                                N      reactor.core.publisher.FluxEmpty:41
    subscribe                                N      reactor.core.publisher.Flux:8357
    onError                                  N      reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber:97
    onError                                  N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2021
    error                                    N      reactor.core.publisher.FluxCreate$BaseSink:452
    drain                                    N      reactor.core.publisher.FluxCreate$BufferAsyncSink:781
    error                                    N      reactor.core.publisher.FluxCreate$BufferAsyncSink:726
    drainLoop                                N      reactor.core.publisher.FluxCreate$SerializedSink:229
    drain                                    N      reactor.core.publisher.FluxCreate$SerializedSink:205
    error                                    N      reactor.core.publisher.FluxCreate$SerializedSink:181
    connectionClosing                        N      com.linbit.linstor.netcom.TcpConnectorPeer:498
    closeConnection                          N      com.linbit.linstor.netcom.TcpConnectorService:1006
    closeAllConnections                      N      com.linbit.linstor.netcom.TcpConnectorService:1040
    uninitialize                             N      com.linbit.linstor.netcom.TcpConnectorService:1278
    run                                      N      com.linbit.linstor.netcom.TcpConnectorService:783
    run                                      N      java.lang.Thread:834

END OF ERROR REPORT.
kvaps commented 3 years ago

Ah it is just started working again out of blue sky

ghernadi commented 3 years ago

Are you really sure that you can now show now the same error report that did not work earlier?

There might be an issue if there is a Report that is quite large... I think the problematic limit is around 16MiB? (in one ErrorReport)

kvaps commented 3 years ago

Well, as you can see I was trying to get output of the list of errors. Possible the list was too big, but it is started working after a minute, not sure for what reason. :)

I'll try to investigate this if the problem will repeat

kvaps commented 3 years ago

The interesting fact that after this error, controller lost connections with the all satellites:

13:54:20.644 [TaskScheduleService] INFO  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'one-vm-8738-disk-0' of node 'm13c43' removed from retry.
14:03:23.041 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'one-vm-11373-disk-2' created.
14:39:27.536 [SslConnector] ERROR LINSTOR/Controller - SYSTEM - Target decrypted buffer is too small! [Report number 60194D5A-00000-000001]
15:20:48.626 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Resource 'one-vm-10468-disk-2' is still in use. [Report number 60194D5A-00000-000002]
16:01:30.482 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
16:01:44.462 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
16:01:54.151 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
16:02:05.017 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes

while linstor n l shows all nodes in Online but linstor sp l reurns many errors:

┊ thindata             ┊ m6c19     ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m6c2      ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m6c20     ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m6c3      ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m6c4      ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m6c5      ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m6c6      ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m6c7      ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
[...]
WARNING:
Description:
    No active connection to satellite 'm10c18'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'm10c19'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'm10c2'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
[...]

and auto-placing is not working:

# linstor rd auto-place one-vm-10242-disk-0
ERROR:
Description:
    Not enough available nodes
Details:
    Not enough nodes fulfilling the following auto-place criteria:
     * has a deployed storage pool named TransactionList [thindata]
     * the storage pools have to have at least '31457280' free space
     * the current access context has enough privileges to use the node and the storage pool
     * the node is online

    Auto-place configuration details:
      Additional place count: 1
      Replicas on different nodes: TransactionList [Aux/moonshot]
      Replicas on same nodes: TransactionList [Aux/opennebula-1]
      Don't place with resource (List): [one-vm-10242-disk-0]
      Storage pool name: TransactionList [thindata]

    Auto-placing resource: one-vm-10242-disk-0
# linstor err show 60194D5A-00000-000001
ERROR REPORT 60194D5A-00000-000001

============================================================

Application:                        LINBIT�� LINSTOR
Module:                             Controller
Version:                            1.11.0
Build ID:                           3367e32d0fa92515efe61f6963767700a8701d98
Build time:                         2020-12-22T17:52:40+00:00
Error time:                         2021-02-02 14:39:27
Node:                               linstor-controller-58d8c7956-hpswn

============================================================

Reported error:
===============

Category:                           Error
Class name:                         ImplementationError
Class canonical name:               com.linbit.ImplementationError
Generated at:                       Method 'unwrapInto', Source file 'SslTcpConnectorPeer.java', Line #241

Error message:                      Target decrypted buffer is too small!

Call backtrace:

    Method                                   Native Class:Line number
    unwrapInto                               N      com.linbit.linstor.netcom.ssl.SslTcpConnectorPeer:241
    readData                                 N      com.linbit.linstor.netcom.ssl.SslTcpConnectorPeer:287
    read                                     N      com.linbit.linstor.netcom.ssl.SslTcpConnectorPeer:195
    run                                      N      com.linbit.linstor.netcom.TcpConnectorService:550
    run                                      N      java.lang.Thread:834

END OF ERROR REPORT.
kvaps commented 3 years ago

I'm going to reopen this issue, because today situation repeated:

14:21:42.157 [SslConnector] ERROR LINSTOR/Controller - SYSTEM - Target decrypted buffer is too small! [Report number 601AB05C-00000-000003]
14:22:05.460 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-024ada25-8d28-4407-bb98-b7ada2900ac2' created.
14:22:05.545 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:23:09.856 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-9cec1356-83e4-48ae-a0d1-152735bbec9f' created.
14:23:09.937 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:23:09.938 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:27:06.453 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-024ada25-8d28-4407-bb98-b7ada2900ac2' created.
14:27:06.519 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:28:10.757 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-9cec1356-83e4-48ae-a0d1-152735bbec9f' created.
14:28:10.769 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-4469d552-b2f2-4373-9eb7-684b4439c1a0' created.
14:28:10.838 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:28:10.839 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:32:07.405 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-024ada25-8d28-4407-bb98-b7ada2900ac2' created.
14:32:07.484 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:33:11.633 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-9cec1356-83e4-48ae-a0d1-152735bbec9f' created.
14:33:11.655 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-4469d552-b2f2-4373-9eb7-684b4439c1a0' created.
14:33:11.729 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:33:11.730 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:37:08.403 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-024ada25-8d28-4407-bb98-b7ada2900ac2' created.
14:37:08.478 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:38:12.564 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-4469d552-b2f2-4373-9eb7-684b4439c1a0' created.
14:38:12.582 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-9cec1356-83e4-48ae-a0d1-152735bbec9f' created.
14:38:12.638 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:38:12.657 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:43.433 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-9cec1356-83e4-48ae-a0d1-152735bbec9f' created.
14:40:43.470 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-4469d552-b2f2-4373-9eb7-684b4439c1a0' created.
14:40:43.560 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-024ada25-8d28-4407-bb98-b7ada2900ac2' created.
14:40:43.570 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:43.603 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:43.670 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:45.438 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-9cec1356-83e4-48ae-a0d1-152735bbec9f' created.
14:40:45.497 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-4469d552-b2f2-4373-9eb7-684b4439c1a0' created.
14:40:45.499 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:45.517 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-024ada25-8d28-4407-bb98-b7ada2900ac2' created.
14:40:45.567 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:45.579 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:48.314 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-9cec1356-83e4-48ae-a0d1-152735bbec9f' created.
14:40:48.363 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:48.391 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-024ada25-8d28-4407-bb98-b7ada2900ac2' created.
14:40:48.401 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-4469d552-b2f2-4373-9eb7-684b4439c1a0' created.
14:40:48.528 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:48.586 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:52.975 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:53.103 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-9cec1356-83e4-48ae-a0d1-152735bbec9f' created.
14:40:53.137 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:40:53.201 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-024ada25-8d28-4407-bb98-b7ada2900ac2' created.
14:40:53.289 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:41:01.633 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:41:01.700 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:41:01.713 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-4469d552-b2f2-4373-9eb7-684b4439c1a0' created.
14:41:01.800 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:41:18.356 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-9cec1356-83e4-48ae-a0d1-152735bbec9f' created.
14:41:18.390 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-024ada25-8d28-4407-bb98-b7ada2900ac2' created.
14:41:18.400 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:41:18.429 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-4469d552-b2f2-4373-9eb7-684b4439c1a0' created.
14:41:18.523 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:41:18.536 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:41:51.312 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-9cec1356-83e4-48ae-a0d1-152735bbec9f' created.
14:41:51.323 [grizzly-http-server-1] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-4469d552-b2f2-4373-9eb7-684b4439c1a0' created.
14:41:51.386 [grizzly-http-server-0] INFO  LINSTOR/Controller - SYSTEM - New volume definition with number '0' of resource definition 'pvc-024ada25-8d28-4407-bb98-b7ada2900ac2' created.
14:41:51.387 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:41:51.411 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes
14:41:51.438 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Not enough available nodes

Tus after the Target decrypted buffer is too small error all storage-pools are become to Warning state:

┊ thindata             ┊ m14c43    ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m14c44    ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m14c45    ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m14c5     ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m14c6     ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m14c7     ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m14c8     ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m14c9     ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m15c1     ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
┊ thindata             ┊ m15c10    ┊ LVM_THIN ┊ data/thindata ┊              ┊               ┊ True         ┊ Warning ┊
[...]
WARNING:
Description:
    No active connection to satellite 'm9c30'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'm9c31'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'm9c32'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'm9c33'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'm9c34'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.

placing of new resource is not working anymore.

NeonSludge commented 2 years ago

This is still sometimes happening in 1.19.0. The controller starts complaining about broken connections to satellites in the linstor sp l view. At the same time, all nodes are Online.

WARNING:
Description:
    No active connection to satellite '...'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.

Error message and report:

17:21:55.973 [SslConnector] ERROR LINSTOR/Controller - SYSTEM - Target decrypted buffer is too small! [Report number 63287A29-00000-000000]
ERROR REPORT 63287A29-00000-000000

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.19.0
Build ID:                           6c838b3897e5a00cff808f7c6402ff3b17aa3c1c
Build time:                         2022-07-20T06:46:57+00:00
Error time:                         2022-09-19 17:21:55
Node:                               ...

============================================================

Reported error:
===============

Category:                           Error
Class name:                         ImplementationError
Class canonical name:               com.linbit.ImplementationError
Generated at:                       Method 'unwrapInto', Source file 'SslTcpConnectorPeer.java', Line #241

Error message:                      Target decrypted buffer is too small!

Call backtrace:

    Method                                   Native Class:Line number
    unwrapInto                               N      com.linbit.linstor.netcom.ssl.SslTcpConnectorPeer:241
    readData                                 N      com.linbit.linstor.netcom.ssl.SslTcpConnectorPeer:287
    read                                     N      com.linbit.linstor.netcom.ssl.SslTcpConnectorPeer:195
    run                                      N      com.linbit.linstor.netcom.TcpConnectorService:543
    run                                      N      java.lang.Thread:750

END OF ERROR REPORT.

OS and JVM:

# cat /etc/redhat-release
AlmaLinux release 8.6 (Sky Tiger)

# java -version
openjdk version "1.8.0_332"
OpenJDK Runtime Environment (build 1.8.0_332-b09)
OpenJDK 64-Bit Server VM (build 25.332-b09, mixed mode)

We've noticed that this error becomes much easier to reproduce if something is regularly polling the /metrics endpoint on the controller. Restarting the controller usually fixes this issue right away.

duckhawk commented 1 year ago

We still have this problem on 1.23.0 controller. The same situation

# linstor error-reports show 64BF89A7-00000-000006
ERROR REPORT 64BF89A7-00000-000006

============================================================

Application:                        LINBIT�� LINSTOR
Module:                             Controller
Version:                            1.23.0
Build ID:                           28dbd33ced60d75a2a0562bf5e9bc6b800ae8361
Build time:                         2023-06-25T16:48:26+00:00
Error time:                         2023-07-25 08:46:45
Node:                               linstor-controller-76cbd795cf-5d4zp

============================================================

Reported error:
===============

Category:                           Error
Class name:                         ImplementationError
Class canonical name:               com.linbit.ImplementationError
Generated at:                       Method 'unwrapInto', Source file 'SslTcpConnectorPeer.java', Line #241

Error message:                      Target decrypted buffer is too small!

Call backtrace:

    Method                                   Native Class:Line number
    unwrapInto                               N      com.linbit.linstor.netcom.ssl.SslTcpConnectorPeer:241
    readData                                 N      com.linbit.linstor.netcom.ssl.SslTcpConnectorPeer:287
    read                                     N      com.linbit.linstor.netcom.ssl.SslTcpConnectorPeer:195
    run                                      N      com.linbit.linstor.netcom.TcpConnectorService:543
    run                                      N      java.lang.Thread:829

END OF ERROR REPORT.

And some faulty resources (unknown state). Restart of linstor controller had no effect.

08:45:50.261 [MainWorkerPool-4] INFO  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'pvc-6b14a3ba-fb4a-4085-9c76-5264d9fc7b6a' of node 'ekb99-a-dhw05' removed from retry.
08:45:50.506 [MainWorkerPool-3] INFO  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'pvc-dd629fb3-18e5-4f90-8d6c-43eea1df1a78' of node 'ekb99-a-dhw08' removed from retry.
08:45:50.555 [MainWorkerPool-5] INFO  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'pvc-324463d7-70b4-4def-a1fb-b800afd2aaa8' of node 'ekb99-a-dhw06' removed from retry.
08:46:45.245 [SslConnector] ERROR LINSTOR/Controller - SYSTEM - Target decrypted buffer is too small! [Report number 64BF89A7-00000-000006]

08:48:25.058 [grizzly-http-server-0] WARN  LINSTOR/Controller - SYSTEM - Resource definition: pvc-5ba9f49f-8159-4713-96bc-83c14a8ddcc3, Volume number: 0 not found.
08:48:25.098 [grizzly-http-server-7] ERROR LINSTOR/Controller - SYSTEM - Resource 'pvc-5ba9f49f-8159-4713-96bc-83c14a8ddcc3' is still in use. [Report number 64BF89A7-00000-000007]

08:50:22.801 [grizzly-http-server-9] WARN  LINSTOR/Controller - SYSTEM - Resource definition: pvc-e2b38290-00d6-4b93-8da7-911a8474ad5e, Volume number: 0 not found.
08:50:22.837 [grizzly-http-server-6] ERROR LINSTOR/Controller - SYSTEM - Resource 'pvc-e2b38290-00d6-4b93-8da7-911a8474ad5e' is still in use. [Report number 64BF89A7-00000-000008]
ghernadi commented 1 year ago

That is strange. I would have expected restarting the controller to "fix" this issue (at least for a while). Can you send me an sos-report? (please find my email in my profile here) Maybe I can find something that could give us a clue.

And as always, if you by any chance find a way to reproduce this issue, that would be very helpful.

AleksZimin commented 11 months ago

Hello! Today we noticed the same behavior in the linstor-controller as when the following error occurs: [SslConnector] ERROR LINSTOR/Controller - SYSTEM - Target decrypted buffer is too small!

Symptoms: On one of the clusters, one of the linstor-satellites is OFFLINE, even though both the pod and the node were rebooted. The output of the linstor node list looked like this:

linstor node list
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node                               ┊ NodeType   ┊ Addresses               ┊ State                                        ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ ingress-0                          ┊ SATELLITE  ┊ 10.5.0.23:3367 (SSL)    ┊ Online                                       ┊
┊ ingress-1                          ┊ SATELLITE  ┊ 10.5.0.29:3367 (SSL)    ┊ Online                                       ┊
┊ linstor-controller-fb6b746dd-2pr8f ┊ CONTROLLER ┊ 10.111.1.22:3367 (SSL)  ┊ OFFLINE (Auto-eviction: Disabled)            ┊
┊ linstor-controller-fb6b746dd-s86dl ┊ CONTROLLER ┊ 10.111.7.241:3367 (SSL) ┊ Online                                       ┊
┊ master-0                           ┊ SATELLITE  ┊ 10.5.0.20:3367 (SSL)    ┊ Online                                       ┊
┊ master-1                           ┊ SATELLITE  ┊ 10.5.0.21:3367 (SSL)    ┊ OFFLINE (Auto-eviction: 2023-10-12 10:36:49) ┊
┊ master-2                           ┊ SATELLITE  ┊ 10.5.0.22:3367 (SSL)    ┊ Online                                       ┊
┊ monitoring-0                       ┊ SATELLITE  ┊ 10.5.0.15:3367 (SSL)    ┊ Online                                       ┊
┊ monitoring-1                       ┊ SATELLITE  ┊ 10.5.0.16:3367 (SSL)    ┊ Online                                       ┊
┊ node-0                             ┊ SATELLITE  ┊ 10.5.0.24:3367 (SSL)    ┊ Online                                       ┊
┊ node-1                             ┊ SATELLITE  ┊ 10.5.0.25:3367 (SSL)    ┊ Online                                       ┊
┊ node-2                             ┊ SATELLITE  ┊ 10.5.0.26:3367 (SSL)    ┊ Online                                       ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

There was nothing unusual in the logs of linstor-satellite:

kubectl -n d8-linstor logs linstor-node-s2wfr                                                                                                       

Defaulted container "linstor-satellite" out of: linstor-satellite, kube-rbac-proxy, drbd-prometheus-exporter
time="2023-10-12T10:42:05Z" level=info msg="running k8s-await-election" version=v0.3.1
time="2023-10-12T10:42:05Z" level=info msg="not running with leader election"
time="2023-10-12T10:42:05Z" level=info msg="starting command '/usr/bin/piraeus-entry.sh' with arguments: '[startSatellite]'"
Importing keystore /tmp/tmp.Gwas4bfdMz to /etc/linstor/ssl/keystore.jks...
Entry for alias linstor successfully imported.
Import command completed:  1 entries successfully imported, 0 entries failed or cancelled
Certificate was added to keystore
LINSTOR, Module Satellite
Version:            1.24.2 (60f9b2aa34ad16f7a0eb5ad8c8e306a2616ae061)
Build time:         2023-10-11T07:48:43+00:00
Java Version:       11
Java VM:            Ubuntu, Version 11.0.20.1+1-post-Ubuntu-0ubuntu122.04
Operating system:   Linux, Version 5.4.0-54-generic
Environment:        amd64, 4 processors, 4004 MiB memory reserved for allocations

System components initialization in progress

Loading configuration file "/etc/linstor/linstor_satellite.toml"
10:42:09.346 [main] INFO  LINSTOR/Satellite - SYSTEM - ErrorReporter DB first time init.
10:42:09.348 [main] INFO  LINSTOR/Satellite - SYSTEM - Log directory set to: '/var/log/linstor-satellite'
10:42:11.415 [Main] INFO  LINSTOR/Satellite - SYSTEM - Loading API classes started.
10:42:11.829 [Main] INFO  LINSTOR/Satellite - SYSTEM - API classes loading finished: 414ms
10:42:11.830 [Main] INFO  LINSTOR/Satellite - SYSTEM - Dependency injection started.
10:42:11.852 [Main] INFO  LINSTOR/Satellite - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
10:42:11.853 [Main] INFO  LINSTOR/Satellite - SYSTEM - Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
10:42:11.853 [Main] INFO  LINSTOR/Satellite - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
10:42:11.860 [Main] INFO  LINSTOR/Satellite - SYSTEM - Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
10:42:12.670 [Main] INFO  LINSTOR/Satellite - SYSTEM - Dependency injection finished: 840ms
10:42:12.670 [Main] INFO  LINSTOR/Satellite - SYSTEM - Cryptography provider: Using default cryptography module
10:42:14.723 [Main] INFO  LINSTOR/Satellite - SYSTEM - Initializing main network communications service
10:42:14.723 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'TimerEventService' of type TimerEventService
10:42:14.724 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'FileEventService' of type FileEventService
10:42:14.724 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'DrbdEventService-1' of type DrbdEventService
10:42:14.727 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'DrbdEventPublisher-1' of type DrbdEventPublisher
10:42:14.727 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'SnapshotShippingService' of type SnapshotShippingService
10:42:14.727 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'BackupShippingL2LService' of type BackupShippingL2LService
10:42:14.727 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'BackupShippingS3Service' of type BackupShippingS3Service
10:42:14.727 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'DeviceManager' of type DeviceManager
10:42:14.728 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'CloneService' of type CloneService
10:42:14.870 [Main] INFO  LINSTOR/Satellite - SYSTEM - SSLNetComService started on port /0:0:0:0:0:0:0:0:3367

We noticed that the sos report has outdated DRBD version data (version 9.2.4) for nodes master-0 and master-2 (there was no data for master-1 at all), although at the time of the sos report formation, the version of DRBD was already 9.2.5.

The command linstor sp l returned a connection error to the nodes:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| StoragePool          | Node         | Driver   | PoolName                   | FreeCapacity | TotalCapacity | CanSnapshots | State   | SharedName                        |
|=========================================================================================================================================================================|
| DfltDisklessStorPool | ingress-0    | DISKLESS |                            |              |               | False        | Warning | ingress-0;DfltDisklessStorPool    |
| DfltDisklessStorPool | ingress-1    | DISKLESS |                            |              |               | False        | Warning | ingress-1;DfltDisklessStorPool    |
| DfltDisklessStorPool | master-0     | DISKLESS |                            |              |               | False        | Warning | master-0;DfltDisklessStorPool     |
| DfltDisklessStorPool | master-1     | DISKLESS |                            |              |               | False        | Warning | master-1;DfltDisklessStorPool     |
| DfltDisklessStorPool | master-2     | DISKLESS |                            |              |               | False        | Warning | master-2;DfltDisklessStorPool     |
| DfltDisklessStorPool | monitoring-0 | DISKLESS |                            |              |               | False        | Warning | monitoring-0;DfltDisklessStorPool |
| DfltDisklessStorPool | monitoring-1 | DISKLESS |                            |              |               | False        | Warning | monitoring-1;DfltDisklessStorPool |
| DfltDisklessStorPool | node-0       | DISKLESS |                            |              |               | False        | Warning | node-0;DfltDisklessStorPool       |
| DfltDisklessStorPool | node-1       | DISKLESS |                            |              |               | False        | Warning | node-1;DfltDisklessStorPool       |
| DfltDisklessStorPool | node-2       | DISKLESS |                            |              |               | False        | Warning | node-2;DfltDisklessStorPool       |
| data                 | node-0       | LVM      | data_project               |              |               | False        | Warning | node-0;data                       |
| data                 | node-1       | LVM      | data_project               |              |               | False        | Warning | node-1;data                       |
| data                 | node-2       | LVM      | data_project               |              |               | False        | Warning | node-2;data                       |
| monitoring           | monitoring-0 | LVM      | d8_monitoring              |              |               | False        | Warning | monitoring-0;monitoring           |
| monitoring           | monitoring-1 | LVM      | d8_monitoring              |              |               | False        | Warning | monitoring-1;monitoring           |
| thindata             | node-0       | LVM_THIN | data_project_thin/thindata |              |               | True         | Warning | node-0;thindata                   |
| thindata             | node-1       | LVM_THIN | data_project_thin/thindata |              |               | True         | Warning | node-1;thindata                   |
| thindata             | node-2       | LVM_THIN | data_project_thin/thindata |              |               | True         | Warning | node-2;thindata                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
WARNING:
Description:
    No active connection to satellite 'node-0'
Details:

    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'node-1'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'node-2'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'ingress-0'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'ingress-1'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'master-0'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'master-1'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'master-2'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'monitoring-0'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'monitoring-1'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.

It appears that the linstor-controller simply stopped collecting any data from the linstor-satellite. There was such an error in the controller logs:

2023_10_12 09:37:19.697 [SslConnector] ERROR LINSTOR/Controller - SYSTEM - Problem of type 'java.lang.NullPointerException' logged to report number 6526B03E-00000-000003

2023_10_12 09:37:19.699 [SslConnector] ERROR reactor.core.publisher.Operators - Operator called default onErrorDropped
reactor.core.Exceptions$ErrorCallbackNotImplemented: com.linbit.linstor.netcom.PeerClosingConnectionException
Caused by: com.linbit.linstor.netcom.PeerClosingConnectionException: null
    at com.linbit.linstor.netcom.TcpConnectorPeer.connectionClosing(TcpConnectorPeer.java:526)
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
    *__checkpoint ? Fetch thin capacity info
Original Stack Trace:
        at com.linbit.linstor.netcom.TcpConnectorPeer.connectionClosing(TcpConnectorPeer.java:526)
        at com.linbit.linstor.netcom.TcpConnectorService.closeConnection(TcpConnectorService.java:999)
        at com.linbit.linstor.netcom.TcpConnectorService.closeAllConnections(TcpConnectorService.java:1034)
        at com.linbit.linstor.netcom.TcpConnectorService.uninitialize(TcpConnectorService.java:1287)
        at com.linbit.linstor.netcom.TcpConnectorService.run(TcpConnectorService.java:777)
        at java.base/java.lang.Thread.run(Thread.java:829)
2023_10_12 09:37:19.702 [SslConnector] ERROR reactor.core.publisher.Operators - Operator called default onErrorDropped
reactor.core.Exceptions$ErrorCallbackNotImplemented: com.linbit.linstor.netcom.PeerClosingConnectionException
Caused by: com.linbit.linstor.netcom.PeerClosingConnectionException: null
    at com.linbit.linstor.netcom.TcpConnectorPeer.connectionClosing(TcpConnectorPeer.java:526)
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
    *__checkpoint ? Fetch volume allocated
Original Stack Trace:
        at com.linbit.linstor.netcom.TcpConnectorPeer.connectionClosing(TcpConnectorPeer.java:526)
        at com.linbit.linstor.netcom.TcpConnectorService.closeConnection(TcpConnectorService.java:999)
        at com.linbit.linstor.netcom.TcpConnectorService.closeAllConnections(TcpConnectorService.java:1034)
        at com.linbit.linstor.netcom.TcpConnectorService.uninitialize(TcpConnectorService.java:1287)
        at com.linbit.linstor.netcom.TcpConnectorService.run(TcpConnectorService.java:777)
        at java.base/java.lang.Thread.run(Thread.java:829)

In the error report, there was:

ERROR REPORT 6526B03E-00000-000003

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Controller
Version:                            1.24.2
Build ID:                           60f9b2aa34ad16f7a0eb5ad8c8e306a2616ae061
Build time:                         2023-10-11T07:48:43+00:00
Error time:                         2023-10-12 09:37:19
Node:                               linstor-controller-fb6b746dd-s86dl

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method 'establishConnection', Source file 'TcpConnectorService.java', Line #961

Call backtrace:

    Method                                   Native Class:Line number
    establishConnection                      N      com.linbit.linstor.netcom.TcpConnectorService:961
    run                                      N      com.linbit.linstor.netcom.TcpConnectorService:708
    run                                      N      java.lang.Thread:829

END OF ERROR REPORT.

Restarting the linstor-controller resolves the issue. After restarting the linstor-controller, all nodes went to Online state, and the command linstor sp l executed without errors.

I have sent sos report to @ghernadi