dimm0 opened 1 year ago
Thank you. I have found and fixed the bug causing the NullPointerException.
Said exception might kill the thread that collects data for SpaceTracking, which I agree is not ideal, but I see no reason why this exception should kill the entire controller.
I do see the last log line, but could you please check for additional logs or further ErrorReports? Or is it possible that the machine simply runs out of memory?
There's no last line; I think it's getting killed before the line is flushed. And no, it has plenty of memory.
Although I think I saw some errors about getting disconnected from peers.
@WanzenBug could you comment on why the controller is killed when there are errors? I think I've seen similar behavior when it couldn't resize a volume (https://github.com/piraeusdatastore/piraeus-operator/issues/345)
It should not get killed. The only reasons would be either the liveness probe failing (unlikely, as the probe does not even use the LINSTOR API) or the LINSTOR Controller exiting.
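A quick way to tell those two cases apart is to look at the pod's events and last terminated state; a hedged sketch (the namespace is taken from the logs below, the pod name from the error report in this thread):

# Probe failures show up as "Unhealthy" events in the pod description:
kubectl -n piraeus-datastore describe pod linstor-controller-58c7d99c94-9rqws
# A controller exit shows up as the container's last terminated state:
kubectl -n piraeus-datastore get pod linstor-controller-58c7d99c94-9rqws \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'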
To get more logs, you can use a strategy described here: https://github.com/piraeusdatastore/piraeus-operator/issues/184#issuecomment-851250374
Ok, I figured out the problem. The controller is killed because the livenessProbe is failing, and it is failing because the health check returns "Services not running: SpaceTrackingService".
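For reference, a hedged sketch of reproducing that health check by hand (the deployment name is an assumption; 3370 is LINSTOR's default REST API port):

# Returns HTTP 200 when all services run; otherwise the response lists the
# failing services, e.g. "Services not running: SpaceTrackingService".
kubectl -n piraeus-datastore exec deploy/linstor-controller -- \
  curl -fsS http://localhost:3370/health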
Another error now:
ERROR REPORT 64CBFA4E-00000-000024
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.23.0
Build ID: 28dbd33ced60d75a2a0562bf5e9bc6b800ae8361
Build time: 2023-05-23T06:27:14+00:00
Error time: 2023-08-03 19:20:48
Node: linstor-controller-58c7d99c94-9rqws
Peer: RestClient(10.244.193.250; 'linstor-csi/v1.1.0-20969df70962927a06cbdf714e9ca8cc3912cb4d')
============================================================
Reported error:
===============
Category: RuntimeException
Class name: NullPointerException
Class canonical name: java.lang.NullPointerException
Generated at: Method 'listAvailableStorPools', Source file 'StorPoolFilter.java', Line #106
Error context:
Registration of resource 'pvc-c96fe2dd-6119-4677-bf6a-3985056b04f9' on node rci-nrp-gpu-03.sdsu.edu failed due to an unknown exception.
Asynchronous stage backtrace:
Error has been observed at the following site(s):
|_ checkpoint ⇢ Place anywhere on node
Stack trace:
Call backtrace:
Method Native Class:Line number
listAvailableStorPools N com.linbit.linstor.core.apicallhandler.controller.autoplacer.StorPoolFilter:106
Suppressed exception 1 of 1:
===============
Category: RuntimeException
Class name: OnAssemblyException
Class canonical name: reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at: Method 'listAvailableStorPools', Source file 'StorPoolFilter.java', Line #106
Error message:
Error has been observed at the following site(s):
|_ checkpoint ⇢ Place anywhere on node
Stack trace:
Error context:
Registration of resource 'pvc-c96fe2dd-6119-4677-bf6a-3985056b04f9' on node rci-nrp-gpu-03.sdsu.edu failed due to an unknown exception.
Call backtrace:
Method Native Class:Line number
listAvailableStorPools N com.linbit.linstor.core.apicallhandler.controller.autoplacer.StorPoolFilter:106
autoPlace N com.linbit.linstor.core.apicallhandler.controller.autoplacer.Autoplacer:74
placeAnywhereInTransaction N com.linbit.linstor.core.apicallhandler.controller.CtrlRscMakeAvailableApiCallHandler:699
lambda$placeAnywhere$9 N com.linbit.linstor.core.apicallhandler.controller.CtrlRscMakeAvailableApiCallHandler:554
doInScope N com.linbit.linstor.core.apicallhandler.ScopeRunner:149
lambda$fluxInScope$0 N com.linbit.linstor.core.apicallhandler.ScopeRunner:76
call N reactor.core.publisher.MonoCallable:91
trySubscribeScalarMap N reactor.core.publisher.FluxFlatMap:126
subscribeOrReturn N reactor.core.publisher.MonoFlatMapMany:49
subscribe N reactor.core.publisher.Flux:8343
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
request N reactor.core.publisher.Operators$ScalarSubscription:2344
onSubscribe N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
subscribe N reactor.core.publisher.MonoCurrentContext:35
subscribe N reactor.core.publisher.Flux:8357
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
request N reactor.core.publisher.Operators$ScalarSubscription:2344
onSubscribe N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
subscribe N reactor.core.publisher.MonoCurrentContext:35
subscribe N reactor.core.publisher.Flux:8357
trySubscribeScalarMap N reactor.core.publisher.FluxFlatMap:199
subscribeOrReturn N reactor.core.publisher.MonoFlatMapMany:49
subscribe N reactor.core.publisher.Flux:8343
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
request N reactor.core.publisher.Operators$ScalarSubscription:2344
onSubscribe N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
subscribe N reactor.core.publisher.MonoCurrentContext:35
subscribe N reactor.core.publisher.Flux:8357
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
onNext N reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber:121
complete N reactor.core.publisher.Operators$MonoSubscriber:1782
onComplete N reactor.core.publisher.MonoCollect$CollectSubscriber:152
onComplete N reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner:252
checkTerminated N reactor.core.publisher.FluxFlatMap$FlatMapMain:838
drainLoop N reactor.core.publisher.FluxFlatMap$FlatMapMain:600
drain N reactor.core.publisher.FluxFlatMap$FlatMapMain:580
onComplete N reactor.core.publisher.FluxFlatMap$FlatMapMain:457
checkTerminated N reactor.core.publisher.FluxFlatMap$FlatMapMain:838
drainLoop N reactor.core.publisher.FluxFlatMap$FlatMapMain:600
innerComplete N reactor.core.publisher.FluxFlatMap$FlatMapMain:909
onComplete N reactor.core.publisher.FluxFlatMap$FlatMapInner:1013
onComplete N reactor.core.publisher.FluxMap$MapSubscriber:136
onComplete N reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2016
onComplete N reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber:78
complete N reactor.core.publisher.FluxCreate$BaseSink:438
drain N reactor.core.publisher.FluxCreate$BufferAsyncSink:784
complete N reactor.core.publisher.FluxCreate$BufferAsyncSink:732
drainLoop N reactor.core.publisher.FluxCreate$SerializedSink:239
drain N reactor.core.publisher.FluxCreate$SerializedSink:205
complete N reactor.core.publisher.FluxCreate$SerializedSink:196
apiCallComplete N com.linbit.linstor.netcom.TcpConnectorPeer:470
handleComplete N com.linbit.linstor.proto.CommonMessageProcessor:363
handleDataMessage N com.linbit.linstor.proto.CommonMessageProcessor:287
doProcessInOrderMessage N com.linbit.linstor.proto.CommonMessageProcessor:235
lambda$doProcessMessage$3 N com.linbit.linstor.proto.CommonMessageProcessor:220
subscribe N reactor.core.publisher.FluxDefer:46
subscribe N reactor.core.publisher.Flux:8357
onNext N reactor.core.publisher.FluxFlatMap$FlatMapMain:418
drainAsync N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:414
drain N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:679
onNext N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:243
drainFused N reactor.core.publisher.UnicastProcessor:286
drain N reactor.core.publisher.UnicastProcessor:329
onNext N reactor.core.publisher.UnicastProcessor:408
next N reactor.core.publisher.FluxCreate$IgnoreSink:618
drainLoop N reactor.core.publisher.FluxCreate$SerializedSink:248
next N reactor.core.publisher.FluxCreate$SerializedSink:168
processInOrder N com.linbit.linstor.netcom.TcpConnectorPeer:388
doProcessMessage N com.linbit.linstor.proto.CommonMessageProcessor:218
lambda$processMessage$2 N com.linbit.linstor.proto.CommonMessageProcessor:164
onNext N reactor.core.publisher.FluxPeek$PeekSubscriber:177
runAsync N reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:439
run N reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:526
call N reactor.core.scheduler.WorkerTask:84
call N reactor.core.scheduler.WorkerTask:37
run N java.util.concurrent.FutureTask:264
run N java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
runWorker N java.util.concurrent.ThreadPoolExecutor:1128
run N java.util.concurrent.ThreadPoolExecutor$Worker:628
run N java.lang.Thread:829
END OF ERROR REPORT.
Could I please get an image with at least the first error fixed?
Still stuck and can't use the storage... Help, please.
The fix for the first exception ("Uncaught exception in k") was released with v1.24.0.
For the second exception, I'm still not sure how you can get a null reference there. Would you mind sending me a database dump to my email address (see my profile) so that I can investigate a bit further?
Ah, I missed it. Thanks!
Do you mean the kubernetes objects? All of them? As yamls?
Sure, usually I do something like these two lines:
# Dump every instance of LINSTOR's internal CRDs, one YAML file per CRD:
kubectl get crds | grep -o ".*.internal.linstor.linbit.com" | xargs -i{} sh -c "kubectl get {} -oyaml > ./k8s/{}.yaml"
# Dump the CRD definitions themselves:
kubectl get crd -oyaml > ./k8s_crds.yaml
Here's the attempt to run 1.24:
run-migration time="2023-08-09T06:37:58Z" level=info msg="running k8s-await-election" version=refs/tags/v0.3.1
run-migration time="2023-08-09T06:37:58Z" level=info msg="no status endpoint specified, will not be created"
run-migration I0809 06:37:58.920430 1 leaderelection.go:248] attempting to acquire leader lease piraeus-datastore/linstor-controller...
run-migration I0809 06:37:59.106675 1 leaderelection.go:258] successfully acquired lease piraeus-datastore/linstor-controller
run-migration time="2023-08-09T06:37:59Z" level=info msg="long live our new leader: 'linstor-controller-6d44f47c48-ndt4r'!"
run-migration time="2023-08-09T06:37:59Z" level=info msg="starting command '/usr/bin/piraeus-entry.sh' with arguments: '[runMigration]'"
run-migration Loading configuration file "/etc/linstor/linstor.toml"
run-migration INFO: Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
run-migration INFO: Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
run-migration INFO: Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
run-migration DEBUG: Constructing instance of module "com.linbit.linstor.modularcrypto.JclCryptoModule" with default constructor
run-migration INFO: Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
run-migration INFO: Cryptography provider: Using default cryptography module
stream logs failed container "linstor-controller" in pod "linstor-controller-6d44f47c48-ndt4r" is waiting to start: PodInitializing for piraeus-datastore/linstor-controller-6d44f47c48-ndt4r (linstor-controller)
run-migration INFO: Kubernetes-CRD connection URL is "k8s"
run-migration 06:37:59.937 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client from Kubernetes config...
run-migration 06:37:59.939 [main] DEBUG io.fabric8.kubernetes.client.Config -- Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
run-migration 06:37:59.940 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client from service account...
run-migration 06:37:59.941 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account host and port: 10.96.0.1:443
run-migration 06:37:59.941 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt}].
run-migration 06:37:59.941 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
run-migration 06:37:59.941 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client namespace from Kubernetes service account namespace path...
run-migration 06:37:59.941 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
run-migration 06:37:59.948 [main] DEBUG io.fabric8.kubernetes.client.utils.HttpClientUtils -- Using httpclient io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory factory
run-migration TRACE: Found database version 12
run-migration needs migration
run-migration NAME TYPE DATA AGE
run-migration linstor-backup-for-linstor-controller-6d44f47c48-ndt4r piraeus.io/linstor-backup 1 5m3s
run-migration Loading configuration file "/etc/linstor/linstor.toml"
run-migration INFO: Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
run-migration INFO: Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
run-migration INFO: Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
run-migration DEBUG: Constructing instance of module "com.linbit.linstor.modularcrypto.JclCryptoModule" with default constructor
run-migration INFO: Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
run-migration INFO: Cryptography provider: Using default cryptography module
run-migration INFO: Initializing the k8s crd database connector
run-migration INFO: Kubernetes-CRD connection URL is "k8s"
run-migration 06:38:04.581 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client from Kubernetes config...
run-migration 06:38:04.583 [main] DEBUG io.fabric8.kubernetes.client.Config -- Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
run-migration 06:38:04.584 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client from service account...
run-migration 06:38:04.585 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account host and port: 10.96.0.1:443
run-migration 06:38:04.585 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt}].
run-migration 06:38:04.585 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
run-migration 06:38:04.585 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client namespace from Kubernetes service account namespace path...
run-migration 06:38:04.585 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
run-migration 06:38:04.592 [main] DEBUG io.fabric8.kubernetes.client.utils.HttpClientUtils -- Using httpclient io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory factory
run-migration TRACE: Found database version 12
run-migration DEBUG: Migration DB: 12 -> 13: Upper case props instance
run-migration 06:38:11.513 [OkHttp https://10.96.0.1/...] DEBUG io.fabric8.kubernetes.client.http.StandardHttpClient -- HTTP operation on url: https://10.96.0.1:443/apis/internal.linstor.linbit.com/v1/rollback should be retried as the response code was 500, retrying after 100 millis
stream logs failed container "linstor-controller" in pod "linstor-controller-6d44f47c48-ndt4r" is waiting to start: PodInitializing for piraeus-datastore/linstor-controller-6d44f47c48-ndt4r (linstor-controller)
run-migration 06:38:15.392 [OkHttp https://10.96.0.1/...] DEBUG io.fabric8.kubernetes.client.http.StandardHttpClient -- HTTP operation on url: https://10.96.0.1:443/apis/internal.linstor.linbit.com/v1/rollback should be retried as the response code was 500, retrying after 200 millis
stream logs failed container "linstor-controller" in pod "linstor-controller-6d44f47c48-ndt4r" is waiting to start: PodInitializing for piraeus-datastore/linstor-controller-6d44f47c48-ndt4r (linstor-controller)
run-migration 06:38:19.398 [OkHttp https://10.96.0.1/...] DEBUG io.fabric8.kubernetes.client.http.StandardHttpClient -- HTTP operation on url: https://10.96.0.1:443/apis/internal.linstor.linbit.com/v1/rollback should be retried as the response code was 500, retrying after 400 millis
stream logs failed container "linstor-controller" in pod "linstor-controller-6d44f47c48-ndt4r" is waiting to start: PodInitializing for piraeus-datastore/linstor-controller-6d44f47c48-ndt4r (linstor-controller)
run-migration Exception in thread "main" picocli.CommandLine$ExecutionException: Error while calling command (com.linbit.linstor.core.LinstorConfigTool$CmdRunMigration@5ce81285): com.linbit.SystemServiceStartException: Database initialization error
run-migration at picocli.CommandLine.executeUserObject(CommandLine.java:2050)
run-migration at picocli.CommandLine.access$1500(CommandLine.java:148)
run-migration at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
run-migration at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
run-migration at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
run-migration at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2264)
run-migration at picocli.CommandLine.parseWithHandlers(CommandLine.java:2664)
run-migration at picocli.CommandLine.parseWithHandler(CommandLine.java:2599)
run-migration at com.linbit.linstor.core.LinstorConfigTool.main(LinstorConfigTool.java:376)
run-migration Caused by: com.linbit.SystemServiceStartException: Database initialization error
run-migration at com.linbit.linstor.dbcp.k8s.crd.DbK8sCrdInitializer.initialize(DbK8sCrdInitializer.java:59)
run-migration at com.linbit.linstor.core.LinstorConfigTool$CmdRunMigration.call(LinstorConfigTool.java:336)
run-migration at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
run-migration ... 8 more
run-migration Caused by: com.linbit.linstor.LinStorDBRuntimeException: Exception occurred during migration
run-migration at com.linbit.linstor.dbcp.k8s.crd.DbK8sCrd.migrate(DbK8sCrd.java:191)
run-migration at com.linbit.linstor.dbcp.k8s.crd.DbK8sCrd.migrate(DbK8sCrd.java:124)
run-migration at com.linbit.linstor.dbcp.k8s.crd.DbK8sCrdInitializer.initialize(DbK8sCrdInitializer.java:54)
run-migration ... 10 more
run-migration Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1:443/apis/internal.linstor.linbit.com/v1/rollback. Message: etcdserver: request is too large. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=etcdserver: request is too large, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
run-migration at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:518)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:535)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:340)
run-migration at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:703)
run-migration at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:92)
run-migration at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
run-migration at com.linbit.linstor.transaction.ControllerK8sCrdRollbackMgr.createRollbackEntry(ControllerK8sCrdRollbackMgr.java:113)
run-migration at com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr.commit(ControllerK8sCrdTransactionMgr.java:152)
run-migration at com.linbit.linstor.dbcp.migration.k8s.crd.BaseK8sCrdMigration.migrate(BaseK8sCrdMigration.java:252)
run-migration at com.linbit.linstor.dbcp.k8s.crd.DbK8sCrd.migrate(DbK8sCrd.java:179)
run-migration ... 12 more
run-migration Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1:443/apis/internal.linstor.linbit.com/v1/rollback. Message: etcdserver: request is too large. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=etcdserver: request is too large, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:671)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:651)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:600)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:560)
run-migration at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642)
run-migration at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
run-migration at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
run-migration at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:140)
run-migration at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
run-migration at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
run-migration at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
run-migration at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
run-migration at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
run-migration at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
run-migration at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
run-migration at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
run-migration at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
run-migration at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:137)
run-migration at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
run-migration at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
run-migration at java.base/java.lang.Thread.run(Thread.java:829)
stream logs failed container "linstor-controller" in pod "linstor-controller-6d44f47c48-ndt4r" is waiting to start: PodInitializing for piraeus-datastore/linstor-controller-6d44f47c48-ndt4r (linstor-controller)
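The root cause in the trace above is the "etcdserver: request is too large" response: etcd rejects write requests above a size limit (roughly 1.5 MiB by default), and the 12 -> 13 props migration commits its rollback data as a single object on the /v1/rollback endpoint. A hedged way to gauge how much data that rollback object would have to carry (the resource name is an assumption based on the internal CRD naming seen in this thread):

# Total size of the stored props containers; if this is anywhere near etcd's
# request limit, a single rollback object cannot hold it.
kubectl get propscontainers.internal.linstor.linbit.com -oyaml | wc -c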
Right, thanks, we are already aware of this issue and are working on a fix for it.
Done
Hello again!
We tried to investigate a bit more into this issue and figured out that both errors (the "Uncaught exception in k" as well as the current NullPointerException) might only be side effects of another error you might simply have missed. Since the NullPointerException also had the error number ...-00024, can you show us the previous error reports? In case the controller is up and running, you can simply provide an sos-report. Feel free to either post it here or send it to me again as an email.
I've sent the whole /var/log/linstor-controller folder to the email, is that good?
Thanks for the reports, and yes they were helpful. It looks like you have some issues with your network, as the first few ErrorReports state:
Error message: Network is unreachable
I agree that LINSTOR should handle this case better and not allow other components, such as SpaceTracking or the autoplacer, to run into NullPointerExceptions like those in your other ErrorReports, but for now you should investigate the connectivity issue to "fix" the problem. We will try to find a way to improve LINSTOR's error handling in this case.
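A hedged sketch for spotting the unreachable satellites from the controller (the deployment name is an assumption; "linstor node list" is the standard CLI call):

# Satellites the controller cannot reach show up as OFFLINE here:
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor node list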
Some satellites are not available; it's a big cluster... Are you saying that's the problem? Ok, trying with just a few satellites (if it will let me remove the rest...)
After I manually deleted all "unknown" nodes, I could mount the volumes, as this error went away:
Reported error:
===============
Category: RuntimeException
Class name: NullPointerException
Class canonical name: java.lang.NullPointerException
Generated at: Method 'listAvailableStorPools', Source file 'StorPoolFilter.java', Line #106
Error context:
Registration of resource 'pvc-eb9013ba-6125-4a25-b780-ada9a47b3954' on node rci-nrp-dtn-01.sdsu.edu failed due to an unknown exception.
Asynchronous stage backtrace:
Error has been observed at the following site(s):
|_ checkpoint ⇢ Place anywhere on node
Stack trace:
Call backtrace:
Method Native Class:Line number
listAvailableStorPools N com.linbit.linstor.core.apicallhandler.controller.autoplacer.StorPoolFilter:106
Suppressed exception 1 of 1:
===============
Category: RuntimeException
Class name: OnAssemblyException
Class canonical name: reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at: Method 'listAvailableStorPools', Source file 'StorPoolFilter.java', Line #106
Error message:
Error has been observed at the following site(s):
|_ checkpoint ⇢ Place anywhere on node
Stack trace:
Error context:
Registration of resource 'pvc-eb9013ba-6125-4a25-b780-ada9a47b3954' on node rci-nrp-dtn-01.sdsu.edu failed due to an unknown exception.
Call backtrace:
Method Native Class:Line number
listAvailableStorPools N com.linbit.linstor.core.apicallhandler.controller.autoplacer.StorPoolFilter:106
autoPlace N com.linbit.linstor.core.apicallhandler.controller.autoplacer.Autoplacer:74
placeAnywhereInTransaction N com.linbit.linstor.core.apicallhandler.controller.CtrlRscMakeAvailableApiCallHandler:699
lambda$placeAnywhere$9 N com.linbit.linstor.core.apicallhandler.controller.CtrlRscMakeAvailableApiCallHandler:554
doInScope N com.linbit.linstor.core.apicallhandler.ScopeRunner:149
lambda$fluxInScope$0 N com.linbit.linstor.core.apicallhandler.ScopeRunner:76
call N reactor.core.publisher.MonoCallable:91
trySubscribeScalarMap N reactor.core.publisher.FluxFlatMap:126
subscribeOrReturn N reactor.core.publisher.MonoFlatMapMany:49
subscribe N reactor.core.publisher.Flux:8343
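For reference, a hedged sketch of the manual cleanup described above ("linstor node lost" force-removes a node the controller can no longer reach; the node name here is one of the examples from this thread):

# Force-remove a permanently unreachable satellite:
kubectl -n piraeus-datastore exec deploy/linstor-controller -- \
  linstor node lost rci-nrp-gpu-03.sdsu.edu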
I can't delete the broken nodes permanently. Once I delete the "unknown" nodes, it works, but after that the operator re-adds them, even though I reduced the diskless satelliteset to just a few nodes.
This seems more like an operator issue, then. Are you sure you used the right label to limit the satellites? You need to set the nodeSelector in the LinstorCluster resource. The LinstorSatelliteConfiguration labels only tell the Operator which nodes the config should apply to.
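A hedged sketch of setting that selector (the LinstorCluster resource name and the node label are assumptions; the chosen label has to be applied to the intended nodes first):

# Limit satellites to nodes carrying the chosen label:
kubectl patch linstorcluster linstorcluster --type merge \
  -p '{"spec": {"nodeSelector": {"example.com/linstor-satellite": "true"}}}'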
I'm still having the above issue with the controller crashlooping once there are "Unknown" nodes in the cluster (and I can't delete those because of an error in the operator). Can at least the controller be fixed, please?
Controller logs: