dimm0 opened 1 year ago
Thank you. I have found and fixed the bug causing the NullPointerException.
Said exception might kill the thread that collects data for SpaceTracking, which I agree is not ideal, but I see no reason why this exception should kill the entire controller.
I do see the last log line, but could you please check for additional logs or further ErrorReports? Or is it possible that the machine simply runs out of memory?
There's no last line; I think it's getting killed before the line is flushed. And no, it has plenty of memory.
Although I think I saw some errors about getting disconnected from peers.
@WanzenBug could you comment on why the controller is killed when there are errors? I think I've seen similar behavior when it couldn't resize a volume (https://github.com/piraeusdatastore/piraeus-operator/issues/345)
It should not get killed. The only reasons would be either the liveness probe failing (unlikely, as the probe does not even use the LINSTOR API) or the LINSTOR Controller exiting.
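A quick way to tell those two cases apart is to look at the pod's events and last terminated state; a hedged sketch (the namespace is taken from the logs below, the pod name from the error report in this thread):

# Probe failures show up as "Unhealthy" events in the pod description:
kubectl -n piraeus-datastore describe pod linstor-controller-58c7d99c94-9rqws
# A controller exit shows up as the container's last terminated state:
kubectl -n piraeus-datastore get pod linstor-controller-58c7d99c94-9rqws \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'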
To get more logs, you can use a strategy described here: https://github.com/piraeusdatastore/piraeus-operator/issues/184#issuecomment-851250374
Ok, I figured out the problem. The controller is killed because the livenessProbe is failing, and it is failing because the health check returns "Services not running: SpaceTrackingService".
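For reference, a hedged sketch of reproducing that health check by hand (the deployment name is an assumption; 3370 is LINSTOR's default REST API port):

# Returns HTTP 200 when all services run; otherwise the response lists the
# failing services, e.g. "Services not running: SpaceTrackingService".
kubectl -n piraeus-datastore exec deploy/linstor-controller -- \
  curl -fsS http://localhost:3370/health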
Another error now:
ERROR REPORT 64CBFA4E-00000-000024
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.23.0
Build ID: 28dbd33ced60d75a2a0562bf5e9bc6b800ae8361
Build time: 2023-05-23T06:27:14+00:00
Error time: 2023-08-03 19:20:48
Node: linstor-controller-58c7d99c94-9rqws
Peer: RestClient(10.244.193.250; 'linstor-csi/v1.1.0-20969df70962927a06cbdf714e9ca8cc3912cb4d')
============================================================
Reported error:
===============
Category: RuntimeException
Class name: NullPointerException
Class canonical name: java.lang.NullPointerException
Generated at: Method 'listAvailableStorPools', Source file 'StorPoolFilter.java', Line #106
Error context:
Registration of resource 'pvc-c96fe2dd-6119-4677-bf6a-3985056b04f9' on node rci-nrp-gpu-03.sdsu.edu failed due to an unknown exception.
Asynchronous stage backtrace:
Error has been observed at the following site(s):
|_ checkpoint ⇢ Place anywhere on node
Stack trace:
Call backtrace:
Method Native Class:Line number
listAvailableStorPools N com.linbit.linstor.core.apicallhandler.controller.autoplacer.StorPoolFilter:106
Suppressed exception 1 of 1:
===============
Category: RuntimeException
Class name: OnAssemblyException
Class canonical name: reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at: Method 'listAvailableStorPools', Source file 'StorPoolFilter.java', Line #106
Error message:
Error has been observed at the following site(s):
|_ checkpoint ⇢ Place anywhere on node
Stack trace:
Error context:
Registration of resource 'pvc-c96fe2dd-6119-4677-bf6a-3985056b04f9' on node rci-nrp-gpu-03.sdsu.edu failed due to an unknown exception.
Call backtrace:
Method Native Class:Line number
listAvailableStorPools N com.linbit.linstor.core.apicallhandler.controller.autoplacer.StorPoolFilter:106
autoPlace N com.linbit.linstor.core.apicallhandler.controller.autoplacer.Autoplacer:74
placeAnywhereInTransaction N com.linbit.linstor.core.apicallhandler.controller.CtrlRscMakeAvailableApiCallHandler:699
lambda$placeAnywhere$9 N com.linbit.linstor.core.apicallhandler.controller.CtrlRscMakeAvailableApiCallHandler:554
doInScope N com.linbit.linstor.core.apicallhandler.ScopeRunner:149
lambda$fluxInScope$0 N com.linbit.linstor.core.apicallhandler.ScopeRunner:76
call N reactor.core.publisher.MonoCallable:91
trySubscribeScalarMap N reactor.core.publisher.FluxFlatMap:126
subscribeOrReturn N reactor.core.publisher.MonoFlatMapMany:49
subscribe N reactor.core.publisher.Flux:8343
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
request N reactor.core.publisher.Operators$ScalarSubscription:2344
onSubscribe N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
subscribe N reactor.core.publisher.MonoCurrentContext:35
subscribe N reactor.core.publisher.Flux:8357
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
request N reactor.core.publisher.Operators$ScalarSubscription:2344
onSubscribe N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
subscribe N reactor.core.publisher.MonoCurrentContext:35
subscribe N reactor.core.publisher.Flux:8357
trySubscribeScalarMap N reactor.core.publisher.FluxFlatMap:199
subscribeOrReturn N reactor.core.publisher.MonoFlatMapMany:49
subscribe N reactor.core.publisher.Flux:8343
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
request N reactor.core.publisher.Operators$ScalarSubscription:2344
onSubscribe N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
subscribe N reactor.core.publisher.MonoCurrentContext:35
subscribe N reactor.core.publisher.Flux:8357
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
onNext N reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber:121
complete N reactor.core.publisher.Operators$MonoSubscriber:1782
onComplete N reactor.core.publisher.MonoCollect$CollectSubscriber:152
onComplete N reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner:252
checkTerminated N reactor.core.publisher.FluxFlatMap$FlatMapMain:838
drainLoop N reactor.core.publisher.FluxFlatMap$FlatMapMain:600
drain N reactor.core.publisher.FluxFlatMap$FlatMapMain:580
onComplete N reactor.core.publisher.FluxFlatMap$FlatMapMain:457
checkTerminated N reactor.core.publisher.FluxFlatMap$FlatMapMain:838
drainLoop N reactor.core.publisher.FluxFlatMap$FlatMapMain:600
innerComplete N reactor.core.publisher.FluxFlatMap$FlatMapMain:909
onComplete N reactor.core.publisher.FluxFlatMap$FlatMapInner:1013
onComplete N reactor.core.publisher.FluxMap$MapSubscriber:136
onComplete N reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2016
onComplete N reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber:78
complete N reactor.core.publisher.FluxCreate$BaseSink:438
drain N reactor.core.publisher.FluxCreate$BufferAsyncSink:784
complete N reactor.core.publisher.FluxCreate$BufferAsyncSink:732
drainLoop N reactor.core.publisher.FluxCreate$SerializedSink:239
drain N reactor.core.publisher.FluxCreate$SerializedSink:205
complete N reactor.core.publisher.FluxCreate$SerializedSink:196
apiCallComplete N com.linbit.linstor.netcom.TcpConnectorPeer:470
handleComplete N com.linbit.linstor.proto.CommonMessageProcessor:363
handleDataMessage N com.linbit.linstor.proto.CommonMessageProcessor:287
doProcessInOrderMessage N com.linbit.linstor.proto.CommonMessageProcessor:235
lambda$doProcessMessage$3 N com.linbit.linstor.proto.CommonMessageProcessor:220
subscribe N reactor.core.publisher.FluxDefer:46
subscribe N reactor.core.publisher.Flux:8357
onNext N reactor.core.publisher.FluxFlatMap$FlatMapMain:418
drainAsync N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:414
drain N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:679
onNext N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:243
drainFused N reactor.core.publisher.UnicastProcessor:286
drain N reactor.core.publisher.UnicastProcessor:329
onNext N reactor.core.publisher.UnicastProcessor:408
next N reactor.core.publisher.FluxCreate$IgnoreSink:618
drainLoop N reactor.core.publisher.FluxCreate$SerializedSink:248
next N reactor.core.publisher.FluxCreate$SerializedSink:168
processInOrder N com.linbit.linstor.netcom.TcpConnectorPeer:388
doProcessMessage N com.linbit.linstor.proto.CommonMessageProcessor:218
lambda$processMessage$2 N com.linbit.linstor.proto.CommonMessageProcessor:164
onNext N reactor.core.publisher.FluxPeek$PeekSubscriber:177
runAsync N reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:439
run N reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:526
call N reactor.core.scheduler.WorkerTask:84
call N reactor.core.scheduler.WorkerTask:37
run N java.util.concurrent.FutureTask:264
run N java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
runWorker N java.util.concurrent.ThreadPoolExecutor:1128
run N java.util.concurrent.ThreadPoolExecutor$Worker:628
run N java.lang.Thread:829
END OF ERROR REPORT.
Could I please get an image with at least the first error fixed?
Still stuck and can't use the storage... Help, please.
The fix for the first exception ("Uncaught exception in k") was released with v1.24.0.
For the second exception, I'm still not sure how you can get a null reference there. Would you mind sending me a database dump to my email address (see my profile) so that I can investigate a bit further?
Ah, I missed it. Thanks!
Do you mean the kubernetes objects? All of them? As yamls?
Sure, usually I do something like these two lines:
# Dump every instance of LINSTOR's internal CRDs, one YAML file per CRD:
kubectl get crds | grep -o ".*.internal.linstor.linbit.com" | xargs -i{} sh -c "kubectl get {} -oyaml > ./k8s/{}.yaml"
# Dump the CRD definitions themselves:
kubectl get crd -oyaml > ./k8s_crds.yaml
Here's the attempt to run 1.24:
run-migration time="2023-08-09T06:37:58Z" level=info msg="running k8s-await-election" version=refs/tags/v0.3.1
run-migration time="2023-08-09T06:37:58Z" level=info msg="no status endpoint specified, will not be created"
run-migration I0809 06:37:58.920430 1 leaderelection.go:248] attempting to acquire leader lease piraeus-datastore/linstor-controller...
run-migration I0809 06:37:59.106675 1 leaderelection.go:258] successfully acquired lease piraeus-datastore/linstor-controller
run-migration time="2023-08-09T06:37:59Z" level=info msg="long live our new leader: 'linstor-controller-6d44f47c48-ndt4r'!"
run-migration time="2023-08-09T06:37:59Z" level=info msg="starting command '/usr/bin/piraeus-entry.sh' with arguments: '[runMigration]'"
run-migration Loading configuration file "/etc/linstor/linstor.toml"
run-migration INFO: Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
run-migration INFO: Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
run-migration INFO: Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
run-migration DEBUG: Constructing instance of module "com.linbit.linstor.modularcrypto.JclCryptoModule" with default constructor
run-migration INFO: Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
run-migration INFO: Cryptography provider: Using default cryptography module
stream logs failed container "linstor-controller" in pod "linstor-controller-6d44f47c48-ndt4r" is waiting to start: PodInitializing for piraeus-datastore/linstor-controller-6d44f47c48-ndt4r (linstor-controller)
run-migration INFO: Kubernetes-CRD connection URL is "k8s"
run-migration 06:37:59.937 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client from Kubernetes config...
run-migration 06:37:59.939 [main] DEBUG io.fabric8.kubernetes.client.Config -- Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
run-migration 06:37:59.940 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client from service account...
run-migration 06:37:59.941 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account host and port: 10.96.0.1:443
run-migration 06:37:59.941 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt}].
run-migration 06:37:59.941 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
run-migration 06:37:59.941 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client namespace from Kubernetes service account namespace path...
run-migration 06:37:59.941 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
run-migration 06:37:59.948 [main] DEBUG io.fabric8.kubernetes.client.utils.HttpClientUtils -- Using httpclient io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory factory
run-migration TRACE: Found database version 12
run-migration needs migration
run-migration NAME TYPE DATA AGE
run-migration linstor-backup-for-linstor-controller-6d44f47c48-ndt4r piraeus.io/linstor-backup 1 5m3s
run-migration Loading configuration file "/etc/linstor/linstor.toml"
run-migration INFO: Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
run-migration INFO: Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
run-migration INFO: Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
run-migration DEBUG: Constructing instance of module "com.linbit.linstor.modularcrypto.JclCryptoModule" with default constructor
run-migration INFO: Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
run-migration INFO: Cryptography provider: Using default cryptography module
run-migration INFO: Initializing the k8s crd database connector
run-migration INFO: Kubernetes-CRD connection URL is "k8s"
run-migration 06:38:04.581 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client from Kubernetes config...
run-migration 06:38:04.583 [main] DEBUG io.fabric8.kubernetes.client.Config -- Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
run-migration 06:38:04.584 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client from service account...
run-migration 06:38:04.585 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account host and port: 10.96.0.1:443
run-migration 06:38:04.585 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt}].
run-migration 06:38:04.585 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
run-migration 06:38:04.585 [main] DEBUG io.fabric8.kubernetes.client.Config -- Trying to configure client namespace from Kubernetes service account namespace path...
run-migration 06:38:04.585 [main] DEBUG io.fabric8.kubernetes.client.Config -- Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
run-migration 06:38:04.592 [main] DEBUG io.fabric8.kubernetes.client.utils.HttpClientUtils -- Using httpclient io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory factory
run-migration TRACE: Found database version 12
run-migration DEBUG: Migration DB: 12 -> 13: Upper case props instance
run-migration 06:38:11.513 [OkHttp https://10.96.0.1/...] DEBUG io.fabric8.kubernetes.client.http.StandardHttpClient -- HTTP operation on url: https://10.96.0.1:443/apis/internal.linstor.linbit.com/v1/rollback should be retried as the response code was 500, retrying after 100 millis
stream logs failed container "linstor-controller" in pod "linstor-controller-6d44f47c48-ndt4r" is waiting to start: PodInitializing for piraeus-datastore/linstor-controller-6d44f47c48-ndt4r (linstor-controller)
run-migration 06:38:15.392 [OkHttp https://10.96.0.1/...] DEBUG io.fabric8.kubernetes.client.http.StandardHttpClient -- HTTP operation on url: https://10.96.0.1:443/apis/internal.linstor.linbit.com/v1/rollback should be retried as the response code was 500, retrying after 200 millis
stream logs failed container "linstor-controller" in pod "linstor-controller-6d44f47c48-ndt4r" is waiting to start: PodInitializing for piraeus-datastore/linstor-controller-6d44f47c48-ndt4r (linstor-controller)
run-migration 06:38:19.398 [OkHttp https://10.96.0.1/...] DEBUG io.fabric8.kubernetes.client.http.StandardHttpClient -- HTTP operation on url: https://10.96.0.1:443/apis/internal.linstor.linbit.com/v1/rollback should be retried as the response code was 500, retrying after 400 millis
stream logs failed container "linstor-controller" in pod "linstor-controller-6d44f47c48-ndt4r" is waiting to start: PodInitializing for piraeus-datastore/linstor-controller-6d44f47c48-ndt4r (linstor-controller)
run-migration Exception in thread "main" picocli.CommandLine$ExecutionException: Error while calling command (com.linbit.linstor.core.LinstorConfigTool$CmdRunMigration@5ce81285): com.linbit.SystemServiceStartException: Database initialization error
run-migration at picocli.CommandLine.executeUserObject(CommandLine.java:2050)
run-migration at picocli.CommandLine.access$1500(CommandLine.java:148)
run-migration at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
run-migration at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
run-migration at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
run-migration at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2264)
run-migration at picocli.CommandLine.parseWithHandlers(CommandLine.java:2664)
run-migration at picocli.CommandLine.parseWithHandler(CommandLine.java:2599)
run-migration at com.linbit.linstor.core.LinstorConfigTool.main(LinstorConfigTool.java:376)
run-migration Caused by: com.linbit.SystemServiceStartException: Database initialization error
run-migration at com.linbit.linstor.dbcp.k8s.crd.DbK8sCrdInitializer.initialize(DbK8sCrdInitializer.java:59)
run-migration at com.linbit.linstor.core.LinstorConfigTool$CmdRunMigration.call(LinstorConfigTool.java:336)
run-migration at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
run-migration ... 8 more
run-migration Caused by: com.linbit.linstor.LinStorDBRuntimeException: Exception occurred during migration
run-migration at com.linbit.linstor.dbcp.k8s.crd.DbK8sCrd.migrate(DbK8sCrd.java:191)
run-migration at com.linbit.linstor.dbcp.k8s.crd.DbK8sCrd.migrate(DbK8sCrd.java:124)
run-migration at com.linbit.linstor.dbcp.k8s.crd.DbK8sCrdInitializer.initialize(DbK8sCrdInitializer.java:54)
run-migration ... 10 more
run-migration Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1:443/apis/internal.linstor.linbit.com/v1/rollback. Message: etcdserver: request is too large. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=etcdserver: request is too large, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
run-migration at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:518)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:535)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:340)
run-migration at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:703)
run-migration at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:92)
run-migration at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
run-migration at com.linbit.linstor.transaction.ControllerK8sCrdRollbackMgr.createRollbackEntry(ControllerK8sCrdRollbackMgr.java:113)
run-migration at com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr.commit(ControllerK8sCrdTransactionMgr.java:152)
run-migration at com.linbit.linstor.dbcp.migration.k8s.crd.BaseK8sCrdMigration.migrate(BaseK8sCrdMigration.java:252)
run-migration at com.linbit.linstor.dbcp.k8s.crd.DbK8sCrd.migrate(DbK8sCrd.java:179)
run-migration ... 12 more
run-migration Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1:443/apis/internal.linstor.linbit.com/v1/rollback. Message: etcdserver: request is too large. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=etcdserver: request is too large, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:671)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:651)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:600)
run-migration at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:560)
run-migration at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642)
run-migration at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
run-migration at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
run-migration at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:140)
run-migration at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
run-migration at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
run-migration at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
run-migration at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
run-migration at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
run-migration at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
run-migration at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
run-migration at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
run-migration at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
run-migration at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:137)
run-migration at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
run-migration at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
run-migration at java.base/java.lang.Thread.run(Thread.java:829)
stream logs failed container "linstor-controller" in pod "linstor-controller-6d44f47c48-ndt4r" is waiting to start: PodInitializing for piraeus-datastore/linstor-controller-6d44f47c48-ndt4r (linstor-controller)
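The root cause in the trace above is the "etcdserver: request is too large" response: etcd rejects write requests above a size limit (roughly 1.5 MiB by default), and the 12 -> 13 props migration commits its rollback data as a single object on the /v1/rollback endpoint. A hedged way to gauge how much data that rollback object would have to carry (the resource name is an assumption based on the internal CRD naming seen in this thread):

# Total size of the stored props containers; if this is anywhere near etcd's
# request limit, a single rollback object cannot hold it.
kubectl get propscontainers.internal.linstor.linbit.com -oyaml | wc -c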
Right, thanks, we are already aware of this issue and are working on a fix for it.
Done
Hello again!
We tried to investigate a bit more into this issue and figured out that both errors (the "Uncaught exception in k" as well as the current NullPointerException) might only be side effects of another error you might simply have missed. Since the NullPointerException also had the error number ...-00024, can you show us the previous error reports? In case the controller is up and running, you can simply provide an sos-report. Feel free to either post it here or send it to me again as an email.
I've sent the whole /var/log/linstor-controller folder to the email, is that good?
Thanks for the reports, and yes they were helpful. It looks like you have some issues with your network, as the first few ErrorReports state:
Error message: Network is unreachable
I agree that LINSTOR should handle this case better and not allow other components, such as SpaceTracking or the autoplacer, to run into NullPointerExceptions like those in your other ErrorReports, but for now you should investigate the connectivity issue to "fix" the problem. We will try to find a way to improve LINSTOR's error handling in this case.
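A hedged sketch for spotting the unreachable satellites from the controller (the deployment name is an assumption; "linstor node list" is the standard CLI call):

# Satellites the controller cannot reach show up as OFFLINE here:
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor node list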
Some satellites are not available; it's a big cluster... Are you saying that's the problem? Ok, trying with just a few satellites (if it will let me remove the rest...)
After I manually deleted all "unknown" nodes, I could mount the volumes, as this error went away:
Reported error:
===============
Category: RuntimeException
Class name: NullPointerException
Class canonical name: java.lang.NullPointerException
Generated at: Method 'listAvailableStorPools', Source file 'StorPoolFilter.java', Line #106
Error context:
Registration of resource 'pvc-eb9013ba-6125-4a25-b780-ada9a47b3954' on node rci-nrp-dtn-01.sdsu.edu failed due to an unknown exception.
Asynchronous stage backtrace:
Error has been observed at the following site(s):
|_ checkpoint ⇢ Place anywhere on node
Stack trace:
Call backtrace:
Method Native Class:Line number
listAvailableStorPools N com.linbit.linstor.core.apicallhandler.controller.autoplacer.StorPoolFilter:106
Suppressed exception 1 of 1:
===============
Category: RuntimeException
Class name: OnAssemblyException
Class canonical name: reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at: Method 'listAvailableStorPools', Source file 'StorPoolFilter.java', Line #106
Error message:
Error has been observed at the following site(s):
|_ checkpoint ⇢ Place anywhere on node
Stack trace:
Error context:
Registration of resource 'pvc-eb9013ba-6125-4a25-b780-ada9a47b3954' on node rci-nrp-dtn-01.sdsu.edu failed due to an unknown exception.
Call backtrace:
Method Native Class:Line number
listAvailableStorPools N com.linbit.linstor.core.apicallhandler.controller.autoplacer.StorPoolFilter:106
autoPlace N com.linbit.linstor.core.apicallhandler.controller.autoplacer.Autoplacer:74
placeAnywhereInTransaction N com.linbit.linstor.core.apicallhandler.controller.CtrlRscMakeAvailableApiCallHandler:699
lambda$placeAnywhere$9 N com.linbit.linstor.core.apicallhandler.controller.CtrlRscMakeAvailableApiCallHandler:554
doInScope N com.linbit.linstor.core.apicallhandler.ScopeRunner:149
lambda$fluxInScope$0 N com.linbit.linstor.core.apicallhandler.ScopeRunner:76
call N reactor.core.publisher.MonoCallable:91
trySubscribeScalarMap N reactor.core.publisher.FluxFlatMap:126
subscribeOrReturn N reactor.core.publisher.MonoFlatMapMany:49
subscribe N reactor.core.publisher.Flux:8343
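For reference, a hedged sketch of the manual cleanup described above ("linstor node lost" force-removes a node the controller can no longer reach; the node name here is one of the examples from this thread):

# Force-remove a permanently unreachable satellite:
kubectl -n piraeus-datastore exec deploy/linstor-controller -- \
  linstor node lost rci-nrp-gpu-03.sdsu.edu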
I can't delete the broken nodes permanently. Once I delete the "unknown" nodes, it works, but after that the operator re-adds them, even though I reduced the diskless satelliteset to just a few nodes.
This seems more like an operator issue, then. Are you sure you used the right label to limit the satellites? You need to set the nodeSelector in the LinstorCluster resource. The LinstorSatelliteConfiguration labels only tell the Operator which nodes the config should apply to.
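A hedged sketch of setting that selector (the LinstorCluster resource name and the node label are assumptions; the chosen label has to be applied to the intended nodes first):

# Limit satellites to nodes carrying the chosen label:
kubectl patch linstorcluster linstorcluster --type merge \
  -p '{"spec": {"nodeSelector": {"example.com/linstor-satellite": "true"}}}'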
I'm still having the above issue with the controller crashlooping once there are "Unknown" nodes in the cluster (and I can't delete those because of an error in the operator). Can at least the controller be fixed, please?
Controller logs: