Open kvaps opened 1 year ago
@ghernadi seems like a similar issue to https://github.com/LINBIT/linstor-server/issues/100: LINSTOR can't upgrade the DB due to the fact that we're using LUKS encryption
Not sure what happened here. I tried to create a LUKS resource in a k8s setup with 1.20.3 and upgraded afterwards to 1.21.0 without an issue. Care to send me a database dump so I can see if I can at least fix that for you?
Also, any idea what triggered this issue? I will try to create LUKS resources on an older version and migrate up to 1.21.0, but any additional information could help.
I managed to roll back to the latest backup, tried the upgrade again, and it succeeded. I sent the broken DB dump to @ghernadi for further analysis.
Thank you!
Thanks for the DB dump. The resource pvc-5bb88aa7-a018-4428-87c3-afb05e2cb5d5 on b-hv-2 fails to load (it is the same resource as in #350, but the issue in #350 is on the node b-hv-3, not b-hv-2 as here). The problem is that the resource on b-hv-2 was in the middle of a toggle-disk (to diskless) process when the controller crashed / was shut down. (This could be part of the cleanup process of auto-diskful.)
LINSTOR apparently has some issues with loading data from the database in such a toggle-disk state, especially with a non-default layer setup (the default being drbd,storage) - in your case that is drbd,luks,storage. I think I have a fix for that, but I would like to test it some more.
@ghernadi thanks for the guidance, here is my journey of fixing this issue.
A short disclaimer in case anyone is thinking of repeating any of my steps: PLEASE DO IT AT YOUR OWN RISK
First, I made a backup of all linstor resources:
kubectl get crds | grep -o ".*.internal.linstor.linbit.com" | xargs kubectl get crds -ojson > crds.json
kubectl get crds | grep -o ".*.internal.linstor.linbit.com" | xargs -i{} sh -xc "kubectl get {} -ojson > {}.json"
Next I found all the resources in weird states:
A flag value of 0 means diskful; diskless resources have either 260 (diskless + drbd_diskless) or 388 (diskless + drbd_diskless + tiebreaker). If you check the flags in binary, 388 would be 1 1000 0100. Flags matching the pattern 1 1xxx x100 could also be set, which are some of the toggle-disk states.
cat resources.internal.linstor.linbit.com.json | jq '.items[] | select(.spec.resource_flags!=0 and .spec.resource_flags!=260 and .spec.resource_flags!=388) | "\(.spec.resource_name) \(.spec.node_name) \(.spec.resource_flags)"' -r > list.txt
Finally I got a list.txt of resources and their flags:
PVC-35C23274-3F53-47AD-ACC8-B01AD2E7E2A0 A-HV-1 527844
PVC-F9AB4C0F-0B8C-43EB-89DC-E5DBD84FE5E6 C-HV-4 2048
PVC-4E76FCFF-3293-4A5A-945F-7B4179DF74BE A-HV-1 265700
PVC-5BB88AA7-A018-4428-87C3-AFB05E2CB5D5 B-HV-3 263524
PVC-1DD492B4-9A89-4581-B20F-178DCE17200B B-HV-3 265572
PVC-7EE791AA-A155-4744-ABB8-DF1486A1F879 A-HV-1 789988
PVC-7CF9436F-497F-4BEC-8F6D-94708FFD5090 C-HV-5 787456
PVC-4E76FCFF-3293-4A5A-945F-7B4179DF74BE C-HV-4 263168
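To make those flag values easier to eyeball, here is a small helper of my own (not part of the original procedure) that prints a decimal resource_flags value in binary, so patterns like 1 1xxx x100 stand out:

```shell
# to_bin: convert a decimal flags value to its binary representation.
to_bin() {
  n=$1
  out=""
  while [ "$n" -gt 0 ]; do
    out=$((n % 2))$out
    n=$((n / 2))
  done
  echo "${out:-0}"
}

to_bin 388     # diskless + drbd_diskless + tiebreaker -> 110000100
to_bin 263524  # one of the flag values from the list above
```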
Then I decided to check whether they are really diskful. Obviously, they must have some storage layers:
while read res node flags; do
  cat layerresourceids.internal.linstor.linbit.com.json \
    | jq -c '.items[] | select(.spec.resource_name==$res and .spec.node_name==$node)' --arg res "$res" --arg node "$node" \
    | grep -q '{' && echo "$res $node diskful" || echo "$res $node diskless"
done < list.txt
Luckily, all of them turned out to be diskful:
PVC-35C23274-3F53-47AD-ACC8-B01AD2E7E2A0 A-HV-1 diskful
PVC-F9AB4C0F-0B8C-43EB-89DC-E5DBD84FE5E6 C-HV-4 diskful
PVC-4E76FCFF-3293-4A5A-945F-7B4179DF74BE A-HV-1 diskful
PVC-5BB88AA7-A018-4428-87C3-AFB05E2CB5D5 B-HV-3 diskful
PVC-1DD492B4-9A89-4581-B20F-178DCE17200B B-HV-3 diskful
PVC-7EE791AA-A155-4744-ABB8-DF1486A1F879 A-HV-1 diskful
PVC-7CF9436F-497F-4BEC-8F6D-94708FFD5090 C-HV-5 diskful
PVC-4E76FCFF-3293-4A5A-945F-7B4179DF74BE C-HV-4 diskful
Okay, then I prepared a fix to change their flags to 0:
while read res node flags; do
  cat resources.internal.linstor.linbit.com.json \
    | jq -r '.items[] | select(.spec.resource_name==$res and .spec.node_name==$node) | .spec.resource_flags=0' --arg res "$res" --arg node "$node"
done < list.txt > fix.json
Applied it:
kubectl delete -f fix.json
kubectl apply -f fix.json
I started the controller and, hooray, it loaded successfully. However, some of these resources were still stuck in the Unknown state:
linstor --controllers 127.0.0.1 r l | grep Unknown
| pvc-1dd492b4-9a89-4581-b20f-178dce17200b | b-hv-3 | 7021 | | | Unknown | 2023-03-09 08:35:22 |
| pvc-4e76fcff-3293-4a5a-945f-7b4179df74be | a-hv-1 | 7022 | | | Unknown | 2023-03-09 08:36:08 |
| pvc-5bb88aa7-a018-4428-87c3-afb05e2cb5d5 | b-hv-3 | 7023 | | | Unknown | 2023-04-11 15:47:30 |
| pvc-7ee791aa-a155-4744-abb8-df1486a1f879 | a-hv-1 | 7017 | Unused | | Unknown | 2023-03-09 08:19:03 |
| pvc-35c23274-3f53-47ad-acc8-b01ad2e7e2a0 | a-hv-1 | 7024 | Unused | | Unknown | 2023-03-09 08:36:08 |
There was no way to delete them, even after a linstor-satellite restart:
# linstor r d b-hv-3 pvc-1dd492b4-9a89-4581-b20f-178dce17200b
INFO:
Resource-definition property 'DrbdOptions/Resource/quorum' updated from 'off' to 'majority' by auto-quorum
INFO:
Resource-definition property 'DrbdOptions/Resource/on-no-quorum' updated from 'off' to 'suspend-io' by auto-quorum
SUCCESS:
Description:
Node: b-hv-3, Resource: pvc-1dd492b4-9a89-4581-b20f-178dce17200b preparing for deletion.
Details:
Node: b-hv-3, Resource: pvc-1dd492b4-9a89-4581-b20f-178dce17200b UUID is: 55c99c65-2c7e-435b-ba08-f1a6b68c345f
ERROR:
(Node: 'b-hv-3') An unknown exception occurred while processing the resource pvc-1dd492b4-9a89-4581-b20f-178dce17200b
Show reports:
linstor error-reports show 64380C9B-9841A-000074
SUCCESS:
Preparing deletion of resource on 'madison-db-1'
SUCCESS:
Preparing deletion of resource on 'a-hv-1'
ERROR:
Description:
Deletion of resource 'pvc-1dd492b4-9a89-4581-b20f-178dce17200b' on node 'b-hv-3' failed due to an unknown exception.
Details:
Node: b-hv-3, Resource: pvc-1dd492b4-9a89-4581-b20f-178dce17200b
Show reports:
linstor error-reports show 64386E36-00000-000000
Thus I decided to find and remove all of them from the database.
This short script generates the names of all objects related to a given resource in the Kubernetes CRs:
#!/bin/bash
res=pvc-1dd492b4-9a89-4581-b20f-178dce17200b
node=b-hv-3
cat *.internal.linstor.linbit.com.json \
  | jq -c '.items[] | select(.spec.resource_name==$res and .spec.node_name==$node)' --arg res "${res^^}" --arg node "${node^^}" \
  | jq -r '"\(.kind).\(.apiVersion|split("/")[0])/\(.metadata.name)"'
Example output:
LayerResourceIds.internal.linstor.linbit.com/284de502c9847342318c17d474733ef468fbdbe252cddf6e4b4be0676706d9d0
LayerResourceIds.internal.linstor.linbit.com/68519a9eca55c68c72658a2a1716aac3788c289859d46d6f5c3f14760fa37c9e
LayerResourceIds.internal.linstor.linbit.com/734d0759cdb4e0d0a35e4fd73749aee287e4fdcc8648b71a8d6ed591b7d4cb3f
Resources.internal.linstor.linbit.com/a05c4b07d52a83fb69482d51df83399adc7eceb3824f4c79e9d097c14e63ef36
Volumes.internal.linstor.linbit.com/06eef302015d08b2095174dc1d701a6d6131caac77be96a8c5fd99505214d96f
so they can simply be removed using the kubectl delete command.
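For example, the generated names can be piped straight into kubectl delete. The object names below are just the sample output from above (in a real cleanup you would feed in the script's actual output), and the echo makes this a dry run that only prints the commands for review:

```shell
# Sample object names taken from the example output above; in practice,
# pipe the real output of the script instead.
objects="Resources.internal.linstor.linbit.com/a05c4b07d52a83fb69482d51df83399adc7eceb3824f4c79e9d097c14e63ef36
Volumes.internal.linstor.linbit.com/06eef302015d08b2095174dc1d701a6d6131caac77be96a8c5fd99505214d96f"

# Dry run: prints one `kubectl delete <name>` command per line.
# Drop the word `echo` to actually perform the deletions.
printf '%s\n' "$objects" | xargs -n1 echo kubectl delete
```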
When I started the controller after that, all unwanted resources were gone :tada: