LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0
984 stars 76 forks source link

Linstor controller can't start #357

Closed duckhawk closed 1 year ago

duckhawk commented 1 year ago

Some time after enabling backup (I'm not sure that it cause problem, but only action was enabling backups) linstor controller stopped, and now can't start at all.

In controller logs there is database initialization error

root@linstor-controller-5dc46c66f-lpxpv:/# piraeus-entry.sh startController
Importing keystore /tmp/tmp.OEuANOZW6m to /etc/linstor/ssl/keystore.jks...
Entry for alias linstor successfully imported.
Import command completed:  1 entries successfully imported, 0 entries failed or cancelled
Certificate was added to keystore
Importing keystore /tmp/tmp.kvDqZc5i0n to /etc/linstor/https/keystore.jks...
Entry for alias linstor successfully imported.
Import command completed:  1 entries successfully imported, 0 entries failed or cancelled
Certificate was added to keystore
LINSTOR, Module Controller
Version:            1.21.0 (b44bb8d41f264ac1089d9a0a1c540d3cc703d7e8)
Build time:         2023-04-04T10:11:03+00:00
Java Version:       11
Java VM:            Ubuntu, Version 11.0.18+10-post-Ubuntu-0ubuntu122.04
Operating system:   Linux, Version 5.15.0-60-generic
Environment:        amd64, 4 processors, 2994 MiB memory reserved for allocations

System components initialization in progress

Loading configuration file "/etc/linstor/linstor.toml"
09:06:45.226 [main] INFO  LINSTOR/Controller - SYSTEM - ErrorReporter DB first time init.
09:06:45.229 [main] INFO  LINSTOR/Controller - SYSTEM - Log directory set to: '/var/log/linstor-controller'
09:06:45.264 [main] INFO  LINSTOR/Controller - SYSTEM - Database type is Kubernetes-CRD
09:06:45.264 [Main] INFO  LINSTOR/Controller - SYSTEM - Loading API classes started.
09:06:45.811 [Main] INFO  LINSTOR/Controller - SYSTEM - API classes loading finished: 546ms
09:06:45.811 [Main] INFO  LINSTOR/Controller - SYSTEM - Dependency injection started.
09:06:45.825 [Main] INFO  LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
09:06:45.825 [Main] INFO  LINSTOR/Controller - SYSTEM - Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
09:06:45.825 [Main] INFO  LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
09:06:45.836 [Main] INFO  LINSTOR/Controller - SYSTEM - Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
09:06:45.836 [Main] INFO  LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule"
09:06:45.836 [Main] INFO  LINSTOR/Controller - SYSTEM - Extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule" is not installed
09:06:46.842 [Main] INFO  LINSTOR/Controller - SYSTEM - Dependency injection finished: 1030ms
09:06:46.843 [Main] INFO  LINSTOR/Controller - SYSTEM - Cryptography provider: Using default cryptography module
09:06:47.171 [Main] INFO  LINSTOR/Controller - SYSTEM - Initializing authentication subsystem
09:06:47.514 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'TimerEventService' of type TimerEventService
09:06:47.515 [Main] INFO  LINSTOR/Controller - SYSTEM - Initializing the k8s crd database connector
09:06:47.516 [Main] INFO  LINSTOR/Controller - SYSTEM - Kubernetes-CRD connection URL is "k8s"
09:06:58.410 [Main] ERROR LINSTOR/Controller - SYSTEM - Database initialization error [Report number 6461F624-00000-000000]

09:06:58.415 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutdown in progress
09:06:58.420 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutting down service instance 'EbsStatusPoll' of type EbsStatusPoll
09:06:58.421 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Waiting for service instance 'EbsStatusPoll' to complete shutdown
09:06:58.421 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutting down service instance 'ScheduleBackupService' of type ScheduleBackupService
09:06:58.422 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Waiting for service instance 'ScheduleBackupService' to complete shutdown
09:06:58.424 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutting down service instance 'TaskScheduleService' of type TaskScheduleService
09:06:58.424 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Waiting for service instance 'TaskScheduleService' to complete shutdown
09:06:58.424 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutting down service instance 'K8sCrdDatabaseService' of type K8sCrdDatabaseService
09:06:58.432 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Waiting for service instance 'K8sCrdDatabaseService' to complete shutdown
09:06:58.432 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutting down service instance 'TimerEventService' of type TimerEventService
09:06:58.432 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Waiting for service instance 'TimerEventService' to complete shutdown
09:06:58.433 [Thread-2] INFO  LINSTOR/Controller - SYSTEM - Shutdown complete

In error report strange problem with UIDs

ERROR REPORT 6461F624-00000-000000

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Controller
Version:                            1.21.0
Build ID:                           b44bb8d41f264ac1089d9a0a1c540d3cc703d7e8
Build time:                         2023-04-04T10:11:03+00:00
Error time:                         2023-05-15 09:06:58
Node:                               linstor-controller-5dc46c66f-lpxpv

============================================================

Reported error:
===============

Description:
    Database initialization error

Category:                           LinStorException
Class name:                         SystemServiceStartException
Class canonical name:               com.linbit.SystemServiceStartException
Generated at:                       Method 'initialize', Source file 'DbK8sCrdInitializer.java', Line #47

Error message:                      Database initialization error

Call backtrace:

    Method                                   Native Class:Line number
    initialize                               N      com.linbit.linstor.dbcp.k8s.crd.DbK8sCrdInitializer:47
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:365
    main                                     N      com.linbit.linstor.core.Controller:613

Caused by:
==========

Category:                           RuntimeException
Class name:                         KubernetesClientException
Class canonical name:               io.fabric8.kubernetes.client.KubernetesClientException
Generated at:                       Method 'requestFailure', Source file 'OperationSupport.java', Line #682

Error message:                      Failure executing: PUT at: https://10.222.0.1/apis/internal.linstor.linbit.com/v1-19-1/layerdrbdresources/93c3755d0c9030cd90f0e6eae6870a8497730f89b7fd0cf231adef048d2b524d. Message: Operation cannot be fulfilled on layerdrbdresources.internal.linstor.linbit.com "93c3755d0c9030cd90f0e6eae6870a8497730f89b7fd0cf231adef048d2b524d": StorageError: invalid object, Code: 4, Key: /registry/internal.linstor.linbit.com/layerdrbdresources/93c3755d0c9030cd90f0e6eae6870a8497730f89b7fd0cf231adef048d2b524d, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 665edfb6-0010-4490-bcdd-473157cef007, UID in object meta: 3b17a3a5-196b-4ae0-bd7f-6dc164fea944. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=internal.linstor.linbit.com, kind=layerdrbdresources, name=93c3755d0c9030cd90f0e6eae6870a8497730f89b7fd0cf231adef048d2b524d, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on layerdrbdresources.internal.linstor.linbit.com "93c3755d0c9030cd90f0e6eae6870a8497730f89b7fd0cf231adef048d2b524d": StorageError: invalid object, Code: 4, Key: /registry/internal.linstor.linbit.com/layerdrbdresources/93c3755d0c9030cd90f0e6eae6870a8497730f89b7fd0cf231adef048d2b524d, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 665edfb6-0010-4490-bcdd-473157cef007, UID in object meta: 3b17a3a5-196b-4ae0-bd7f-6dc164fea944, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).

Call backtrace:

    Method                                   Native Class:Line number
    requestFailure                           N      io.fabric8.kubernetes.client.dsl.base.OperationSupport:682
    requestFailure                           N      io.fabric8.kubernetes.client.dsl.base.OperationSupport:661
    assertResponseCode                       N      io.fabric8.kubernetes.client.dsl.base.OperationSupport:612
    handleResponse                           N      io.fabric8.kubernetes.client.dsl.base.OperationSupport:555
    handleResponse                           N      io.fabric8.kubernetes.client.dsl.base.OperationSupport:518
    handleUpdate                             N      io.fabric8.kubernetes.client.dsl.base.OperationSupport:342
    handleUpdate                             N      io.fabric8.kubernetes.client.dsl.base.OperationSupport:322
    handleUpdate                             N      io.fabric8.kubernetes.client.dsl.base.BaseOperation:649
    lambda$replace$1                         N      io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation:195
    replace                                  N      io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation:200
    replace                                  N      io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation:141
    replace                                  N      io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper:69
    createOrReplace                          N      io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper:61
    createOrReplace                          N      io.fabric8.kubernetes.client.dsl.base.BaseOperation:318
    createOrReplace                          N      io.fabric8.kubernetes.client.dsl.base.BaseOperation:83
    createOrReplace                          N      io.fabric8.kubernetes.client.dsl.base.BaseOperation:308
    createOrReplace                          N      io.fabric8.kubernetes.client.dsl.base.BaseOperation:83
    createOrReplace                          N      com.linbit.linstor.dbdrivers.k8s.K8sCachingClient:89
    createOrReplace                          N      com.linbit.linstor.dbdrivers.k8s.K8sCachingClient:22
    restoreData                              N      com.linbit.linstor.transaction.ControllerK8sCrdRollbackMgr:187
    rollback                                 N      com.linbit.linstor.transaction.ControllerK8sCrdRollbackMgr:147
    rollback                                 N      com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:219
    rollbackIfNeeded                         N      com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:243
    migrate                                  N      com.linbit.linstor.dbcp.k8s.crd.DbK8sCrd:112
    initialize                               N      com.linbit.linstor.dbcp.k8s.crd.DbK8sCrdInitializer:43
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:365
    main                                     N      com.linbit.linstor.core.Controller:613

END OF ERROR REPORT.
duckhawk commented 1 year ago

@ghernadi sorry for mention, Andrew Kvapil (kvaps) said, that you can be interested in this problem

duckhawk commented 1 year ago

Looks like problem is because controller restart due snapshot creating. It creates rollback.internal, then restarted, and after restart I have crashloopbackoff permanently. After applying rollback.internal - DB state goes to normal. Is it possible to add rollback apply if it exists on linstor-controller start?

ghernadi commented 1 year ago

Sorry for the delay, but it took me some time to reproduce this issue.

You mentioned that you enabled backup. Does that also include restoring from a backup once? Because that is the only way I was able to reproduce this issue...

A bit background info: At the beginning of a transaction Linstor creates a rollback object (which you apparently found). That rollback object contains the "to be changed" instances of other objects - including their UIDs from K8s. If anything goes wrong during the commit phase of the transaction, the corresponding k8s instances entries are replaced with the data from rollback entries. In theory that works well.
However, during my database-tests I like to get the database in a specific state, create a backup from that state and clean + restore the database from that backup for every test run to verify my fix is really dealing with the issue properly. The point is, that this "clear + restore" step is implemented by my script in a way to "delete all k8s entries and (re-) apply the entries from my backup". This last re-applying the data causes however K8s to generate new UIDs for the objects. When Linstor then tries to rollback the instances, the entries from the rollback refer to the old UID, causing the mismatch of the shown error.

The reason I went this much into detail here is to make sure that you also had a similar situation. If not, I'd still be interested in how those UID could have changed.

Just to be clear, I am only asking if you executed a restore - not that some error had occured that made restoring necessary. It is quite possible that you created the backup during such a commit phase, which succeeded in the end, but starting the controller with such a restored database would have caused a rollback, also triggering this issue.

If I am right and you did restore the database similar than I did, I have a fix ready for the next release.

duckhawk commented 1 year ago

Yes, looks like problem is solved in 1.24, thank you