StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0

FE crash - "failed to get DB names for 1 times!Got RestartRequiredException" #39871

Closed milletnis closed 4 days ago

milletnis commented 7 months ago

After a couple of days, one FE crashed with the following log. The error message suggests this exception is by design. Is that correct?

2024-01-24 01:30:33,965 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [GlobalStateMgr.loadImage():1580] Success load StarRocks meta block 20 from image
2024-01-24 01:30:33,983 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [GlobalStateMgr.loadImage():1580] Success load StarRocks meta block 21 from image
2024-01-24 01:30:33,983 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [CatalogMgr.loadResourceMappingCatalog():336] start to replay resource mapping catalog
2024-01-24 01:30:33,984 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [CatalogMgr.loadResourceMappingCatalog():355] finished replaying resource mapping catalogs from resources
2024-01-24 01:30:33,984 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [GlobalStateMgr.loadImage():1580] Success load StarRocks meta block 22 from image
2024-01-24 01:30:33,988 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [GlobalStateMgr.loadImage():1580] Success load StarRocks meta block 23 from image
2024-01-24 01:30:33,990 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [GlobalStateMgr.loadImage():1580] Success load StarRocks meta block 24 from image
2024-01-24 01:30:33,991 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [GlobalStateMgr.loadImage():1580] Success load StarRocks meta block 25 from image
2024-01-24 01:30:33,991 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [GlobalStateMgr.loadImage():1580] Success load StarRocks meta block 26 from image
2024-01-24 01:30:33,991 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [GlobalStateMgr.loadImage():1580] Success load StarRocks meta block 27 from image
2024-01-24 01:30:33,999 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [GlobalStateMgr.loadImage():1580] Success load StarRocks meta block 28 from image
2024-01-24 01:30:34,007 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [GlobalStateMgr.processMvRelatedMeta():1710] finish processing all tables' related materialized views in 6ms
2024-01-24 01:30:34,007 INFO (UNKNOWN starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(-1)|1) [GlobalStateMgr.loadImage():1690] finished to load image in 20765 ms
2024-01-24 01:30:34,014 INFO (stateChangeExecutor|72) [StateChangeExecutor.runOneCycle():85] begin to transfer FE type from INIT to UNKNOWN
2024-01-24 01:30:34,015 INFO (stateChangeExecutor|72) [StateChangeExecutor.runOneCycle():179] finished to transfer FE type from INIT to UNKNOWN
2024-01-24 01:30:34,015 INFO (stateChangeExecutor|72) [StateChangeExecutor.runOneCycle():85] begin to transfer FE type from INIT to FOLLOWER
2024-01-24 01:30:34,024 INFO (replayer|83) [GlobalStateMgr$5.runOneCycle():2189] start to replay from 70243199
2024-01-24 01:30:34,026 WARN (replayer|83) [BDBJournalCursor.wrapDatabaseException():85] failed to get DB names for 1 times!Got RestartRequiredException, will exit.
com.sleepycat.je.rep.RollbackException: (JE 18.3.16) Environment must be closed, caused by: com.sleepycat.je.rep.RollbackException: Environment invalid because of previous exception: (JE 18.3.16) starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb Node starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb must rollback 2 total commits(1 of which were durable) to the earliest point indicated by transaction id=-60820067 time=2024-01-24 01:20:58.389 vlsn=129,934,919 lsn=0x1bfc/0xf6aa3 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated.  Log files were truncated to file 0x7164, offset 0x1009545, vlsn 129,934,917 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2) Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2) Environment invalid because of previous exception: (JE 18.3.16) starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb Node starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb must rollback 2 total commits(1 of which were durable) to the earliest point indicated by transaction id=-60820067 time=2024-01-24 01:20:58.389 vlsn=129,934,919 lsn=0x1bfc/0xf6aa3 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated.  
Log files were truncated to file 0x7164, offset 0x1009545, vlsn 129,934,917 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2) Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2)
    at com.sleepycat.je.rep.RollbackException.wrapSelf(RollbackException.java:146) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.RollbackException.wrapSelf(RollbackException.java:62) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1835) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1844) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.Environment.checkOpen(Environment.java:2697) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.Environment.getDatabaseNames(Environment.java:2455) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.starrocks.journal.bdbje.BDBEnvironment.getDatabaseNamesWithPrefix(BDBEnvironment.java:478) ~[starrocks-fe.jar:?]
    at com.starrocks.journal.bdbje.BDBJournalCursor.refresh(BDBJournalCursor.java:177) ~[starrocks-fe.jar:?]
    at com.starrocks.journal.bdbje.BDBJournalCursor.getJournalCursor(BDBJournalCursor.java:126) ~[starrocks-fe.jar:?]
    at com.starrocks.journal.bdbje.BDBJEJournal.read(BDBJEJournal.java:137) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr$5.runOneCycle(GlobalStateMgr.java:2190) ~[starrocks-fe.jar:?]
    at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr$5.run(GlobalStateMgr.java:2260) ~[starrocks-fe.jar:?]
Caused by: com.sleepycat.je.rep.RollbackException: Environment invalid because of previous exception: (JE 18.3.16) starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb Node starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb must rollback 2 total commits(1 of which were durable) to the earliest point indicated by transaction id=-60820067 time=2024-01-24 01:20:58.389 vlsn=129,934,919 lsn=0x1bfc/0xf6aa3 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated.  Log files were truncated to file 0x7164, offset 0x1009545, vlsn 129,934,917 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2) Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2)
    at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.setupHardRecovery(ReplicaFeederSyncup.java:721) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.verifyRollback(ReplicaFeederSyncup.java:417) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.execute(ReplicaFeederSyncup.java:164) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:732) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:485) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:412) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1869) ~[starrocks-bdb-je-18.3.16.jar:?]
2024-01-24 01:30:34,105 WARN (replayer|83) [GlobalStateMgr$5.runOneCycle():2198] got interrupt exception or inconsistent exception when replay journal 70243200, will exit,
com.starrocks.journal.JournalInconsistentException: failed to get DB names for 1 times!Got RestartRequiredException, will exit.
    at com.starrocks.journal.bdbje.BDBJournalCursor.wrapDatabaseException(BDBJournalCursor.java:91) ~[starrocks-fe.jar:?]
    at com.starrocks.journal.bdbje.BDBJournalCursor.refresh(BDBJournalCursor.java:181) ~[starrocks-fe.jar:?]
    at com.starrocks.journal.bdbje.BDBJournalCursor.getJournalCursor(BDBJournalCursor.java:126) ~[starrocks-fe.jar:?]
    at com.starrocks.journal.bdbje.BDBJEJournal.read(BDBJEJournal.java:137) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr$5.runOneCycle(GlobalStateMgr.java:2190) ~[starrocks-fe.jar:?]
    at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
    at com.starrocks.server.GlobalStateMgr$5.run(GlobalStateMgr.java:2260) ~[starrocks-fe.jar:?]
Caused by: com.sleepycat.je.rep.RollbackException: (JE 18.3.16) Environment must be closed, caused by: com.sleepycat.je.rep.RollbackException: Environment invalid because of previous exception: (JE 18.3.16) starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb Node starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb must rollback 2 total commits(1 of which were durable) to the earliest point indicated by transaction id=-60820067 time=2024-01-24 01:20:58.389 vlsn=129,934,919 lsn=0x1bfc/0xf6aa3 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated.  Log files were truncated to file 0x7164, offset 0x1009545, vlsn 129,934,917 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2) Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2) Environment invalid because of previous exception: (JE 18.3.16) starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb Node starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb must rollback 2 total commits(1 of which were durable) to the earliest point indicated by transaction id=-60820067 time=2024-01-24 01:20:58.389 vlsn=129,934,919 lsn=0x1bfc/0xf6aa3 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated.  
Log files were truncated to file 0x7164, offset 0x1009545, vlsn 129,934,917 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2) Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2)
    at com.sleepycat.je.rep.RollbackException.wrapSelf(RollbackException.java:146) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.RollbackException.wrapSelf(RollbackException.java:62) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1835) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1844) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.Environment.checkOpen(Environment.java:2697) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.Environment.getDatabaseNames(Environment.java:2455) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.starrocks.journal.bdbje.BDBEnvironment.getDatabaseNamesWithPrefix(BDBEnvironment.java:478) ~[starrocks-fe.jar:?]
    at com.starrocks.journal.bdbje.BDBJournalCursor.refresh(BDBJournalCursor.java:177) ~[starrocks-fe.jar:?]
    ... 5 more
Caused by: com.sleepycat.je.rep.RollbackException: Environment invalid because of previous exception: (JE 18.3.16) starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb Node starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2):/opt/starrocks/fe/meta/bdb must rollback 2 total commits(1 of which were durable) to the earliest point indicated by transaction id=-60820067 time=2024-01-24 01:20:58.389 vlsn=129,934,919 lsn=0x1bfc/0xf6aa3 durable=false in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated.  Log files were truncated to file 0x7164, offset 0x1009545, vlsn 129,934,917 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2) Originally thrown by HA thread: REPLICA starrocks-global-fe-1.starrocks-global-fe-search.starrocks-fra1-prod.svc.cluster.local_9010_1705090426631(2)
    at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.setupHardRecovery(ReplicaFeederSyncup.java:721) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.verifyRollback(ReplicaFeederSyncup.java:417) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.execute(ReplicaFeederSyncup.java:164) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:732) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:485) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:412) ~[starrocks-bdb-je-18.3.16.jar:?]
    at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1869) ~[starrocks-bdb-je-18.3.16.jar:?]
[2024-01-24 01:30:34] failed to get DB names for 1 times!Got RestartRequiredException, will exit.

StarRocks version (Required)

Minnn0312 commented 7 months ago

We ran into this issue with the FrontEnd pod going into CrashLoopBackOff during operation. The system is stuck and the problem has not been resolved.

gengjun-git commented 6 months ago

For the com.sleepycat.je.rep.RollbackException, you can just start the FE again.
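Before restarting, it can help to see what the node says it has to roll back. The sketch below is only a triage helper and not part of StarRocks or BDB JE: it pulls the commit counts, timestamp, and VLSN out of the RollbackException message shown in the log above. The function name `parse_rollback` and the regular expression are assumptions based on that one message, and other JE versions may word it differently. The node name in the sample is shortened to `fe-1` for readability.

```python
import re

# Pattern derived from the message text in the log above (JE 18.3.16).
# This is an assumption: the wording is not a stable, documented format.
ROLLBACK_RE = re.compile(
    r"must rollback (?P<total>\d+) total commits"
    r"\((?P<durable>\d+) of which were durable\)"
    r".*?time=(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
    r" vlsn=(?P<vlsn>[\d,]+)"
)


def parse_rollback(message):
    """Return the rollback details found in a log message, or None."""
    m = ROLLBACK_RE.search(message)
    if m is None:
        return None
    return {
        "total_commits": int(m.group("total")),
        "durable_commits": int(m.group("durable")),
        "rollback_time": m.group("time"),
        # The log prints the VLSN with thousands separators.
        "vlsn": int(m.group("vlsn").replace(",", "")),
    }


# Excerpt of the message from the log above, with the node name shortened.
sample = (
    "Node fe-1:/opt/starrocks/fe/meta/bdb must rollback 2 total commits"
    "(1 of which were durable) to the earliest point indicated by "
    "transaction id=-60820067 time=2024-01-24 01:20:58.389 "
    "vlsn=129,934,919 lsn=0x1bfc/0xf6aa3 durable=false"
)
print(parse_rollback(sample))
```

For the message in this issue, that gives 2 commits to roll back, 1 of them durable, at VLSN 129934919.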

Minnn0312 commented 6 months ago

I resolved this error by increasing the startupProbe and livenessProbe times in the FrontEnd StatefulSet.
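As a rough way to size those probe values, here is a small sketch. It assumes the usual Kubernetes rule of thumb that a startup probe allows about `initialDelaySeconds + failureThreshold * periodSeconds` before the container is restarted. The helper names and the safety factor of 2 are made up for illustration. The 20765 ms figure is the image load time from the log above, which is only one part of FE startup (journal replay comes after it), so treat the result as a lower bound.

```python
import math


def startup_budget_seconds(failure_threshold, period_seconds,
                           initial_delay_seconds=0):
    """Approximate time a startup probe allows before a restart."""
    return initial_delay_seconds + failure_threshold * period_seconds


def min_failure_threshold(startup_seconds, period_seconds, safety_factor=2.0):
    """Smallest failureThreshold covering the startup time with headroom.

    safety_factor is an arbitrary illustrative margin, not a recommendation.
    """
    return math.ceil(startup_seconds * safety_factor / period_seconds)


# "finished to load image in 20765 ms" from the log above.
image_load_seconds = 20765 / 1000

threshold = min_failure_threshold(image_load_seconds, period_seconds=5)
budget = startup_budget_seconds(threshold, period_seconds=5)
print(threshold, budget)  # prints "9 45"
```

With a 5 second period this suggests a failureThreshold of at least 9, for a budget of about 45 seconds. A cluster with a larger image or a long journal to replay would need more.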

github-actions[bot] commented 2 weeks ago

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!