[Bug] FE is not starting

Search before asking

[X] I had searched in the issues and found no similar issues.

Version

2.1.2

What's Wrong?

All FE pods are crashing during edit log replay with the following error.


2024-07-02 13:13:25,802 WARN (replayer|87) [Backend.handleHbResponse():731] Backend [id=10057, host=test-be-2.test-be-internal.doris.svc.cluster.local, heartbeatPort=9050, alive=false, lastStartTime=2024-07-02 07:30:41, process epoch=1719905441277, tags: {location=default}] is dead,
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10176, BackendId=10057, version=638, dataSize=30327, rowCount=54, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10200, BackendId=10057, version=638, dataSize=33344, rowCount=112, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10202, BackendId=10057, version=638, dataSize=19125, rowCount=58, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10203, BackendId=10057, version=638, dataSize=32002, rowCount=80, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10177, BackendId=10057, version=638, dataSize=35468, rowCount=69, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10207, BackendId=10057, version=638, dataSize=34593, rowCount=56, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10178, BackendId=10057, version=638, dataSize=30804, rowCount=65, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 1831133, label: label_f42ab2bb810b4755_aeb962138405d8eb, db id: 10002, table id list: 10114, callback id: -1, coordinator: FE: test-fe-1.test-fe-internal.doris.svc.cluster.local, transaction status: COMMITTED, error replicas num: 7, replica ids: 10176,10177,10178,10200,10202, prepare time: 1719911001529, commit time: 1719911002419, finish time: -1, reason: 
2024-07-02 13:13:25,803 INFO (replayer|87) [OlapTable.updateVisibleVersionAndTime():2591] updateVisibleVersionAndTime, tableName: column_statistics, visibleVersion, 639, visibleVersionTime: 1719911003747
2024-07-02 13:13:25,803 INFO (replayer|87) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a VISIBLE transaction TransactionState. transaction id: 1831133, label: label_f42ab2bb810b4755_aeb962138405d8eb, db id: 10002, table id list: 10114, callback id: -1, coordinator: FE: test-fe-1.test-fe-internal.doris.svc.cluster.local, transaction status: VISIBLE, error replicas num: 7, replica ids: 10176,10177,10178,10200,10202, prepare time: 1719911001529, commit time: 1719911002419, finish time: 1719911003747, reason: 
2024-07-02 13:13:25,803 INFO (replayer|87) [LoadManager.replayCreateLoadJob():191] LOAD_JOB=3576204, msg={replay create load job}
2024-07-02 13:13:25,804 INFO (replayer|87) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 1831134, label: label_6ca665869b794ec7_995bc126973a6e4d, db id: 10002, table id list: 10114, callback id: -1, coordinator: FE: test-fe-1.test-fe-internal.doris.svc.cluster.local, transaction status: COMMITTED, error replicas num: 7, replica ids: 10176,10177,10178,10200,10202, prepare time: 1719911005551, commit time: 1719911005819, finish time: -1, reason: 
2024-07-02 13:13:25,804 INFO (replayer|87) [OlapTable.updateVisibleVersionAndTime():2591] updateVisibleVersionAndTime, tableName: column_statistics, visibleVersion, 640, visibleVersionTime: 1719911006142
2024-07-02 13:13:25,804 INFO (replayer|87) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a VISIBLE transaction TransactionState. transaction id: 1831134, label: label_6ca665869b794ec7_995bc126973a6e4d, db id: 10002, table id list: 10114, callback id: -1, coordinator: FE: test-fe-1.test-fe-internal.doris.svc.cluster.local, transaction status: VISIBLE, error replicas num: 7, replica ids: 10176,10177,10178,10200,10202, prepare time: 1719911005551, commit time: 1719911005819, finish time: 1719911006142, reason: 
2024-07-02 13:13:25,804 INFO (replayer|87) [LoadManager.replayCreateLoadJob():191] LOAD_JOB=3576224, msg={replay create load job}
2024-07-02 13:13:25,804 INFO (replayer|87) [Env.setMaster():4125] setMaster MasterInfo:MasterInfo: host=test-fe-1.test-fe-internal.doris.svc.cluster.local httpPort=8030 rpcPort=9020
2024-07-02 13:13:25,805 INFO (replayer|87) [Backend.handleHbResponse():705] Backend [id=10057, host=test-be-2.test-be-internal.doris.svc.cluster.local, heartbeatPort=9050, alive=false, lastStartTime=2024-07-02 07:30:41, process epoch=1719905441277, tags: {location=default}] is back to alive, update start time from 2024-07-02 07:30:41 to 2024-07-02 09:04:31, update be epoch from 1719905441277 to 1719911071949.
2024-07-02 13:13:25,805 WARN (replayer|87) [Backend.handleHbResponse():731] Backend [id=10169, host=test-be-1.test-be-internal.doris.svc.cluster.local, heartbeatPort=9050, alive=false, lastStartTime=2024-07-02 05:13:41, process epoch=1719897221605, tags: {location=default}] is dead,
2024-07-02 13:13:25,805 INFO (replayer|87) [Backend.handleHbResponse():705] Backend [id=10169, host=test-be-1.test-be-internal.doris.svc.cluster.local, heartbeatPort=9050, alive=false, lastStartTime=2024-07-02 05:13:41, process epoch=1719897221605, tags: {location=default}] is back to alive, update start time from 2024-07-02 05:13:41 to 2024-07-02 09:06:16, update be epoch from 1719897221605 to 1719911176797.
2024-07-02 13:13:25,806 INFO (replayer|87) [Env.setMaster():4125] setMaster MasterInfo:MasterInfo: host=test-fe-1.test-fe-internal.doris.svc.cluster.local httpPort=8030 rpcPort=9020
2024-07-02 13:13:25,806 ERROR (replayer|87) [CatalogRecycleBin.replayErasePartition():572] replayErasePartition: partitionInfo is null for partitionId[13762]
2024-07-02 13:13:25,806 ERROR (replayer|87) [EditLog.loadJournal():1231] Operation Type 16
java.lang.NullPointerException: null
    at org.apache.doris.catalog.CatalogRecycleBin.replayErasePartition(CatalogRecycleBin.java:575) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.datasource.InternalCatalog.replayErasePartition(InternalCatalog.java:1813) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.catalog.Env.replayErasePartition(Env.java:3090) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.persist.EditLog.loadJournal(EditLog.java:289) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.catalog.Env.replayJournal(Env.java:2759) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.catalog.Env$4.runOneCycle(Env.java:2533) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.common.util.Daemon.run(Daemon.java:116) ~[doris-fe.jar:1.2-SNAPSHOT]
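
For anyone triaging a similar crash loop, here is a minimal Python sketch for pulling the ERROR entries out of an FE log. It assumes the log header format is exactly what the excerpt above shows (timestamp, level, thread, `[Class.method():line]`, message). The function names `find_errors` and `missing_partition_ids` are made up for illustration and are not part of Doris.

```python
import re

# Header format assumed from the excerpt above:
# "<date> <time>,<ms> <LEVEL> (<thread>|<id>) [<Class.method()>:<line>] <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"\((?P<thread>[^|)]+)\|\d+\) "
    r"\[(?P<source>[^\]]+?\(\)):(?P<line>\d+)\] "
    r"(?P<msg>.*)$"
)

# Matches the "partitionId[13762]" fragment in the replayErasePartition message.
PARTITION_ID = re.compile(r"partitionId\[(\d+)\]")


def find_errors(log_text):
    """Return (timestamp, thread, source, message) for every ERROR entry.

    Lines that do not match the header format (for example stack trace
    frames) are skipped.
    """
    errors = []
    for raw in log_text.splitlines():
        m = LOG_LINE.match(raw)
        if m and m.group("level") == "ERROR":
            errors.append(
                (m.group("ts"), m.group("thread"), m.group("source"), m.group("msg"))
            )
    return errors


def missing_partition_ids(errors):
    """Collect the partition ids named in ERROR messages, in order of appearance."""
    ids = []
    for _ts, _thread, _source, msg in errors:
        ids.extend(PARTITION_ID.findall(msg))
    return ids


if __name__ == "__main__":
    import sys

    # Usage: python fe_errors.py /path/to/fe.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        found = find_errors(fh.read())
    for ts, thread, source, msg in found:
        print(ts, thread, source, msg)
    print("partition ids:", missing_partition_ids(found))
```

Run against the log above, this should surface the two ERROR entries from the replayer thread and the partition id that `replayErasePartition` could not find, which is the part of the log that matters for this issue.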

What You Expected?

FE pods should be running.

How to Reproduce?

We have been operating a Doris cluster (2 FE and 3 BE) without any issues for a few weeks. We deployed it using selectdb/doris-operator. In the past we had some problems with BE pods when volumes filled up or compute resources ran short. Whenever those alerts fired, we increased the resource spec and the pods recovered.

This time I believe all resources are sufficient, but the FE pods are not coming up.

Anything Else?

Resource spec:

  be:
      limits:
        cpu: 4
        memory: 16Gi
      requests:
        cpu: 4
        memory: 16Gi
  fe:
      requests:
        cpu: 2
        memory: 8Gi
      limits:
        cpu: 2
        memory: 8Gi

Are you willing to submit PR?

[X] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

[Bug] FE is not starting #37177