apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

[Bug] Doris Operator cluster: FE does not start after restarting FE and BE #43751

Open ming12713 opened 3 days ago

ming12713 commented 3 days ago

Version

2.1.1

What's Wrong?

2024-11-12 08:05:14,342 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 3917246, label: nome_raw_dataKC_ods_vtc_nome_raw_data__KC_1KC_loshu_ods_vtc_nome_raw_dataKC_0__KC_1612525KC_1730487305321, db id: 11154, table id list: 74966, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1730487305400, commit time: 1730487308568, finish time: -1, reason:
/opt/apache-doris/fe/bin/start_fe.sh: line 265: 162 Killed ${LIMIT:+${LIMIT}} "${JAVA}" ${final_java_opt:+${final_java_opt}} -XX:-OmitStackTraceInFastThrow -XX:OnOutOfMemoryError="kill -9 %p" ${coverage_opt:+${coverage_opt}} org.apache.doris.DorisFE ${HELPER:+${HELPER}} ${OPT_VERSION:+${OPT_VERSION}} "${METADATA_FAILURE_RECOVERY}" "$@" < /dev/null

Doris was installed via the Operator with 1 FE node and 1 BE node. After restarting both the FE and BE nodes, the FE fails to start and reports the error shown above. The BE IP 10.42.1.19 in the error is the previous BE pod IP, not the SVC IP. The FE is configured to use the SVC (Service) method for service discovery, and the BE pod IP is now 10.42.1.6.

Pod network CIDR: 10.42.1.x/16

SVC network CIDR: 10.43.48.x
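To make the mismatch described above easier to check, here is a minimal Python sketch that pulls the coordinator BE IP out of a replay log line and reports whether it falls in the pod range or the Service range. The helper names (`coordinator_ip`, `classify`) are hypothetical. The network masks are assumptions: the pod range is read as 10.42.0.0/16, and the Service range is guessed to be 10.43.0.0/16 because the report does not state its prefix length.

```python
import ipaddress
import re

# Assumed ranges: the pod range is reported as 10.42.1.x/16;
# the /16 prefix for the Service range is a guess, not from the report.
POD_NET = ipaddress.ip_network("10.42.0.0/16")
SVC_NET = ipaddress.ip_network("10.43.0.0/16")

# Matches the "coordinator: BE: <ip>" field of the replayed TransactionState.
COORDINATOR_RE = re.compile(r"coordinator: BE: (\d{1,3}(?:\.\d{1,3}){3})")


def coordinator_ip(log_line):
    """Return the coordinator BE address found in a log line, or None."""
    match = COORDINATOR_RE.search(log_line)
    return ipaddress.ip_address(match.group(1)) if match else None


def classify(ip):
    """Say whether an address belongs to the pod or the Service range."""
    if ip in POD_NET:
        return "pod"
    if ip in SVC_NET:
        return "service"
    return "unknown"


# Shortened excerpt of the log line from this report.
line = "coordinator: BE: 10.42.1.19, transaction status: COMMITTED"
ip = coordinator_ip(line)
current_be = ipaddress.ip_address("10.42.1.6")  # BE pod IP after the restart

print(ip, classify(ip))   # 10.42.1.19 pod
print(ip == current_be)   # False: the logged coordinator is the old pod IP
```

This only confirms that the logged coordinator is a pod address that no longer matches the running BE; it does not show why the FE process was killed.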

What You Expected?

Fix the issue so that the FE starts normally after the FE and BE are restarted.

How to Reproduce?

No response

Anything Else?

No response

ming12713 commented 3 days ago

In my case, Kafka writes to Doris through the connector in sink mode. While Doris is restarting, the connector keeps writing data. The log records the coordinator BE IP. Is it possible that the connector writes data using the Stream Load method? The transaction is synchronized to the FE metadata through BDB, but it has not yet been synchronized to the BE. If the BE is restarted at that moment, the FE may end up with a coordinator BE IP that it can no longer connect to, causing cluster issues. Is my understanding correct?
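One way to test the "cannot connect" part of this hypothesis is a plain TCP probe from the FE pod to the coordinator address in the log. The sketch below is only an illustration: `can_connect` is a hypothetical helper, and port 9060 is assumed to be the BE's default `be_port`, which may differ in this deployment.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # 9060 is assumed to be the BE thrift port (be_port); adjust if overridden.
    for be_ip in ("10.42.1.19", "10.42.1.6"):
        print(be_ip, can_connect(be_ip, 9060))
```

If the old pod IP is unreachable while the new one answers, that is consistent with the replayed transaction pointing at a BE address that no longer exists, though it does not prove that this is what stops the FE.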