apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

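Copying the entire directory of an existing Doris environment and starting it in a new environment takes down the BE service in the original environment #8446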

Open huyangg opened 2 years ago

huyangg commented 2 years ago

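Problem description:

Copying the entire directory of an existing Doris environment and starting it in a new environment takes down the BE service in the original environment.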

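Steps to reproduce:

Copy the entire directory of an existing Doris environment, start it in a new environment, and watch the BE status in the original environment. Precondition: the new and original environments can reach each other over the network.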

Doris version:

Palo version 0.14.13.1-Unknown

Doris cluster information:

Single-node and multi-node environments. In the new environment, the following setting was added to fe.conf: metadata_failure_recovery=true.

Error information:

dmesg -T shows no OOM messages.

In the original environment, the BE's Alive status is false and its ErrMsg is: epoch is not greater than local. ignore heartbeat.
MySQL [(none)]> SHOW PROC '/backends';
+-----------+-----------------+---------------+---------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+----------------------------------------------------+-------------------+-------------------------------------------------------------------------------------------+
| BackendId | Cluster         | IP            | HostName      | HeartbeatPort | BePort | HttpPort | BrpcPort | LastStartTime       | LastHeartbeat       | Alive | SystemDecommissioned | ClusterDecommissioned | TabletNum | DataUsedCapacity | AvailCapacity | TotalCapacity | UsedPct | MaxDiskUsedPct | ErrMsg                                             | Version           | Status                                                                                    |
+-----------+-----------------+---------------+---------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+----------------------------------------------------+-------------------+-------------------------------------------------------------------------------------------+
| 10003     | default_cluster | 172.16.1.201 | 172.16.1.201 | 9050          | 9060   | 8040     | 8060     | 2021-10-20 21:34:31 | 2022-03-10 15:27:53 | false | false                | false                 | 837       | 1.442 GB         | 138.925 GB    | 191.024 GB    | 27.27 % | 27.27 %        | epoch is not greater than local. ignore heartbeat. | 0.14.13.1-Unknown | {"lastSuccessReportTabletsTime":"2022-03-10 15:27:22","lastStreamLoadTime":1645584570768} |
+-----------+-----------------+---------------+---------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+----------------------------------------------------+-------------------+-------------------------------------------------------------------------------------------+
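
When a cluster has many backends, the wide table above is hard to scan by eye. The following is a minimal sketch, not part of Doris, that parses pasted MySQL CLI table text and lists the backends whose Alive column is false. The function name `parse_mysql_table` is hypothetical, the sample is an abbreviated copy of the row above (only four of its columns), and the parser assumes that no cell value contains a `|` character.

```python
def parse_mysql_table(text: str) -> list[dict[str, str]]:
    """Parse MySQL CLI table output into a list of {column: value} dicts.

    Assumption: cell values never contain '|'.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines, prompts, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first '|' line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Abbreviated copy of the SHOW PROC '/backends' output from this issue.
sample = """
+-----------+--------------+-------+----------------------------------------------------+
| BackendId | IP           | Alive | ErrMsg                                             |
+-----------+--------------+-------+----------------------------------------------------+
| 10003     | 172.16.1.201 | false | epoch is not greater than local. ignore heartbeat. |
+-----------+--------------+-------+----------------------------------------------------+
"""

dead = [row for row in parse_mysql_table(sample) if row["Alive"] == "false"]
for row in dead:
    print(row["BackendId"], row["IP"], row["ErrMsg"])
```

The same function should work on the full-width output, since it only relies on the `|` separators and the header row.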

**be.info.log:**
I0310 15:27:53.379778  3011 plan_fragment_executor.cpp:583] Close() fragment_instance_id=65ac48000aed4ecc-9b947eb57686de25
I0310 15:27:54.298195 10114 heartbeat_server.cpp:58] get heartbeat from FE.host:172.16.2.113, port:9020, cluster id:138756675, counter:2424613
I0310 15:27:54.298223 10114 heartbeat_server.cpp:120] master change. new master host: 172.16.2.113. port: 9020. epoch: 8
I0310 15:27:54.298228 10114 heartbeat_server.cpp:166] Master FE is changed or restarted. report tablet and disk info immediately
I0310 15:27:54.298241 10114 task_worker_pool.cpp:258] notify task worker pool: TaskWorkerPool.REPORT_DISK_STATE
I0310 15:27:54.298250 10114 task_worker_pool.cpp:258] notify task worker pool: TaskWorkerPool.REPORT_OLAP_TABLE
I0310 15:27:54.298363  3128 data_dir.cpp:837] path: /root/DORIS-0.14.7-release/be/storage total capacity: 1064086802432, available capacity: 953151504384
I0310 15:27:54.299175  3129 tablet_manager.cpp:880] begin to build all report tablets info
I0310 15:27:54.299291  3129 tablet_manager.cpp:885] find expired transactions for 0 tablets
I0310 15:27:54.299764  3128 storage_engine.cpp:373] get root path info cost: 1 ms. tablet counter: 2087
I0310 15:27:54.300318 10115 backend_service.cpp:325] get_batch stream_load_record rocksdb successfully. records size: 0, last_stream_load_timestamp: 1645584507086
I0310 15:27:54.305917  3129 tablet_manager.cpp:922] success to build all report tablets info. tablet_count=2087
I0310 15:27:54.353857  3128 task_worker_pool.cpp:1587] finish report DISK. master host: 172.16.2.113, port: 9020
I0310 15:27:54.361510  3129 task_worker_pool.cpp:1587] finish report TABLET. master host: 172.16.2.113, port: 9020
I0310 15:27:57.650785  3127 task_worker_pool.cpp:1587] finish report TASK. master host: 172.16.2.113, port: 9020
I0310 15:27:57.753644  3063 storage_engine.cpp:625] start trash and snapshot sweep.
I0310 15:27:57.755581  3063 storage_engine.cpp:373] get root path info cost: 1 ms. tablet counter: 2087
I0310 15:27:57.755627  3063 storage_engine.cpp:647] Start to sweep path /root/DORIS-0.14.7-release/be/storage
W0310 15:27:58.920964  3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7
W0310 15:28:03.929098  3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7
I0310 15:28:07.652289  3127 task_worker_pool.cpp:1587] finish report TASK. master host: 172.16.2.113, port: 9020
W0310 15:28:08.936048  3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7
W0310 15:28:13.945868  3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7
I0310 15:28:17.653152  3127 task_worker_pool.cpp:1587] finish report TASK. master host: 172.16.2.113, port: 9020
I0310 15:28:18.727761  3061 load_channel_mgr.cpp:241] cleaning timed out load channels
I0310 15:28:18.727794  3061 load_channel_mgr.cpp:274] load mem consumption(bytes). limit: 86418309775, current: 0, peak: 1241120388
W0310 15:28:18.952822  3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7
W0310 15:28:23.958277  3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7
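The log shows that the FE master switched; 172.16.2.113 is the address of the new environment. After noticing the problem, I immediately stopped the FE in the new environment.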
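
The sequence in the log can be illustrated with a small model. This is a sketch inferred only from the log messages above ("master change ... epoch: 8", then "epoch is not greater than local ... local epoch: 8 received epoch: 7"); it is not the actual `heartbeat_server.cpp` logic, and the class name, method name, and the original FE's address are hypothetical placeholders.

```python
from dataclasses import dataclass, field

# Placeholder: the original FE's address does not appear in the issue.
ORIGINAL_FE = "original-fe.example"
COPIED_FE = "172.16.2.113"  # new environment, taken from the log


@dataclass
class BackendMasterInfo:
    """Toy model of the master info a BE remembers (assumed behavior)."""
    master_host: str
    epoch: int
    messages: list = field(default_factory=list)

    def on_heartbeat(self, fe_host: str, epoch: int) -> bool:
        """Return True if the heartbeat is accepted, False if ignored."""
        if fe_host == self.master_host and epoch == self.epoch:
            return True  # heartbeat from the master already known
        if epoch > self.epoch:
            # A higher epoch wins: the BE switches to the new master.
            self.master_host, self.epoch = fe_host, epoch
            self.messages.append(f"master change. new master host: {fe_host}. epoch: {epoch}")
            return True
        self.messages.append(
            "epoch is not greater than local. ignore heartbeat. "
            f"local epoch: {self.epoch} received epoch: {epoch}"
        )
        return False


be = BackendMasterInfo(master_host=ORIGINAL_FE, epoch=7)
accepted_copy = be.on_heartbeat(COPIED_FE, 8)         # copied FE announces epoch 8
accepted_original = be.on_heartbeat(ORIGINAL_FE, 7)   # original FE still sends epoch 7
for message in be.messages:
    print(message)
```

Under this model, once the BE has accepted the copied FE's higher epoch, every heartbeat from the original FE is ignored, which would match the original FE reporting the BE as not alive. Restarting the BE, as in the solution below, presumably clears the remembered epoch, but the issue does not confirm the mechanism.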

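Solution (reply from community engineers or other users)

Stop the FE in the new environment and restart the BE service in the original environment. After that, the services and the business workload were observed to be running normally.

dataalive commented 2 years ago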

So the issue is resolved?

maythorn300 commented 1 year ago

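I am upgrading from 0.14 to 1.1.2, which requires going through 0.15 as an intermediate version. I ran into the same problem while upgrading the FE from 0.14 to 0.15. How can this be resolved now?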

Level1Accelerator commented 1 year ago

+1. This problem shows up when upgrading the FE first and then the BE.