Open huyangg opened 2 years ago
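Copying the entire directory of an existing Doris environment and starting it in a new environment causes the BE service in the original environment to go down.
Copy the entire directory of the existing Doris environment, start it in the new environment, and observe the BE status in the original environment. Precondition: the new and original environments can reach each other over the network.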
Palo version 0.14.13.1-Unknown
Single-node and multi-node environments. In the new environment, the following setting was added to fe.conf: metadata_failure_recovery=true.
dmesg -T shows no OOM messages. In the original environment the BE's Alive status is false and ErrMsg is: epoch is not greater than local. ignore heartbeat.

    MySQL [(none)]> SHOW PROC '/backends';

| BackendId | Cluster | IP | HostName | HeartbeatPort | BePort | HttpPort | BrpcPort | LastStartTime | LastHeartbeat | Alive | SystemDecommissioned | ClusterDecommissioned | TabletNum | DataUsedCapacity | AvailCapacity | TotalCapacity | UsedPct | MaxDiskUsedPct | ErrMsg | Version | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10003 | default_cluster | 172.16.1.201 | 172.16.1.201 | 9050 | 9060 | 8040 | 8060 | 2021-10-20 21:34:31 | 2022-03-10 15:27:53 | false | false | false | 837 | 1.442 GB | 138.925 GB | 191.024 GB | 27.27 % | 27.27 % | epoch is not greater than local. ignore heartbeat. | 0.14.13.1-Unknown | {"lastSuccessReportTabletsTime":"2022-03-10 15:27:22","lastStreamLoadTime":1645584570768} |

**be.info.log output:**

    I0310 15:27:53.379778 3011 plan_fragment_executor.cpp:583] Close() fragment_instance_id=65ac48000aed4ecc-9b947eb57686de25
    I0310 15:27:54.298195 10114 heartbeat_server.cpp:58] get heartbeat from FE.host:172.16.2.113, port:9020, cluster id:138756675, counter:2424613
    I0310 15:27:54.298223 10114 heartbeat_server.cpp:120] master change. new master host: 172.16.2.113. port: 9020. epoch: 8
    I0310 15:27:54.298228 10114 heartbeat_server.cpp:166] Master FE is changed or restarted. report tablet and disk info immediately
    I0310 15:27:54.298241 10114 task_worker_pool.cpp:258] notify task worker pool: TaskWorkerPool.REPORT_DISK_STATE
    I0310 15:27:54.298250 10114 task_worker_pool.cpp:258] notify task worker pool: TaskWorkerPool.REPORT_OLAP_TABLE
    I0310 15:27:54.298363 3128 data_dir.cpp:837] path: /root/DORIS-0.14.7-release/be/storage total capacity: 1064086802432, available capacity: 953151504384
    I0310 15:27:54.299175 3129 tablet_manager.cpp:880] begin to build all report tablets info
    I0310 15:27:54.299291 3129 tablet_manager.cpp:885] find expired transactions for 0 tablets
    I0310 15:27:54.299764 3128 storage_engine.cpp:373] get root path info cost: 1 ms. tablet counter: 2087
    I0310 15:27:54.300318 10115 backend_service.cpp:325] get_batch stream_load_record rocksdb successfully. records size: 0, last_stream_load_timestamp: 1645584507086
    I0310 15:27:54.305917 3129 tablet_manager.cpp:922] success to build all report tablets info. tablet_count=2087
    I0310 15:27:54.353857 3128 task_worker_pool.cpp:1587] finish report DISK. master host: 172.16.2.113, port: 9020
    I0310 15:27:54.361510 3129 task_worker_pool.cpp:1587] finish report TABLET. master host: 172.16.2.113, port: 9020
    I0310 15:27:57.650785 3127 task_worker_pool.cpp:1587] finish report TASK. master host: 172.16.2.113, port: 9020
    I0310 15:27:57.753644 3063 storage_engine.cpp:625] start trash and snapshot sweep.
    I0310 15:27:57.755581 3063 storage_engine.cpp:373] get root path info cost: 1 ms. tablet counter: 2087
    I0310 15:27:57.755627 3063 storage_engine.cpp:647] Start to sweep path /root/DORIS-0.14.7-release/be/storage
    W0310 15:27:58.920964 3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7
    W0310 15:28:03.929098 3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7
    I0310 15:28:07.652289 3127 task_worker_pool.cpp:1587] finish report TASK. master host: 172.16.2.113, port: 9020
    W0310 15:28:08.936048 3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7
    W0310 15:28:13.945868 3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7
    I0310 15:28:17.653152 3127 task_worker_pool.cpp:1587] finish report TASK. master host: 172.16.2.113, port: 9020
    I0310 15:28:18.727761 3061 load_channel_mgr.cpp:241] cleaning timed out load channels
    I0310 15:28:18.727794 3061 load_channel_mgr.cpp:274] load mem consumption(bytes). limit: 86418309775, current: 0, peak: 1241120388
    W0310 15:28:18.952822 3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7
    W0310 15:28:23.958277 3247 heartbeat_server.cpp:125] epoch is not greater than local. ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7

The log shows that the master FE switched: 172.16.2.113 is the address of the new environment. After noticing the problem, I immediately stopped the FE in the new environment.
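To find the same symptoms in a longer BE log, the two message types quoted above can be extracted with a short script. This is a minimal sketch: the regular expressions are written against the exact message wording seen in this report (Palo 0.14.13.1) and may not match other versions; the function name `summarize_heartbeat_log` is made up for illustration.

```python
import re
from typing import Iterable

# Patterns follow the wording of the log lines quoted in this issue;
# other Doris versions may phrase these messages differently.
MASTER_CHANGE = re.compile(
    r"master change\. new master host: (?P<host>\S+?)\. "
    r"port: (?P<port>\d+)\. epoch: (?P<epoch>\d+)"
)
IGNORED = re.compile(
    r"epoch is not greater than local\. ignore heartbeat\. "
    r"host: (?P<host>\S+) port: (?P<port>\d+) "
    r"local epoch: (?P<local>\d+) received epoch: (?P<received>\d+)"
)


def summarize_heartbeat_log(lines: Iterable[str]) -> dict:
    """Collect master-change events and rejected heartbeats from BE log lines."""
    changes = []   # (host, port, epoch) for each accepted master change
    ignored = []   # (local_epoch, received_epoch) for each rejected heartbeat
    for line in lines:
        m = MASTER_CHANGE.search(line)
        if m:
            changes.append((m["host"], int(m["port"]), int(m["epoch"])))
            continue
        m = IGNORED.search(line)
        if m:
            ignored.append((int(m["local"]), int(m["received"])))
    return {"master_changes": changes, "ignored_heartbeats": ignored}


# Three lines copied from the log in this issue.
sample = [
    "I0310 15:27:54.298223 10114 heartbeat_server.cpp:120] master change. "
    "new master host: 172.16.2.113. port: 9020. epoch: 8",
    "I0310 15:27:57.753644 3063 storage_engine.cpp:625] start trash and snapshot sweep.",
    "W0310 15:27:58.920964 3247 heartbeat_server.cpp:125] epoch is not greater than local. "
    "ignore heartbeat. host: 172.16.2.113 port: 9020 local epoch: 8 received epoch: 7",
]
summary = summarize_heartbeat_log(sample)
print(summary)
```

A master change to an unexpected host followed by repeated rejected heartbeats is the pattern reported here.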
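Stopped the FE in the new environment and restarted the BE service in the original environment. Service status is back to normal and the business workload runs normally.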
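One possible reading of the log messages and of why restarting the BE helped is sketched below as a toy model. It is not the Doris implementation. The class and method names are invented, `original-fe` is a placeholder because the original FE's address is not given in this report, and the idea that the BE keeps its epoch only in memory (so a restart resets it) is an assumption consistent with the observed behaviour, not something confirmed here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Heartbeat:
    host: str
    port: int
    epoch: int


class BackendModel:
    """Toy model of a BE's heartbeat check; NOT the actual Doris code."""

    def __init__(self) -> None:
        # (host, port) of the FE currently accepted as master.
        self.master: Optional[Tuple[str, int]] = None
        # Assumption: the epoch lives in memory only, so a restart resets it.
        self.epoch = 0

    def on_heartbeat(self, hb: Heartbeat) -> str:
        sender = (hb.host, hb.port)
        if sender != self.master:
            # A different FE claims to be master: accept only a newer epoch.
            if hb.epoch > self.epoch:
                self.master, self.epoch = sender, hb.epoch
                return "master change"
            return "ignored: epoch is not greater than local"
        # Same master as before: accept, tracking any epoch increase.
        self.epoch = max(self.epoch, hb.epoch)
        return "ok"


be = BackendModel()
first = be.on_heartbeat(Heartbeat("original-fe", 9020, 7))      # original FE
takeover = be.on_heartbeat(Heartbeat("172.16.2.113", 9020, 8))  # copied FE
rejected = be.on_heartbeat(Heartbeat("original-fe", 9020, 7))   # original FE again

be_restarted = BackendModel()  # models restarting the BE process
recovered = be_restarted.on_heartbeat(Heartbeat("original-fe", 9020, 7))
print(first, takeover, rejected, recovered, sep="\n")
```

Under these assumptions, the copied FE's higher epoch takes over the BE, the original FE's heartbeats are then rejected and it marks the BE as not alive, and a BE restart (with the copied FE stopped) lets the original FE be accepted again.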
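So, is the issue resolved?
I am upgrading from 0.14 to 1.1.2, which requires 0.15 as an intermediate version. I ran into the same problem while upgrading the FE from 0.14 to 0.15. How can it be solved now?
+1. This problem occurs when the FE is upgraded first and the BE afterwards.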