StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0
9.19k stars 1.83k forks source link

heartbeat_server.cpp:74] Fail to handle heartbeat: Internal error: BE is shutting down cached master #12496

Closed Lordeath closed 2 years ago

Lordeath commented 2 years ago

lastest be.INFO:

I1025 22:42:48.446453 100538 heartbeat_server.cpp:63] get heartbeat from FE.host:172.17.0.1, port:9020, cluster id:1606073905, counter:361
I1025 22:43:00.296890 100081 daemon.cpp:191] Current memory statistics: process(187383744), query_pool(0), load(0), metadata(5560903), compaction(0), schema_change(0), column_pool(28497804), page_cache(18566016), update(650326), chunk_allocator(10896512), clone(0), consistency(0)
I1025 22:43:15.298213 100081 daemon.cpp:191] Current memory statistics: process(187383744), query_pool(0), load(0), metadata(5560903), compaction(0), schema_change(0), column_pool(28497804), page_cache(18566016), update(650326), chunk_allocator(10896512), clone(0), consistency(0)
I1025 22:43:30.299477 100081 daemon.cpp:191] Current memory statistics: process(187383744), query_pool(0), load(0), metadata(5560903), compaction(0), schema_change(0), column_pool(28497804), page_cache(18566016), update(650326), chunk_allocator(10896512), clone(0), consistency(0)
I1025 22:43:36.532577 100453 tablet_manager.cpp:566] Found the best tablet to compact. compaction_type=cumulative tablet_id=10087 highest_score=2
I1025 22:43:45.300323 100081 daemon.cpp:191] Current memory statistics: process(187383744), query_pool(0), load(0), metadata(5560903), compaction(0), schema_change(0), column_pool(28497804), page_cache(18566016), update(650326), chunk_allocator(10896512), clone(0), consistency(0)
I1025 22:43:45.480614 100447 load_channel_mgr.cpp:181] Memory consumption(bytes) limit=4042921807 current=0 peak=124325488
I1025 22:43:47.542276 100422 tablet_manager.cpp:794] Report all 34 tablets info
I1025 22:43:48.537669 100538 heartbeat_server.cpp:63] get heartbeat from FE.host:172.17.0.1, port:9020, cluster id:1606073905, counter:373
I1025 22:44:00.301229 100081 daemon.cpp:191] Current memory statistics: process(187383744), query_pool(0), load(0), metadata(5560903), compaction(0), schema_change(0), column_pool(28497804), page_cache(18566016), update(650326), chunk_allocator(10896512), clone(0), consistency(0)
I1025 22:44:15.302415 100081 daemon.cpp:191] Current memory statistics: process(187383744), query_pool(0), load(0), metadata(5560903), compaction(0), schema_change(0), column_pool(28497804), page_cache(18566016), update(650326), chunk_allocator(10896512), clone(0), consistency(0)
I1025 22:44:18.551820 100058 server.cpp:1100] Server[starrocks::BackendInternalServiceImpl<starrocks::PInternalService>+starrocks::LakeServiceImpl+starrocks::BackendInternalServiceImpl<doris::PBackendService>] is going to quit
W1025 22:44:18.592672 100538 heartbeat_server.cpp:74] Fail to handle heartbeat: Internal error: BE is shutting down cached master info: TMasterInfo(network_address=TNetworkAddress(hostname=172.17.0.1, port=9020), cluster_id=1606073905, epoch=1, token=346ec20e-ce27-4442-afcd-5b8dbfe9f0eb, backend_ip=172.17.0.1, http_port=8030, heartbeat_flags=0, backend_id=10002) received master info: TMasterInfo(network_address=TNetworkAddress(hostname=172.17.0.1, port=9020), cluster_id=1606073905, epoch=1, token=346ec20e-ce27-4442-afcd-5b8dbfe9f0eb, backend_ip=172.17.0.1, http_port=8030, heartbeat_flags=0, backend_id=10002)

found this bug on different computers So is my shell script wrong?

Steps to reproduce the behavior (Required)

i just run my x.sh:

if [ ! -f ~/StarRocks-2.4.0.tar.gz ]; then
  echo "自动下载安装 StarRocks-2.4.0.tar.gz"
  wget "https://download.starrocks.com/zh-CN/download/request-download/56/StarRocks-2.4.0.tar.gz" -O ~/StarRocks-2.4.0.tar.gz
  tar -xzvf ~/StarRocks-2.4.0.tar.gz -C /opt/
  # 建立一个快捷方式,后面的操作都用快捷方式来操作
  rm -rf /opt/starRocks
  ln -s /opt/StarRocks-2.4.0 /opt/starRocks
fi

# fe配置文件
echo FE配置文件:
mv -f /opt/starRocks/fe/conf/fe.conf /opt/starRocks/fe/conf/fe.conf.source
tee -a /opt/starRocks/fe/conf/fe.conf <<EOF
LOG_DIR = ${STARROCKS_HOME}/log
DATE = "$(date +%Y%m%d-%H%M%S)"
JAVA_OPTS="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:$STARROCKS_HOME/log/fe.gc.log.$DATE"
JAVA_OPTS_FOR_JDK_9="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xlog:gc*:$STARROCKS_HOME/log/fe.gc.log.$DATE:time"
sys_log_level = INFO
http_port = 8030
rpc_port = 9020
query_port = 9030
edit_log_port = 9010
mysql_service_nio_enabled = true
priority_networks=172.17.0.0/24
EOF

echo BE配置文件:
mv -f /opt/starRocks/be/conf/be.conf /opt/starRocks/be/conf/be.conf.source
tee -a /opt/starRocks/be/conf/be.conf <<EOF
sys_log_level = INFO
be_port = 9060
webserver_port = 8040
heartbeat_service_port = 9050
brpc_port = 8060
default_rowset_type = beta
priority_networks=172.17.0.0/24
EOF

echo 操作FE
sh /opt/starRocks/fe/bin/stop_fe.sh --daemon
mkdir -p /opt/starRocks/fe/log/
rm -rf /opt/starRocks/fe/log/*
mkdir -p /opt/starRocks/fe/meta/
rm -rf /opt/starRocks/fe/meta/*
sh /opt/starRocks/fe/bin/start_fe.sh --daemon
grep -E "FeServer|NMysqlServer|QeService" /opt/starRocks/fe/log/fe.log

xxx=$(grep -E "FeServer|NMysqlServer|QeService" /opt/starRocks/fe/log/fe.log)
while [ -z "$xxx" ]; do
  sleep 1
  xxx=$(grep -E "FeServer|NMysqlServer|QeService" /opt/starRocks/fe/log/fe.log)
  echo "$xxx"
done

mysql -h 127.0.0.1 -P9030 -uroot -e"SET PASSWORD = PASSWORD('XXXXXXXX')"

mysql -h 127.0.0.1 -P9030 -uroot -pXXXXXXXX -e'ALTER SYSTEM ADD BACKEND "172.17.0.1:9050"'
sh /opt/starRocks/be/bin/stop_be.sh --daemon
mkdir -p /opt/starRocks/be/log/
rm -rf /opt/starRocks/be/log/*
mkdir -p /opt/starRocks/be/storage/
rm -rf /opt/starRocks/be/storage/*
sh /opt/starRocks/be/bin/start_be.sh --daemon
cat /opt/starRocks/be/log/be.INFO
mysql -h 127.0.0.1 -P9030 -uroot -pXXXXXXXX -e'SHOW backends'

xxx=$(mysql -h 127.0.0.1 -P9030 -uroot -pXXXXXXXX -e'SHOW backends')
yyy=$(echo "$xxx" | grep -v 'N/A')
while [ -z "$yyy" ]; do
  sleep 1
  xxx=$(mysql -h 127.0.0.1 -P9030 -uroot -pXXXXXXXX -e'SHOW backends')
  yyy=$(echo "$xxx" | grep -v 'N/A')
  echo "$xxx"
  echo "$yyy"
done

if [ ! -f ~/smt.tar.gz ]; then
  wget "https://cdn-thirdparty.starrocks.com/smt.tar.gz" -O ~/smt.tar.gz
fi
if [ ! -f ~/flink-1.14.6-bin-scala_2.12.tgz ]; then
  wget "http://mirrors.cloud.tencent.com/apache/flink/flink-1.14.6/flink-1.14.6-bin-scala_2.12.tgz" -O ~/flink-1.14.6-bin-scala_2.12.tgz
fi
if [ ! -f ~/flink-connector-starrocks-1.2.3_flink-1.14_2.12.jar ]; then
  wget "https://github.com/StarRocks/starrocks-connector-for-apache-flink/releases/download/v1.2.3/flink-connector-starrocks-1.2.3_flink-1.14_2.12.jar" -O ~/flink-connector-starrocks-1.2.3_flink-1.14_2.12.jar
fi
if [ ! -f ~/flink-sql-connector-mysql-cdc-2.2.0.jar ]; then
  wget "https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/2.2.0/flink-sql-connector-mysql-cdc-2.2.0.jar" -O ~/flink-sql-connector-mysql-cdc-2.2.0.jar
fi
if [ ! -f ~/mysql-connector-java-8.0.30.jar ]; then
  wget "https://repo.maven.apache.org/maven2/mysql/mysql-connector-java/8.0.30/mysql-connector-java-8.0.30.jar" -O ~/mysql-connector-java-8.0.30.jar
fi

if [ ! -f ~/flink-sql-connector-sqlserver-cdc-2.2.1.jar ]; then
  wget "https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-sqlserver-cdc/2.2.1/flink-sql-connector-sqlserver-cdc-2.2.1.jar" -O ~/flink-sql-connector-sqlserver-cdc-2.2.1.jar
fi

mkdir /opt/flink -p
tar -xzvf ~/smt.tar.gz -C /opt/flink
rm -rf /opt/flink/flink-1.14.6
tar -xzvf ~/flink-1.14.6-bin-scala_2.12.tgz -C /opt/flink

\cp -r ~/flink-connector-starrocks-1.2.3_flink-1.14_2.12.jar /opt/flink/flink-1.14.6/lib/
\cp -r ~/flink-sql-connector-mysql-cdc-2.2.0.jar /opt/flink/flink-1.14.6/lib/
\cp -r ~/mysql-connector-java-8.0.30.jar /opt/flink/flink-1.14.6/lib/
\cp -r ~/flink-sql-connector-sqlserver-cdc-2.2.1.jar /opt/flink/flink-1.14.6/lib/

mv -f /opt/flink/smt/conf/config_prod.conf /opt/flink/smt/conf/config_prod.conf.source
tee -a /opt/flink/smt/conf/config_prod.conf <<EOF
[db]
host = XXXX
port = XXXXX
user = XXXXX
password = XXXXX
type = sqlserver

[other]
be_num = 1
use_decimal_v3 = true
output_dir = ./result

[table-rule.1]
database = XXXXXXX
table = XXXXXXXXXXX
schema = dbo

flink.starrocks.jdbc-url=jdbc:mysql://192.168.4.67:9030
flink.starrocks.load-url=192.168.4.67:8030
flink.starrocks.username=root
flink.starrocks.password=XXXXXXXX
flink.starrocks.sink.max-retries=10
flink.starrocks.sink.buffer-flush.interval-ms=1000
flink.starrocks.sink.properties.format=json
flink.starrocks.sink.properties.strip_outer_array=true
EOF

cat /opt/flink/smt/conf/config_prod.conf

/opt/flink/smt/starrocks-migrate-tool

cat /opt/flink/smt/result/starrocks-create.all.sql
mysql -h 127.0.0.1 -P9030 -uroot -pXXXXXXXX </opt/flink/smt/result/starrocks-create.all.sql

grep -v 'taskmanager.numberOfTaskSlots' /opt/flink/flink-1.14.6/conf/flink-conf.yaml | grep -v 'taskmanager.memory.process.size' >/opt/flink/flink-1.14.6/conf/flink-conf.yaml2
{
  echo taskmanager.numberOfTaskSlots: 40
  echo taskmanager.memory.process.size: 5096m
  echo execution.checkpointing.interval: 10000
  echo state.backend: filesystem
  echo state.checkpoints.dir: file:///opt/flink/flink-checkpoints-directory
  echo state.savepoints.dir: file:///opt/flink/flink-savepoints-directory
} >>/opt/flink/flink-1.14.6/conf/flink-conf.yaml2

mv -f /opt/flink/flink-1.14.6/conf/flink-conf.yaml2 /opt/flink/flink-1.14.6/conf/flink-conf.yaml

/opt/flink/flink-1.14.6/bin/stop-cluster.sh
rm -rf /opt/flink/flink-1.14.6/log/*
/opt/flink/flink-1.14.6/bin/start-cluster.sh
/opt/flink/flink-1.14.6/bin/sql-client.sh -f /opt/flink/smt/result/flink-create.all.sql

/opt/flink/flink-1.14.6/bin/flink list

tailf /opt/flink/flink-1.14.6/log/flink-root-taskexecutor-0-*.log

Expected behavior (Required)

do not shut down BE

Real behavior (Required)

StarRocks version (Required)

Lordeath commented 2 years ago

sometimes quit by: I1025 23:29:03.222785 15293 server.cpp:1100] Server[starrocks::BackendInternalServiceImpl<starrocks::PInternalService>+starrocks::LakeServiceImpl+starrocks::BackendInternalServiceImpl<doris::PBackendService>] is going to quit

no idea why

Lordeath commented 2 years ago
[root@localhost bin]# java -version
openjdk version "1.8.0_345"
OpenJDK Runtime Environment (build 1.8.0_345-b01)
OpenJDK 64-Bit Server VM (build 25.345-b01, mixed mode)
Lordeath commented 2 years ago

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.345.b01-1.el7_9.x86_64

Lordeath commented 2 years ago

阿里云服务上买的机器的BE也挂了,版本是2.3.0,be.out的输出是这样的:

I1027 07:27:43.205304 30725 rowset_merger.cpp:263] compaction merge finished. tablet=11387 #key=3 algorithm=VERTICAL_COMPACTION column_group_size=38 input(entry=40 rows=23360 del=0 actual=23360 bytes=6.08 MB) output(rows=23360 chunk=44 bytes=4.76 MB) duration: 371ms
I1027 07:27:43.211071 30725 tablet_updates.cpp:1215] commit compaction tablet:11387 version:789.1 rowset:812 #seg:1 #row:23360 size:4.76 MB #pending:0 state_memory:1.49 MB
I1027 07:27:43.211123 19631 tablet_updates.cpp:1241] apply_compaction_commit start tablet:11387 version:789.1 rowset:812
I1027 07:27:43.218067 19631 tablet_updates.cpp:1388] apply_compaction_commit finish tablet:11387 version:789.1 total del/row:0/23360 0% rowset:812 #row:23360 #del:0 #delvec:1 duration:6ms(0/6/0)
I1027 07:27:43.218343 30725 tablet_manager.cpp:672] Found the best tablet to compact. compaction_type=update tablet_id=11269 highest_score=1268949620
I1027 07:27:43.218374 30725 tablet_updates.cpp:1688] update compaction start tablet:11269 version:789 score:1268949632 pick:39/valid:39/all:39 773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811 #rows:23494->23494 bytes:5.91 MB->5.91 MB(estimate)
I1027 07:27:43.247026 30723 rowset_merger.cpp:263] compaction merge finished. tablet=11573 #key=3 algorithm=VERTICAL_COMPACTION column_group_size=38 input(entry=39 rows=23307 del=0 actual=23307 bytes=5.95 MB) output(rows=23307 chunk=43 bytes=4.75 MB) duration: 266ms
I1027 07:27:43.250888 30723 tablet_updates.cpp:1215] commit compaction tablet:11573 version:789.1 rowset:812 #seg:1 #row:23307 size:4.75 MB #pending:0 state_memory:1.49 MB
I1027 07:27:43.250921 19631 tablet_updates.cpp:1241] apply_compaction_commit start tablet:11573 version:789.1 rowset:812
I1027 07:27:43.255218 19631 tablet_updates.cpp:1388] apply_compaction_commit finish tablet:11573 version:789.1 total del/row:0/23307 0% rowset:812 #row:23307 #del:0 #delvec:1 duration:5ms(0/5/0)
I1027 07:27:43.255479 30723 tablet_manager.cpp:672] Found the best tablet to compact. compaction_type=update tablet_id=11775 highest_score=1268927983
I1027 07:27:43.255498 30723 tablet_updates.cpp:1688] update compaction start tablet:11775 version:789 score:1268928000 pick:39/valid:39/all:39 773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811 #rows:23413->23413 bytes:5.99 MB->5.99 MB(estimate)
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
src/central_freelist.cc:333] tcmalloc: allocation failed 40960 
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
src/central_freelist.cc:333] tcmalloc: allocation failed 8192 
terminate called after throwing an instance of 'terminate called recursively
*** Aborted at 1666826863 (unix time) try "date -d @1666826863" if you are using GNU date ***
St9bad_alloc'
  what():  std::bad_allocPC: @     0x7fc42373c387 __GI_raise
*** SIGABRT (@0x75fa) received by PID 30202 (TID 0x7fc34650b700) from PID 30202; stack trace: ***

    @          0x3fa3ad2 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fc4241f1630 (unknown)
    @     0x7fc42373c387 __GI_raise
    @     0x7fc42373da78 __GI_abort
    @          0x59a87f2 __gnu_cxx::__verbose_terminate_handler()
    @          0x59a72a6 __cxxabiv1::__terminate()
    @          0x59a7311 std::terminate()
    @          0x59a7464 __cxa_throw
    @          0x1888484 _Znwm.cold
    @          0x1b2ab1e starrocks::vectorized::ChunkHelper::convert_field_to_format_v2()
    @          0x1b2b9c1 starrocks::vectorized::ChunkHelper::convert_schema_to_format_v2()
    @          0x1bb1388 starrocks::vectorized::MemTable::MemTable()
    @          0x3315bb4 starrocks::vectorized::DeltaWriter::_reset_mem_table()
    @          0x33161c0 starrocks::vectorized::DeltaWriter::write()
    @          0x3307943 starrocks::vectorized::AsyncDeltaWriter::_execute()
    @          0x40742ec bthread::ExecutionQueueBase::_execute()
    @          0x40750b8 bthread::ExecutionQueueBase::_execute_tasks()
    @          0x2132549 starrocks::ThreadPool::dispatch_thread()
    @          0x212e0fa starrocks::Thread::supervise_thread()
    @     0x7fc4241e9ea5 start_thread
    @     0x7fc423804b0d __clone
    @                0x0 (unknown)
Lordeath commented 2 years ago
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1667285873 (unix time) try "date -d @1667285873" if you are using GNU date ***
PC: @          0x60d9c38 (unknown)
*** SIGSEGV (@0x7efd59cdc578) received by PID 37551 (TID 0x7efd614f0700) from PID 1506657656; stack trace: ***
    @          0x481e332 (unknown)
    @     0x7efe2f92d630 (unknown)
    @          0x60d9c38 (unknown)
    @          0x60db403 (unknown)
    @          0x60db5b8 (unknown)
    @          0x60dca9b (unknown)
    @          0x6063580 jemalloc_usable_size
    @          0x25c4bf5 free
    @     0x7efe2efb3522 __libc_thread_freeres
    @     0x7efe2f925eb8 start_thread
    @     0x7efe2ef4096d __clone
    @                0x0 (unknown)
Lordeath commented 2 years ago

是我的问题,我把最后的 tailf /opt/flink/flink-1.14.6/log/flink-root-taskexecutor-0-*.log 去掉就好了