baidu / braft

An industrial-grade C++ implementation of RAFT consensus algorithm based on brpc, widely used inside Baidu to build highly-available distributed systems.
Apache License 2.0

braft server crashes with a segmentation fault under heavy load #367

Open yanshub opened 2 years ago

yanshub commented 2 years ago

1. Use the counter example.

  1. On the run_server side, set sync 'false' to false and set the number of threads to 128. DEFINE_string sync 'false' 'fsync each time' The server is run with: -bthread_concurrency=128 -crash_on_fatal_log=true -raft_max_segment_size=8388608 -raft_sync=false -port=8100 -conf=127.0.0.1:8100:0,
  2. On the client side, change the number of worker threads to 50 and remove the sleep so that it runs at full load. The client is run with: ./counter_client --add_percentage=100 --bthread_concurrency=100 --conf=127.0.0.1:8100:0, --crash_on_fatal_log=true --log_each_request=false --thread_num=50 --use_bthread=true --timeout_ms=1000

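Once running, QPS reaches about 120k. But after a few seconds the server dies. (Under overload it should slow down or stop responding; it should not crash.)

Client output: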

I0705 12:00:32.577170 22770 /home/ys/braft-1.1.2/example/counter/client.cpp:178] Sending Request to Counter (127.0.0.1:8100:0,) at qps=119160 latency=417
I0705 12:00:33.577314 22770 /home/ys/braft-1.1.2/example/counter/client.cpp:178] Sending Request to Counter (127.0.0.1:8100:0,) at qps=116432 latency=427
I0705 12:00:34.577444 22770 /home/ys/braft-1.1.2/example/counter/client.cpp:178] Sending Request to Counter (127.0.0.1:8100:0,) at qps=119674 latency=415
I0705 12:00:35.577571 22770 /home/ys/braft-1.1.2/example/counter/client.cpp:178] Sending Request to Counter (127.0.0.1:8100:0,) at qps=118686 latency=419
I0705 12:00:36.577692 22770 /home/ys/braft-1.1.2/example/counter/client.cpp:178] Sending Request to Counter (127.0.0.1:8100:0,) at qps=120076 latency=414
I0705 12:00:37.577862 22770 /home/ys/braft-1.1.2/example/counter/client.cpp:178] Sending Request to Counter (127.0.0.1:8100:0,) at qps=62620 latency=417
W0705 12:00:38.080393 22871 /home/ys/braft-1.1.2/example/counter/client.cpp:86] Fail to send request to 127.0.0.1:8100:0 : [E1008]Reached timeout=1000ms @127.0.0.1:8100
W0705 12:00:38.080418 22826 /home/ys/braft-1.1.2/example/counter/client.cpp:86] Fail to send request to 127.0.0.1:8100:0 : [E1008]Reached timeout=1000ms @127.0.0.1:8100
W0705 12:00:38.080410 22839 /home/ys/braft-1.1.2/example/counter/client.cpp:86] Fail to send request to 127.0.0.1:8100:0 : [E1008]Reached timeout=1000ms @127.0.0.1:8100
W0705 12:00:38.080488 22807 /home/ys/braft-1.1.2/example/counter/client.cpp:86] Fail to send request to 127.0.0.1:8100:0 : [E1008]Reached timeout=1000ms @127.0.0.1:8100
W0705 12:00:38.080434 22784 /home/ys/braft-1.1.2/example/counter/client.cpp:86] Fail to send request to 127.0.0.1:8100:0 : [E1008]Reached timeout=1000ms @127.0.0.1:8100

The first part of the output is normal. By the second part, the server, which was running under gdb, had already hit a segmentation fault.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe5110700 (LWP 21973)]
0x000000000072d8aa in brpc::policy::ProcessRpcRequest (msg_base=0x5295e80) at src/brpc/policy/baidu_rpc_protocol.cpp:485
485 src/brpc/policy/baidu_rpc_protocol.cpp: No such file or directory.
Missing separate debuginfos, use: debuginfo-install gflags-2.1.1-6.el7.x86_64 glibc-2.17-326.el7_9.x86_64 gperftools-libs-2.6.1-1.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-51.el7_9.x86_64 leveldb-1.12.0-11.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libselinux-2.5-15.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 openssl-libs-1.0.2k-25.el7_9.x86_64 pcre-8.32-17.el7.x86_64 protobuf-2.5.0-8.el7.x86_64 snappy-1.1.0-3.el7.x86_64 zlib-1.2.7-20.el7_9.x86_64

The backtrace is as follows:

(gdb) bt
#0 0x000000000072d8aa in brpc::policy::ProcessRpcRequest (msg_base=0x5295e80) at src/brpc/policy/baidu_rpc_protocol.cpp:485
#1 0x00000000006a181a in brpc::ProcessInputMessage (void_arg=void_arg@entry=0x5295e80) at src/brpc/input_messenger.cpp:147
#2 0x00000000006a25c3 in operator() (this=, last_msg=0x5295e80) at src/brpc/input_messenger.cpp:153
#3 brpc::InputMessenger::OnNewMessages (m=0x3954000) at /usr/include/c++/4.8.2/bits/unique_ptr.h:184
#4 0x000000000069385d in brpc::Socket::ProcessEvent (arg=0x3954000) at src/brpc/socket.cpp:1018
#5 0x000000000065e05a in bthread::TaskGroup::task_runner (skip_remained=) at src/bthread/task_group.cpp:295

yanshub commented 2 years ago

The problem does not occur after switching to a different machine, so it was a problem with the original machine.