apache / brpc

brpc is an Industrial-grade RPC framework using C++ Language, which is often used in high performance system such as Search, Storage, Machine learning, Advertisement, Recommendation etc. "brpc" means "better RPC".
https://brpc.apache.org
Apache License 2.0
16.41k stars 3.97k forks source link

当下游节点故障重启后,上游节点有概率会一直连不上 #1168

Closed gaodayue closed 2 years ago

gaodayue commented 4 years ago

Describe the bug (描述bug) 当下游节点X故障重启后,集群有时候会出现某个上游节点Y一直无法连接X的情况,其他上游节点在健康检查后会重建与X的连接。例如

1) 下游节点10.26.44.32在09:02:17因故障重启后,某个上游节点Y没有重建与10.26.44.32的连接,日志中持续输出"Not connected to 10.26.44.32:8060 yet"

W0716 09:02:17.695824 142210 input_messenger.cpp:212] Fail to read from Socket{id=896 fd=4228 addr=10.26.44.32:8060:55309} (0xa352000): Connection reset by peer [104]
W0716 09:02:17.702852 141955 data_stream_sender.cpp:138] failed to send brpc batch, error=Host is down, error_text=[E104]Fail to read from Socket{id=896 fd=4228 addr=10.26.44.32:8060:55309} (0x0xa352000): Connection reset by peer [R1][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R2][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R3][E112]Not connected to 10.26.44.32:8060 yet, server_id=896
....忽略类似内容....
W0716 09:09:58.714361 38298 data_stream_sender.cpp:138] failed to send brpc batch, error=Host is down, error_text=[E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R1][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R2][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R3][E112]Not connected to 10.26.44.32:8060 yet, server_id=896

2)查看netstat发现没有Y与10.26.44.32的TCP连接 3)查看Y的/connections发现Socket状态为Broken,信息如下

$ curl http://localhost:8060/connections | grep 10.26.44.32:8060
Broken                    |10.26.44.32:8060   |55309|-  |-           |-    |-        |-     |-         |-       |-         |-     |-         |-       |-          |896

$ curl http://localhost:8060/sockets/896
# This is a broken Socket
version=1
shared_part={
  ref_count=1
  socket_pool=null
  creator_socket=896
  in_size=316616369
  in_num_messages=12120483
  out_size=114960271066
  out_num_messages=12120511
}
nref=1
nevent=1
fd=4228
tos=0
reset_fd_to_now=485975008182us
remote_side=10.26.44.32:8060
local_side=10.22.180.15:55309
on_et_events=0x1bc5dd0
user=(brpc::InputMessenger*)0x5c0ab40
this_id=896
preferred_index=1 (baidu_std)
hc_count=0
avg_input_msg_size=26
read_buf=0
last_read_to_now=960432766us
last_write_to_now=960412394us
overcrowded=0
id_wait_list={}
parsing_context=0
pipeline_q=0
hc_interval_s=3
ninprocess=1
auth_flag_error=0
auth_id=177098681547473
auth_context=0
logoff_flag=0
recycle_flag=1
agent_socket_id=(none)
cid=0
write_head=0
ssl_state=SSL_OFF
tcpi={
  state=7
  ca_state=0
  retransmits=0
  probes=0
  backoff=0
  options=7
  snd_wscale=7
  rcv_wscale=7
  rto=205000
  ato=40000
  snd_mss=1448
  rcv_mss=736
  unacked=0
  sacked=0
  lost=0
  retrans=0
  fackets=0
  last_data_sent=960413
  last_ack_sent=0
  last_data_recv=960433
  last_ack_recv=960413
  pmtu=1500
  rcv_ssthresh=52260
  rtt=2750
  rttvar=3000
  snd_ssthresh=18
  snd_cwnd=18
  advmss=1448
  reordering=3
}

4)Y日志中没有"Checking Socket"和"Revived Socket"的日志(集群启用了健康检查,health_check_interval = 3,其他上游节点有Checking和Revived日志)

5)stack中与brpc相关的pthread包括 1个sample线程 + 1个timer线程 + 1个dispatcher线程 + 48个worker线程

Thread 568 (Thread 0x7ff75256e700 (LWP 140858)):
#0  0x00007ff75263480d in nanosleep () from /lib64/libc.so.6
#1  0x00007ff7526650e4 in usleep () from /lib64/libc.so.6
#2  0x0000000001cf10a2 in bvar::detail::SamplerCollector::run (this=0x5342000) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bvar/detail/sampler.cpp:130
#3  0x0000000001cf1839 in bvar::detail::SamplerCollector::sampling_thread (arg=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bvar/detail/sampler.cpp:81
#4  0x00007ff753268e65 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ff75266d88d in clone () from /lib64/libc.so.6

Thread 256 (Thread 0x7ff6b58c8700 (LWP 142170)):
#0  0x00007ff752667ba9 in syscall () from /lib64/libc.so.6
#1  0x0000000001d1ad43 in futex_wait_private (timeout=0x0, expected=257, addr1=0x7ff68572cf18) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/sys_futex.h:37
#2  bthread::internal::FastPthreadMutex::lock_contended (this=0x7ff68572cf18) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/mutex.cpp:661
#3  0x0000000001d0aab4 in lock_guard (__m=..., this=<synthetic pointer>) at /usr/local/include/c++/7.2.0/bits/std_mutex.h:162
#4  erase_from_butex (state=bthread::WAITER_STATE_TIMEDOUT, wakeup=true, bw=0x7ff68572ce20) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/butex.cpp:467
#5  bthread::erase_from_butex_and_wakeup (arg=0x7ff68572ce20) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/butex.cpp:450
#6  0x0000000001d29da0 in bthread::TimerThread::Task::run_and_delete (this=this@entry=0x1737a7080) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/timer_thread.cpp:277
#7  0x0000000001d2aa72 in bthread::TimerThread::run (this=0xd18c300) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/timer_thread.cpp:391
#8  0x0000000001d2b7e9 in bthread::TimerThread::run_this (arg=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/timer_thread.cpp:117
#9  0x00007ff753268e65 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ff75266d88d in clone () from /lib64/libc.so.6

Thread 254 (Thread 0x7ff6b48c6700 (LWP 142172)):
#0  0x00007ff75266de63 in epoll_wait () from /lib64/libc.so.6
#1  0x0000000001bbb64c in brpc::EventDispatcher::Run (this=0x1517d448) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/brpc/event_dispatcher.cpp:286
#2  0x0000000001bbb769 in brpc::EventDispatcher::RunThis (arg=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/brpc/event_dispatcher.cpp:271
#3  0x0000000001d24af7 in bthread::TaskGroup::task_runner (skip_remained=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/task_group.cpp:293
#4  0x0000000001d0d891 in bthread_make_fcontext ()
#5  0x00010102464c457f in ?? ()
#6  0x0000000000000000 in ?? ()

Thread 253 (Thread 0x7ff6b40c5700 (LWP 142173)): 相同堆栈的worker线程共48个(bthread_worker_count = 49)
#0  0x00007ff752667ba9 in syscall () from /lib64/libc.so.6
#1  0x0000000001d21d8d in futex_wait_private (timeout=0x0, expected=<optimized out>, addr1=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/sys_futex.h:37
#2  wait (expected_state=..., this=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/parking_lot.h:57
#3  bthread::TaskGroup::wait_task (this=this@entry=0x6eff6a80, tid=tid@entry=0x7ff6b40c1728) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/task_group.cpp:121
#4  0x0000000001d252eb in bthread::TaskGroup::run_main_task (this=this@entry=0x6eff6a80) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/task_group.cpp:149
#5  0x0000000001d1f4b9 in bthread::TaskControl::worker_thread (arg=0x7c7a300) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/task_control.cpp:73
#6  0x00007ff753268e65 in start_thread () from /lib64/libpthread.so.0
#7  0x00007ff75266d88d in clone () from /lib64/libc.so.6

To Reproduce (复现方法)

生产环境小概率出现,目前需要通过重启上游节点恢复。

Versions (各种版本) OS: CentOS Linux release 7.1.1503 (Core) Compiler: gcc (GCC) 7.2.0 brpc: 0.9.5

jamesge commented 4 years ago

"日志中持续输出Not connected to 10.26.44.32:8060 yet", 这是指Y还是其他(正常的)上游?Y所在的server并没有陷入死锁吧?

gaodayue commented 4 years ago

"日志中持续输出Not connected to 10.26.44.32:8060 yet", 这是指Y还是其他(正常的)上游?Y所在的server并没有陷入死锁吧?

是指Y节点。应该是没有死锁,除了连不上X(没有触发心跳检查),其他功能都正常。

zhengchengyao commented 3 years ago

@gaodayue 您好,请问下,这个问题后面是怎么解决的?我们现在也遇到这个问题,非常感谢!