apache / brpc

brpc is an Industrial-grade RPC framework using C++ Language, which is often used in high performance system such as Search, Storage, Machine learning, Advertisement, Recommendation etc. "brpc" means "better RPC".
https://brpc.apache.org
Apache License 2.0
16.36k stars 3.96k forks source link

RDMA failure #2265

Open 372046933 opened 1 year ago

372046933 commented 1 year ago

Describe the bug (描述bug)

W0528 17:36:56.793694   510 external/brpc/src/brpc/input_messenger.cpp:240] Close Socket{id=1248 fd=1395 addr=10.156.8.12:24452:24141} (0x55c9fa744b40) due to unknown message: \F87\88>[\BE\C0<\00\00\00\80\CA|\E1=\F87\88>[\BE\C0<\CA\E0U\BE\CA|\E1=\F87\88>[\BE\C0<\00\00\00\80\CA|\E1=\F87\88>[\BE\C0<S}:=O\E9\82\BD...<skipping 16256 bytes>
W0528 17:36:56.794202   510 external/brpc/src/brpc/policy/baidu_rpc_protocol.cpp:265] Fail to write into Socket{id=1248 fd=1395 addr=10.156.8.12:24452:24141} (0x55c9fa744b40): Invalid argument

To Reproduce (复现方法) master分支,RDMA打开的情况,数小时之内必现

Expected behavior (期望行为)

Versions (各种版本) OS: Compiler: brpc: protobuf:

Additional context/screenshots (更多上下文/截图)

Tuvie commented 1 year ago

上面这个错误,贴下另外一端的日志?10.156.8.30这个

372046933 commented 1 year ago

客户端发现Controller->Failed为true

LOG(WARNING) << method_name_ << " RPC failed, " << cntl.ErrorText()
2023-07-24 15:26:52.658294: W xxx/yyy.cc:394] StreamSend RPC failed, [E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R1][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R2][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R3][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R4][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R5][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R6][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R7][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R8][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R9][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R10][E112]Not connected to 10.156.9.20:16619 yet

...

W0724 15:26:52.656967  1267 external/brpc/src/brpc/controller.cpp:675] `CanRetryBackoffInPthread()' returns false, skip retry backoff in pthread.
W0724 15:26:52.657038  1267 external/brpc/src/brpc/controller.cpp:675] `CanRetryBackoffInPthread()' returns false, skip retry backoff in pthread.
W0724 15:26:52.657049  1267 external/brpc/src/brpc/controller.cpp:675] `CanRetryBackoffInPthread()' returns false, skip retry backoff in pthread.

...
Tuvie commented 1 year ago

看日志里面,是用到了rpc的stream么?

372046933 commented 1 year ago

没有Stream,只是RPC名字包含Stream

LOG(WARNING) << method_name_ << " RPC failed, " ...
Tuvie commented 1 year ago

两边的日志打包发一下看看?

372046933 commented 1 year ago

日志不多的。其中有些日志是我在BRPC里面加的

diff --git a/src/brpc/rdma/rdma_endpoint.cpp b/src/brpc/rdma/rdma_endpoint.cpp
index d3a91560..e7e60ff4 100644
--- a/src/brpc/rdma/rdma_endpoint.cpp
+++ b/src/brpc/rdma/rdma_endpoint.cpp
@@ -251,7 +251,9 @@ void RdmaConnect::StartConnect(const Socket* socket,
     }
 }

-void RdmaConnect::StopConnect(Socket* socket) { }
+void RdmaConnect::StopConnect(Socket* socket) {
+    LOG(INFO) << "RdmaConnect StopConnect called";
+}

 void RdmaConnect::Run() {
     _done(errno, _data);
@@ -274,6 +276,7 @@ static void TryReadOnTcpDuringRdmaEst(Socket* s) {
                 break;
             }
         } else if (nr == 0) {
+            LOG(INFO) << "Got socket EOF, " << *s;
             s->SetEOF();
             return;
         } else {
@@ -330,7 +333,7 @@ void RdmaEndpoint::OnNewDataFromTcp(Socket* m) {
     }
 }

-bool HelloNegotiationValid(HelloMessage& msg) {
+bool HelloNegotiationValid(const HelloMessage& msg) {
     if (msg.hello_ver == g_rdma_hello_version &&
         msg.impl_ver == g_rdma_impl_version &&
         msg.block_size >= MIN_BLOCK_SIZE &&
@@ -360,6 +363,7 @@ int RdmaEndpoint::ReadFromFd(void* data, size_t len) {
                     }
                 }
             } else {
+                PLOG(WARNING) << "Failed to read";
                 return -1;
             }
         } else if (nr == 0) {  // Got EOF
@@ -1030,7 +1034,7 @@ int RdmaEndpoint::PostRecv(uint32_t num, bool zerocopy) {
         if (_rq_received == _rq_size) {
             _rq_received = 0;
         }
-    };
+    }
     return 0;
 }

@@ -1168,11 +1172,6 @@ int RdmaEndpoint::BringUpQp(uint16_t lid, ibv_gid gid, uint32_t qp_num) {
         return -1;
     }

-    if (PostRecv(_rq_size, true) < 0) {
-        PLOG(WARNING) << "Fail to post recv wr";
-        return -1;
-    }
-
     attr.qp_state = IBV_QPS_RTR;
     attr.path_mtu = IBV_MTU_1024;  // TODO: support more mtu in future
     attr.ah_attr.grh.dgid = gid;
@@ -1202,6 +1201,11 @@ int RdmaEndpoint::BringUpQp(uint16_t lid, ibv_gid gid, uint32_t qp_num) {
         return -1;
     }

+    if (PostRecv(_rq_size, true) < 0) {
+        PLOG(WARNING) << "Fail to post recv wr";
+        return -1;
+    }
+
     attr.qp_state = IBV_QPS_RTS;
     attr.timeout = TIMEOUT;
     attr.retry_cnt = RETRY_CNT;
@@ -1293,7 +1297,8 @@ void RdmaEndpoint::DeallocateResources() {
 static const int MAX_CQ_EVENTS = 128;

 int RdmaEndpoint::GetAndAckEvents() {
-    int events = 0; void* context = NULL;
+    int events = 0;
+    void* context = NULL;
     while (1) {
         if (IbvGetCqEvent(_resource->comp_channel, &_resource->cq, &context) < 0) {
             if (errno != EAGAIN) {
@@ -1354,11 +1359,13 @@ void RdmaEndpoint::PollCq(Socket* m) {
                 // that the event arrives after the poll but before the notify,
                 // we should re-poll the CQ once after the notify to check if
                 // there is an available CQE.
-                if (ibv_req_notify_cq(ep->_resource->cq, 1) < 0) {
+                if (ibv_req_notify_cq(ep->_resource->cq, /*solicited_only=*/1) <
+                    0) {
                     const int saved_errno = errno;
-                    PLOG(WARNING) << "Fail to arm CQ comp channel: " << s->description();
+                    PLOG(WARNING)
+                        << "Fail to arm CQ comp channel: " << s->description();
                     s->SetFailed(saved_errno, "Fail to arm cq channel from %s: %s",
-                            s->description().c_str(), berror(saved_errno));
+                                s->description().c_str(), berror(saved_errno));
                     return;
                 }
                 notified = true;
@@ -1387,7 +1394,8 @@ void RdmaEndpoint::PollCq(Socket* m) {
                 PLOG(WARNING) << "Fail to handle RDMA completion, error status("
                               << wc[i].status << "): " << s->description();
                 s->SetFailed(ERDMA, "RDMA completion error(%d) from %s: %s",
-                             wc[i].status, s->description().c_str(), berror(ERDMA));
+                             wc[i].status, s->description().c_str(),
+                             berror(ERDMA));
                 continue;
             }

diff --git a/src/brpc/socket.cpp b/src/brpc/socket.cpp
index c49ca083..4ebba8d2 100644
--- a/src/brpc/socket.cpp
+++ b/src/brpc/socket.cpp
@@ -803,6 +803,7 @@ int Socket::WaitAndReset(int32_t expected_nref) {

 #if BRPC_WITH_RDMA
     if (_rdma_ep) {
+        LOG(WARNING) << "Reset RdmaEndpoint, " << description();
         _rdma_ep->Reset();
         _rdma_state = RDMA_UNKNOWN;
     }

server

BRPC 日志:
W0724 21:29:37.355377   978 external/brpc/src/brpc/input_messenger.cpp:247] Close Socket{id=8589934594 fd=297 addr=10.156.8.29:29765:37376} (0x560a5f5cc480): absolutely wrong message

应用层日志:
W0724 21:29:37.355577   838 ./xxx/yyy/zzz.h:119] Method:[FuncFoo] Fail to get response from shard#0 [E22]Close Socket{id=8589934594 fd=297 addr=10.156.8.29:29765:37376} (0x0x560a5f5cc480): absolutely wrong message

BRPC 日志
I0724 21:29:37.458484  1004 external/brpc/src/brpc/socket.cpp:2465] Checking Socket{id=8589934594 addr=10.156.8.29:29765} (0x560a5f5cc480)
I0724 21:29:37.458786  1016 external/brpc/src/brpc/socket.cpp:2525] Revived Socket{id=8589934594 addr=10.156.8.29:29765} (0x560a5f5cc480) (Connectable)
I0724 21:29:42.355888   985 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:418] Start handshake on 10.156.0.21:54384
I0724 21:29:42.362131   995 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:543] Handshake ends (use rdma) on Socket{id=8589934594 fd=297 addr=10.156.8.29:29765:54384} (0x0x560a5f5cc480)

client

I0724 21:29:37.446674  1045 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:279] Got socket EOF, Socket{id=25769805360 fd=300 addr=10.156.0.21:37376:29765} (0x557bf1110480)
I0724 21:29:37.449893  1090 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:560] Start handshake on Socket{id=8589943858 fd=1413 addr=10.156.0.21:54366:29765} (0x0x557ef5f2c000)
W0724 21:29:37.449923  1090 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:568] Fail to read Hello Message from client:Socket{id=8589943858 fd=1413 addr=10.156.0.21:54366:29765} (0x0x557ef5f2c000) 10.156.0.21:54366: Got EOF
I0724 21:29:42.347077  1047 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:560] Start handshake on Socket{id=17179869414 fd=1358 addr=10.156.0.21:54384:29765} (0x0x557bf0b22900)
I0724 21:29:42.353112  1097 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:696] Handshake ends (use rdma) on Socket{id=17179869414 fd=1358 addr=10.156.0.21:54384:29765} (0x0x557bf0b22900)
372046933 commented 1 year ago

RDMA handshake只需要一次就可以建立QP了吧? 看日志在Close Socket后,又发生了一次信息交换。 感觉像是触发了什么BUG。。。

Tuvie commented 1 year ago

Close Socket之后要自动重连吧。client没有自动重连的日志吗

372046933 commented 1 year ago

这些就是重连日志吧,不过为啥会突然Close Socket呢?这是预期内的行为吗?

I0724 21:29:37.458484  1004 external/brpc/src/brpc/socket.cpp:2465] Checking Socket{id=8589934594 addr=10.156.8.29:29765} (0x560a5f5cc480)
I0724 21:29:37.458786  1016 external/brpc/src/brpc/socket.cpp:2525] Revived Socket{id=8589934594 addr=10.156.8.29:29765} (0x560a5f5cc480) (Connectable)
I0724 21:29:42.355888   985 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:418] Start handshake on 10.156.0.21:54384
I0724 21:29:42.362131   995 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:543] Handshake ends (use rdma) on Socket{id=8589934594 fd=297 addr=10.156.8.29:29765:54384} (0x0x560a5f5cc480)
Tuvie commented 1 year ago

Close Socket的原因是检查到rpc消息格式错误。 Close Socket后触发client做健康检查(Checking Socket),tcp建连成功则认为健康检查通过(Revived Socket)。但是这个健康检查连接会直接close掉。但是server还是会准备rdma握手逻辑(start handshake)因为server不知道这是一个健康检查连接。但是因为client自己close了这个连接,所以server的握手失败(Got EOF)。后面的这个状态是没问题的。

根本的问题还是在rpc消息格式错误这里。

372046933 commented 1 year ago

哪里加日志方便排查问题呢?消息格式错误一般是啥原因导致的

Tuvie commented 1 year ago

可以按照下方的改动试下把发送方向和接收方向看到的message的内容打印下看看。 不过这个日志量级应该会很大。多rotate一下,看看能不能捕捉到出错。

--- a/src/brpc/rdma/rdma_endpoint.cpp
+++ b/src/brpc/rdma/rdma_endpoint.cpp
@@ -896,6 +896,7 @@ ssize_t RdmaEndpoint::CutFromIOBufList(butil::IOBuf** from, size_t ndata) {
             return -1;
         }

+        LOG_IF(INFO, FLAGS_rdma_trace_verbose) << _sbuf[_sq_current];
         ++_sq_current;
         if (_sq_current == _sq_size - RESERVED_WR_NUM) {
             _sq_current = 0;
@@ -950,6 +951,7 @@ ssize_t RdmaEndpoint::HandleCompletion(ibv_wc& wc) {
     case IBV_WC_RECV: {  // recv completion
         // Please note that only the first wc.byte_len bytes is valid
         if (wc.byte_len > 0) {
+            LOG_IF(INFO, FLAGS_rdma_trace_verbose) << _rbuf[_rq_received];
             if (wc.byte_len < (uint32_t)FLAGS_rdma_zerocopy_min_size) {
                 zerocopy = false;
             }
372046933 commented 1 year ago

好的,这个日志量可能没法落盘,实在是太大了(日志量是几个GB/s)。有没有从代码层面分析的可能呢?

另外请教下,当wc.byte_len < (uint32_t)FLAGS_rdma_zerocopy_min_size时,zerocopy为False。此时_rbuf_data[_rq_received]指向的内存是分配过的吗? https://github.com/apache/brpc/blob/92ad8828d49aeee97d88a7492727f2342e292d32/src/brpc/rdma/rdma_endpoint.cpp#L946

Tuvie commented 1 year ago

分配过的。

日志尝试收集一下吧。

372046933 commented 1 year ago

是在哪里分配的可以贴一下吗? 日志没法log这么大,做不了。就算全落下来,一次RPC需要几秒到十几秒,没法复现问题。

Tuvie commented 1 year ago

PostRecv里面分配的。

372046933 commented 1 year ago

说的是这里吧,zerocopy为False时,DoPostRecv接收的地址传给ibv_sge.addr,此时的_rbuf_data[_rq_received]在哪里分配并注册MR的呢? https://github.com/apache/brpc/blob/e6881ee35cca2fb56e6761675f7de1461e275dc1/src/brpc/rdma/rdma_endpoint.cpp#L1008-L1024

Tuvie commented 1 year ago

这些用的是内存池里的内存。一开始就注册了

372046933 commented 1 year ago

这个报错也有点奇怪,ProtocolType应该是BAIDU_STD才对

https://github.com/apache/brpc/blob/3bc4366b314b55427a2bcd4819d841e0bd8b6fb4/src/brpc/input_messenger.cpp#L115-L122

E0726 12:28:46.931434   969 external/brpc/src/brpc/input_messenger.cpp:122] Fail to parse response from 10.156.15.27:17657 by baidu_std at client-side
W0726 12:28:46.931502   969 external/brpc/src/brpc/input_messenger.cpp:247] Close Socket{id=8589934594 fd=1409 addr=10.156.15.27:17657:38500} (0x5628b07bc480): absolutely wrong message
372046933 commented 1 year ago
I0726 16:25:12.445185  1011 external/brpc/src/brpc/input_messenger.cpp:116] cur_index=1 preferred=1 result.error()=1 result.error_str()=try other protocols
I0726 16:25:12.445248  1011 external/brpc/src/brpc/input_messenger.cpp:116] cur_index=2 preferred=1 result.error()=1 result.error_str()=try other protocols
E0726 16:25:12.445270  1011 external/brpc/src/brpc/input_messenger.cpp:126] Fail to parse response from 10.156.15.16:20493 by baidu_std at client-side
W0726 16:25:12.445318  1011 external/brpc/src/brpc/input_messenger.cpp:251] Close Socket{id=8589934619 fd=1341 addr=10.156.15.16:20493:64774} (0x556c6752fcc0): absolutely wrong message
diff --git a/src/brpc/input_messenger.cpp b/src/brpc/input_messenger.cpp
index 43167d5b..787fc993 100644
--- a/src/brpc/input_messenger.cpp
+++ b/src/brpc/input_messenger.cpp
@@ -113,7 +113,11 @@ ParseResult InputMessenger::CutInputMessage(
             }

             if (m->CreatedByConnect()) {
-                if((ProtocolType)cur_index == PROTOCOL_BAIDU_STD) {
+                LOG(INFO) << "cur_index=" << cur_index
+                          << " preferred=" << preferred
+                          << " result.error()=" << result.error()
+                          << " result.error_str()=" << result.error_str();
+                if ((ProtocolType)cur_index == PROTOCOL_BAIDU_STD) {
                     // baidu_std may fall to streaming_rpc.
                     cur_index = (int)PROTOCOL_STREAMING_RPC;
                     continue;
@@ -332,7 +336,7 @@ void InputMessenger::OnNewMessages(Socket* m) {
     // - If the socket has only one message, the message will be parsed and
     //   processed in this bthread. nova-pbrpc and http works in this way.
     // - If the socket has several messages, all messages will be parsed (
-    //   meaning cutting from butil::IOBuf. serializing from protobuf is part of
+    //   meaning cutting from butil::IOBuf. Deserializing from protobuf is part of
     //   "process") in this bthread. All messages except the last one will be
     //   processed in separate bthreads. To minimize the overhead, scheduling
     //   is batched(notice the BTHREAD_NOSIGNAL and bthread_flush).
Tuvie commented 1 year ago

这些都是数据错误的结果,不是原因。至少先尝试出错的时候FATAL core出来吧

372046933 commented 1 year ago

哪个地方abort呢? 在应用层出现RPC失败的时候吗?出core了之后,查看那些变量呢?

Tuvie commented 1 year ago

就在出现协议错误这里。fatal前也顺便brpc::rdma::DumpMemoryPoolInfo()下。

另外,也贴下你和rdma相关的所有的flag设置看看。以及程序你的user_data是多大的?

SimonCqk commented 2 months ago

借楼问一下,brpc里可以直接对一个IOBuf执行RegisterMemoryForRdma么?没找到相关文档和代码