apache / brpc

brpc is an industrial-grade RPC framework written in C++, often used in high-performance systems such as search, storage, machine learning, advertising, and recommendation. "brpc" means "better RPC".
https://brpc.apache.org
Apache License 2.0

Connection stuck in CLOSE_WAIT state causes health check to fail #2662

Open icexin opened 3 weeks ago

icexin commented 3 weeks ago

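Describe the bug: After the server crashed once, RPCs on the client kept failing because the health check kept failing, with the error [E112]Fail to select server from xxx. On the affected machine, the connection can be seen in the CLOSE_WAIT state.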

Versions: OS: Ubuntu 20.04, Compiler: clang-8, brpc: 1.8.0, protobuf: 3.15.8

Additional context/screenshots

chenBright commented 2 weeks ago

Do you see this log? https://github.com/apache/brpc/blob/2e183187bcbccc39c7da8dde2a98d02a7a031279/src/brpc/socket.cpp#L2564-L2567

chenBright commented 2 weeks ago

Does the CLOSE_WAIT state last for a long time?

icexin commented 2 weeks ago

Yes, the connection stayed in CLOSE_WAIT until we restarted manually.

chenBright commented 2 weeks ago

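It looks like some RPCs have not finished yet. The framework waits until every RPC on the connection has finished before it closes the fd, and only then starts the health check.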
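To illustrate the behavior described above, here is a minimal sketch of "close only after the last user is done", written with `std::shared_ptr`. It is not brpc's code: `FakeSocket`, `PendingRpc`, and both helper functions are hypothetical names made up for this example, and a boolean flag stands in for the real `close(fd)` call.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a connection object; this is NOT brpc's Socket.
class FakeSocket {
public:
    explicit FakeSocket(bool* closed_flag) : closed_flag_(closed_flag) {}
    // The "fd" is released only here, i.e. when the last owner lets go.
    ~FakeSocket() { *closed_flag_ = true; }
    FakeSocket(const FakeSocket&) = delete;
    FakeSocket& operator=(const FakeSocket&) = delete;

private:
    bool* closed_flag_;
};

// Hypothetical in-flight RPC that pins the connection while it is running.
struct PendingRpc {
    std::shared_ptr<FakeSocket> socket;
};

// Reports whether the socket was already closed right after the connection
// owner dropped its reference while an RPC still held one.
bool closed_while_rpc_pending() {
    bool closed = false;
    auto owner = std::make_shared<FakeSocket>(&closed);
    PendingRpc rpc{owner};        // second reference, held by the RPC
    owner.reset();                // connection failed: the owner lets go
    const bool result = closed;   // the RPC still keeps the socket alive
    rpc.socket.reset();           // clean up before `closed` goes out of scope
    return result;
}

// Reports whether the socket is closed once every reference is released.
bool closed_after_all_released() {
    bool closed = false;
    auto owner = std::make_shared<FakeSocket>(&closed);
    PendingRpc rpc{owner};
    owner.reset();
    rpc.socket.reset();           // last reference gone: destructor runs
    return closed;
}
```

In this sketch the first helper returns `false` and the second returns `true`: as long as one RPC (or anything else) holds a reference, the close step is postponed, and so is whatever is scheduled to run after it.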

chenBright commented 2 weeks ago

Also, if the server has not come back up, the health check will not succeed either.

icexin commented 2 weeks ago

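Here it is the client that is in the CLOSE_WAIT state. We use synchronous RPC, so the RPCs should finish quickly. After the server came back up later, the client still kept failing health checks because of this state.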

chenBright commented 2 weeks ago

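If the connection stays in CLOSE_WAIT for a long time, something is probably holding a reference to the socket, so the fd is never closed and the health check never gets to run.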
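For readers unfamiliar with the state itself, the following is a small sketch using plain POSIX sockets on loopback (no brpc involved) of how a connection ends up in CLOSE_WAIT: the peer closes, the local side sees EOF, and the socket stays in CLOSE_WAIT until the local side calls `close()`. The struct `PeerCloseResult` and the function `observe_peer_close` are names invented for this example.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cstring>

// Hypothetical result type for this example.
struct PeerCloseResult {
    bool setup_ok;                  // false if the loopback connection could not be set up
    ssize_t read_after_peer_close;  // what read() returned after the peer closed
};

PeerCloseResult observe_peer_close() {
    PeerCloseResult r{false, -1};

    int listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0) return r;

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the kernel pick a free port
    socklen_t len = sizeof(addr);
    if (bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        listen(listener, 1) != 0 ||
        getsockname(listener, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
        close(listener);
        return r;
    }

    int client = socket(AF_INET, SOCK_STREAM, 0);
    if (client < 0) {
        close(listener);
        return r;
    }
    if (connect(client, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        close(client);
        close(listener);
        return r;
    }
    int served = accept(listener, nullptr, nullptr);
    if (served < 0) {
        close(client);
        close(listener);
        return r;
    }
    r.setup_ok = true;

    // The "server" side goes away: its kernel sends a FIN to the client.
    close(served);

    // The client sees end-of-file once the FIN arrives.
    char buf[1];
    r.read_after_peer_close = read(client, buf, sizeof(buf));

    // From here until close(client), the client socket is in CLOSE_WAIT
    // (observable with tools such as `ss -tan` or `netstat`).
    close(client);
    close(listener);
    return r;
}
```

If the final `close(client)` were skipped or delayed, for instance because some object still owned the fd, the socket would remain in CLOSE_WAIT for as long as the process kept it open, which matches the symptom reported in this thread.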