edsburke opened this issue 4 years ago
I happened to run EchoServer and EchoClient from the wangle examples, and it confirms the CLOSE_WAIT issue exists there too. CLOSE_WAIT sockets just accumulate and dangle forever; they are never released unless the process is killed.
Could anyone suggest what should be done to debug/fix this issue? Thanks.
netstat output of EchoClient
tcp 0 0 172.31.38.97:64115 54.118.66.170:8080 FIN_WAIT2
tcp 0 0 172.31.38.97:64117 54.118.66.170:8080 FIN_WAIT2
netstat output of EchoServer
tcp6 0 0 54.118.66.170:8080 172.31.38.97:64115 CLOSE_WAIT
tcp6 0 0 54.118.66.170:8080 172.31.38.97:64117 CLOSE_WAIT
lsof -p 2825 | grep TCPv6
EchoServe 2825 root 20u sock 0,8 0t0 2044556 protocol: TCPv6
EchoServe 2825 root 23u sock 0,8 0t0 2044795 protocol: TCPv6
Just verified that the latest code, v2020.04.06.00, has the same issue on Ubuntu 16.04. I ran ./EchoClient multiple times from one machine (172.31.38.97) while ./EchoServer was running on another machine (172.26.1.197). When EchoClient is done, many CLOSE_WAIT server sockets are left lingering.
Note that tcp_fin_timeout has been tuned on the client machine to be long enough (120 seconds) that EchoServer has a chance to send its LAST_ACK.
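For reference, the tuning described above can be applied with sysctl; this is a minimal sketch of one way to do it (the 120-second value matches what was described; running it requires root):

```shell
# Lengthen how long the client kernel keeps an orphaned FIN_WAIT2 socket,
# giving the server side more time to finish its half of the close.
sysctl -w net.ipv4.tcp_fin_timeout=120

# Verify the current value
sysctl net.ipv4.tcp_fin_timeout
```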
TCP: About FIN_WAIT_2, TIME_WAIT and CLOSE_WAIT
netstat -anp | grep EchoServer
tcp6 0 0 :::8080 :::* LISTEN 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48758 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48756 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48762 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48754 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48770 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48764 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48760 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48766 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48768 CLOSE_WAIT 26314/EchoServer
Trying to bump this thread. Hey Wangle team, could you please suggest how this should be debugged or fixed? Thanks a lot.
Has the Wangle team noticed this big issue from the user community? Please kindly advise how to fix it or work around it. Appreciated!
You should close the socket after the client leaves. Using EchoServer as an example:
class EchoHandler : public wangle::HandlerAdapter<std::string> {
 public:
  void read(Context* ctx, std::string msg) override {
    std::cout << "handling " << msg << std::endl;
    write(ctx, msg + "\r\n");
  }

  // Close the socket when the client sends EOF (FIN); otherwise the
  // connection lingers in CLOSE_WAIT.
  void readEOF(Context* ctx) override {
    ctx->fireClose();
  }
};
Hey Wangle team,
thank you for the wonderful work. We have been using wangle to build an RPC layer in our projects, and it works well except for the many dangling CLOSE_WAIT connections on Ubuntu 16, which eventually leave the server unresponsive. After much debugging effort, we are resorting to asking for hints here. Note that all destructors and close methods are called properly. I'll outline the code structure below. Your help on how to debug this is greatly appreciated!
Btw, wangle v2018.10.22.00 is used in this case.
Specifically, the client socket is in FIN_WAIT2 and the server socket is in CLOSE_WAIT. The client sockets soon disappear (probably forced closed by the OS), but the server sockets accumulate in CLOSE_WAIT and then become dangling sockets once they leave that state.
netstat output of client sockets:
netstat output of server sockets:
After waiting a while, the CLOSE_WAIT sockets transition to dangling status: lsof -p 5489 | grep TCPv6
Code-wise, on the client side, RpcClient forwards a request to an RpcConnection maintained by a ConnectionPool, keyed by connection id (e.g. host and port). RpcConnection internally holds an RpcService, which is a ClientDispatcher created by a ConnectionFactory.
On the server side, it's very straightforward: