Closed chastabor closed 5 years ago
So the application doesn't always hit an OOM; sometimes it simply becomes unresponsive, even to health checks, which don't send files. This seems more like a hyper async issue.
netstat -tnn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:35896 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:40544 CLOSE_WAIT
tcp 207 0 172.17.0.2:8443 <search-ip----------->:41318 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:54688 CLOSE_WAIT
tcp 149 0 172.17.0.2:8443 <remote-client-ru1-ip>:59279 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:33500 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:42302 CLOSE_WAIT
tcp 149 0 172.17.0.2:8443 <remote-client-ru1-ip>:52959 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:45622 CLOSE_WAIT
tcp 149 0 172.17.0.2:8443 <remote-client-ru1-ip>:50905 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:50168 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:45468 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:53728 CLOSE_WAIT
tcp 207 0 172.17.0.2:8443 <search-ip----------->:40490 CLOSE_WAIT
tcp 0 0 172.17.0.2:8443 <local-client-1-ip--->:58551 ESTABLISHED
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:50436 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:43196 CLOSE_WAIT
... (total of 126 CLOSE_WAIT sockets with the majority from health-check-ip)
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:43020 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:44078 CLOSE_WAIT
tcp 1 0 172.17.0.2:8443 <remote-client-de1-ip>:44776 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:43616 CLOSE_WAIT
tcp 283 0 172.17.0.2:8443 <remote-client-cn3-ip>:50685 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:35900 CLOSE_WAIT
tcp 244 0 172.17.0.2:8443 <health-check-ip----->:48652 CLOSE_WAIT
tcp 149 0 172.17.0.2:8443 <remote-client-ru1-ip>:45451 CLOSE_WAIT
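For reference, here is a minimal sketch of how a socket ends up looking like the rows above, using only the Rust standard library rather than hyper or tower-web. The function names `accept_after_peer_closed` and `drain_and_close` are hypothetical, and the loopback setup is an assumption made for illustration. The peer sends a request and closes; until the server reads to EOF and closes its own end, the kernel should keep the server-side socket in CLOSE_WAIT. A non-zero Recv-Q in that state suggests the request bytes were never read by the application.

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};

/// Hypothetical helper: a client connects over loopback, sends `request`,
/// and closes. The returned server-side stream has a FIN from the peer
/// pending and the request still unread in its receive queue.
fn accept_after_peer_closed(request: &[u8]) -> std::io::Result<TcpStream> {
    // Port 0 asks the OS for any free port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    let mut client = TcpStream::connect(addr)?;
    client.write_all(request)?;
    client.shutdown(Shutdown::Write)?; // the peer sends its FIN here
    drop(client); // the peer closes its socket entirely

    let (server_side, _peer) = listener.accept()?;
    Ok(server_side)
}

/// Reads until EOF (a read of 0 bytes means the peer's FIN was seen),
/// then closes the socket so it can leave CLOSE_WAIT.
/// Returns the number of bytes that were waiting to be read.
fn drain_and_close(mut stream: TcpStream) -> std::io::Result<usize> {
    let mut buf = [0u8; 1024];
    let mut total = 0;
    loop {
        let n = stream.read(&mut buf)?;
        if n == 0 {
            break; // EOF: the peer has closed its sending side
        }
        total += n;
    }
    // Dropping the stream closes the file descriptor. If this never
    // happens, the socket stays in CLOSE_WAIT for the life of the process.
    drop(stream);
    Ok(total)
}
```

Calling `drain_and_close(accept_after_peer_closed(b"GET /health HTTP/1.1\r\n\r\n")?)` returns the length of the request. If the second call is skipped and the stream is kept alive instead, running `netstat -tnn` while the process is still up should show the socket in CLOSE_WAIT, which is the pattern a stalled accept or read path would produce.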
Closing out this issue, as this doesn't seem to be the place to post it. I just bumped all the libraries to match what is used in tower-web's Cargo file and set up the build process to use the latest Rust 2018 environment. I will see if these errors persist. If so, I will post an issue to the hyper repo.
I'm not sure where to post this, as it is probably a lower-level issue. From my understanding of CLOSE_WAIT, our application's end of the connection has received a FIN from the remote client, but the OS is waiting for our application to actually close its side of the connection. From the top, netstat, and strace results, I'm seeing that our tower-web application is still trying to send data on connections that are waiting to be closed. This may be causing a memory leak, which eventually causes the OOM-Killer to kill the application. Note that I cannot replicate this issue on demand; it seems to happen at random. One time I found that the OOM-Killer had killed the application 3 times in one day. Other times it takes a week. I posted what I'm seeing here: