carllerche / tower-web

A fast, boilerplate free, web framework for Rust
MIT License

CLOSE_WAIT issue and memory leak eventually trigger OOM-Killer #167

Closed · chastabor closed this issue 5 years ago

chastabor commented 5 years ago

I'm not sure where to post this, as it's probably a lower-level issue. From my understanding of CLOSE_WAIT, our application's end of the connection has received a FIN from the remote client on the other end, but the OS is waiting for our application to actually close its side of the connection. From the top, netstat, and strace results I'm seeing that our tower-web application is still trying to send data to the connection that is waiting to be closed. This may be causing a memory leak, which eventually causes the OOM-Killer to kill the application. Note that I cannot reliably reproduce this issue; it seems to happen at random. One time the OOM-Killer killed the application 3 times in one day; other times it takes a week. I posted what I'm seeing here:
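For reference, the CLOSE_WAIT semantics described above can be shown with a small self-contained Rust sketch (illustrative only, not part of the original report): the peer's FIN arrives as a zero-length read, and from that point until our side closes the socket, the kernel reports it as CLOSE_WAIT in netstat/ss.

```rust
use std::io::Read;
use std::net::{TcpListener, TcpStream};
use std::thread;

// Accepts one local connection whose peer closes immediately, and returns
// the result of read() after the peer's FIN (0 bytes == EOF). Between that
// read and the drop of `server_side`, the socket sits in CLOSE_WAIT.
fn wait_for_peer_fin() -> usize {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();

    // Client connects and immediately drops the stream, which sends a FIN.
    let client = thread::spawn(move || {
        let _stream = TcpStream::connect(addr).unwrap();
        // dropped here -> FIN sent to the server side
    });

    let (mut server_side, _) = listener.accept().unwrap();
    client.join().unwrap();

    // Ok(0) means we received the peer's FIN; our end is now in CLOSE_WAIT
    // until `server_side` is dropped (i.e. until the application closes it).
    let mut buf = [0u8; 16];
    server_side.read(&mut buf).unwrap()
}

fn main() {
    assert_eq!(wait_for_peer_fin(), 0);
    println!("peer closed; our socket was in CLOSE_WAIT until we dropped it");
}
```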

#----- (1) 13:49:20
#----- Found two connections in a CLOSE_WAIT state with large send queues.
#----- NOTE: the strace output (1-2) posted at the end shows a large number
#----- of sendto requests issued at this time, which are probably causing
#----- the high CPU utilization seen in the top (2) output.
netstat -tnn 
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 172.17.0.2:8443         62.<remove-ip>:44013       TIME_WAIT  
tcp        0      0 172.17.0.2:8443         10.216.80.112:59662     ESTABLISHED
tcp        0      0 172.17.0.2:8443         10.219.159.103:50693    ESTABLISHED
tcp        1 148920 172.17.0.2:8443         72.<remove-ip>:54774     CLOSE_WAIT 
tcp        1 1608920 172.17.0.2:8443         72.<remove-ip>:54771     CLOSE_WAIT 

#----- (2) 13:50:28
#----- Application found in a bad state with high CPU utilization (200%) and high memory usage
top -o %MEM
top - 13:50:28 up 1 day, 15:12,  3 users,  load average: 2.17, 2.10, 1.37
Tasks: 224 total,   1 running, 223 sleeping,   0 stopped,   0 zombie
%Cpu(s): 40.6 us, 59.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8010100 total,   143176 free,  2739396 used,  5127528 buff/cache
KiB Swap:  1048572 total,   941128 free,   107444 used.  4871688 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                         
26857 apache    20   0 2553392   2.4g   2800 S 200.0 31.6  48:19.45 mediadepot                                                                                                      
  559 root      20   0   81220  35084  34788 S   0.0  0.4   0:15.44 systemd-journal                                                                                                 
 1400 root      20   0  957820  25596  10412 S   0.0  0.3   2:52.23 dockerd-latest                                                                                                  
 ...

#----- (3) 13:52:30
#----- Connections are finally gone after at least a minute, maybe more.
netstat -tnn 
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 172.17.0.2:8443         10.216.80.112:59662     ESTABLISHED
tcp        0      0 172.17.0.2:8443         10.219.159.103:50693    ESTABLISHED

#----- (4) 13:52:38
#----- Memory does not go back down. It usually drops back to 0.1% when idle.
top -o %MEM
top - 13:52:38 up 1 day, 15:15,  3 users,  load average: 1.64, 2.02, 1.43
Tasks: 227 total,   1 running, 226 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8010100 total,   902632 free,  1979752 used,  5127716 buff/cache
KiB Swap:  1048572 total,   941128 free,   107444 used.  5631340 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                         
26857 apache    20   0 2549796   1.7g   2800 S   0.0 22.1  50:15.81 mediadepot
  559 root      20   0   81220  35384  35088 S   0.0  0.4   0:15.48 systemd-journal
 1400 root      20   0  957820  25604  10412 S   0.0  0.3   2:52.34 dockerd-latest
  ...

#----- (5) 14:03:23
#----- CPU and memory are back up after another connection.
#----- CPU usually doesn't get very high when the application is in
#----- a good state, especially for a single request, and definitely
#----- doesn't stay there after the connection is closed.
top -o %MEM
top - 14:03:23 up 1 day, 15:25,  2 users,  load average: 0.93, 0.75, 0.94
Tasks: 220 total,   1 running, 219 sleeping,   0 stopped,   0 zombie
%Cpu(s):  9.3 us, 36.0 sy,  0.0 ni, 54.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8010100 total,   127044 free,  2660464 used,  5222592 buff/cache
KiB Swap:  1048572 total,   941128 free,   107444 used.  4953144 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                         
26857 apache    20   0 2551852   2.3g   2800 S  54.8 30.7  52:24.09 mediadepot                                                                                                      
  559 root      20   0   81220  35756  35460 S   0.0  0.4   0:15.55 systemd-journal                                                                                                 
 1400 root      20   0  957820  25808  10408 S   0.0  0.3   2:53.02 dockerd-latest                                                                                                  
  ...

#----- (6) 14:04:05
#----- Notice what looks to me like the application once again trying to push
#----- data to a socket in a CLOSE_WAIT state. At this point I can no longer
#----- run strace on the application; I get the following strace error:
#----- strace: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
netstat -tnn 
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 172.17.0.2:8443         10.216.80.112:59662     ESTABLISHED
tcp       32 250504 172.17.0.2:8443         96.<remove-ip>:51173       CLOSE_WAIT 
tcp        0      0 172.17.0.2:8443         147.<remove-ip>:56458    ESTABLISHED
tcp        0      0 172.17.0.2:8443         10.219.159.103:50693    ESTABLISHED

#----- (1-2) strace output (with many duplicate entries removed) showing
#----- what the application was doing during the CLOSE_WAIT state, with no
#----- packets seen leaving the server over the network. This busy work is
#----- probably what pegs the CPU utilization at 200% while little actual
#----- work gets done:
MDPID=$(docker inspect --format {{.State.Pid}} mediadepot)
strace -fp $MDPID
...
[pid 26885] sendto(9, "Q<K\274e\374\27]\30\270\\\211\275\330]\325Lp\334B\224\223g>\327\224\237l\325H7\326"..., 8814, MSG_NOSIGNAL, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 26885] sendto(9, "Q<K\274e\374\27]\30\270\\\211\275\330]\325Lp\334B\224\223g>\327\224\237l\325H7\326"..., 8814, MSG_NOSIGNAL, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 26885] sendto(9, "Q<K\274e\374\27]\30\270\\\211\275\330]\325Lp\334B\224\223g>\327\224\237l\325H7\326"..., 8814, MSG_NOSIGNAL, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
...
... at least 100K+ more of the same message (my terminal only keeps that many lines)
...
[pid 26885] sendto(9, "Q<K\274e\374\27]\30\270\\\211\275\330]\325Lp\334B\224\223g>\327\224\237l\325H7\326"..., 8814, MSG_NOSIGNAL, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 26885] sendto(9, "Q<K\274e\374\27]\30\270\\\211\275\330]\325Lp\334B\224\223g>\327\224\237l\325H7\326"..., 8814, MSG_NOSIGNAL, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 26885] sendto(9, "Q<K\274e\374\27]\30\270\\\211\275\330]\325Lp\334B\224\223g>\327\224\237l\325H7\326"..., 8814, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
[pid 26884] <... epoll_pwait resumed> [{EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=3632267270, u64=3632267270}}], 1024, -1, NULL, 8) = 1
[pid 26885] <... sendto resumed> )      = -1 ETIMEDOUT (Connection timed out)
[pid 26884] epoll_pwait(4,  <unfinished ...>
[pid 26885] write(1, "failed to serve connection; err="..., 131) = 131
[pid 26885] epoll_ctl(4, EPOLL_CTL_DEL, 9, 0x7f02190f1ef8) = 0
[pid 26885] close(9)                    = 0
[pid 26885] madvise(0x18d2000, 270336, MADV_DONTNEED) = 0
[pid 26885] madvise(0x18d2000, 532480, MADV_DONTNEED) = 0
[pid 26885] madvise(0x40c0000, 262144, MADV_DONTNEED) = 0
[pid 26885] madvise(0x40c0000, 524288, MADV_DONTNEED) = 0
[pid 26885] madvise(0x18d2000, 1077248, MADV_DONTNEED) = 0
[pid 26885] madvise(0x40c0000, 1048576, MADV_DONTNEED) = 0
[pid 26885] madvise(0x40c0000, 2101248, MADV_DONTNEED) = 0
[pid 26885] madvise(0x40c0000, 4206592, MADV_DONTNEED) = 0
[pid 26885] madvise(0x40c0000, 8400896, MADV_DONTNEED) = 0
[pid 26885] madvise(0x40c0000, 16789504, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1a64000, 516096, MADV_DONTNEED) = 0
[pid 26885] madvise(0x40c0000, 33566720, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1ae5000, 512000, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1b65000, 516096, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1be6000, 512000, MADV_DONTNEED) = 0
[pid 26885] madvise(0x18d2000, 2162688, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1ae5000, 1040384, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1be6000, 675840, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1ae5000, 1728512, MADV_DONTNEED) = 0
[pid 26885] madvise(0x206a000, 266240, MADV_DONTNEED) = 0
[pid 26885] madvise(0x206a000, 532480, MADV_DONTNEED) = 0
[pid 26885] madvise(0x206a000, 1056768, MADV_DONTNEED) = 0
[pid 26885] madvise(0x206a000, 2109440, MADV_DONTNEED) = 0
[pid 26885] madvise(0x307e000, 4214784, MADV_DONTNEED) = 0
[pid 26885] madvise(0xa4af000, 274432, MADV_DONTNEED) = 0
[pid 26885] madvise(0xa4af000, 536576, MADV_DONTNEED) = 0
[pid 26885] madvise(0xa4af000, 1064960, MADV_DONTNEED) = 0
[pid 26885] madvise(0xa4af000, 2117632, MADV_DONTNEED) = 0
[pid 26885] madvise(0xa4af000, 4222976, MADV_DONTNEED) = 0
[pid 26885] madvise(0xa4af000, 536891392, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1d2f000, 258048, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1d2f000, 520192, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1d2f000, 1048576, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1d2f000, 2101248, MADV_DONTNEED) = 0
[pid 26885] madvise(0x2d0d7000, 266240, MADV_DONTNEED) = 0
[pid 26885] madvise(0x2d0d7000, 528384, MADV_DONTNEED) = 0
[pid 26885] madvise(0x2d0d7000, 1056768, MADV_DONTNEED) = 0
[pid 26885] madvise(0x2d0d7000, 2109440, MADV_DONTNEED) = 0
[pid 26885] madvise(0x2d0d7000, 4198400, MADV_DONTNEED) = 0
[pid 26885] madvise(0x2d0d7000, 8392704, MADV_DONTNEED) = 0
[pid 26885] madvise(0x2d0d7000, 16781312, MADV_DONTNEED) = 0
[pid 26885] madvise(0x2d0d7000, 33558528, MADV_DONTNEED) = 0
[pid 26885] madvise(0x2d0d7000, 67108864, MADV_DONTNEED) = 0
[pid 26885] madvise(0x32c70000, 258048, MADV_DONTNEED) = 0
[pid 26885] madvise(0x32c70000, 520192, MADV_DONTNEED) = 0
[pid 26885] madvise(0x32c70000, 1048576, MADV_DONTNEED) = 0
[pid 26885] madvise(0x32c70000, 2101248, MADV_DONTNEED) = 0
[pid 26885] madvise(0x32c70000, 4206592, MADV_DONTNEED) = 0
[pid 26885] madvise(0x32c70000, 8400896, MADV_DONTNEED) = 0
[pid 26885] madvise(0x32c70000, 16789504, MADV_DONTNEED) = 0
[pid 26885] madvise(0x1fd8000, 4222976, MADV_DONTNEED) = 0
[pid 26885] madvise(0x32c70000, 33566720, MADV_DONTNEED) = 0
[pid 26885] madvise(0x32c70000, 67121152, MADV_DONTNEED) = 0
[pid 26885] madvise(0xa454000, 368640, MADV_DONTNEED) = 0
[pid 26885] munmap(0x7f0218762000, 1576960) = 0
[pid 26885] madvise(0x1c99000, 2912256, MADV_DONTNEED) = 0
[pid 26885] sched_yield()               = 0
[pid 26885] sched_yield()               = 0
[pid 26885] sched_yield()               = 0
[pid 26885] futex(0x7f02190f2024, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 26887] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 26887] futex(0x7f0218ce9f74, FUTEX_WAIT_PRIVATE, 2, {1, 279724892}) = -1 ETIMEDOUT (Connection timed out)
[pid 26887] futex(0x7f0218ce9f74, FUTEX_WAIT_PRIVATE, 2, {0, 59654493}) = -1 ETIMEDOUT (Connection timed out)
[pid 26887] futex(0x7f02190f2024, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 26885] <... futex resumed> )       = 0
[pid 26887] futex(0x7f0218cea024, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 26885] futex(0x7f0218cea024, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 26887] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 26885] <... futex resumed> )       = 0
[pid 26887] futex(0x17bafc4, FUTEX_WAIT_PRIVATE, 2147483664, NULL <unfinished ...>
[pid 26885] futex(0x17bafc4, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 26887] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 26887] futex(0x7f0218ce9f74, FUTEX_WAIT_PRIVATE, 2, {19, 139346331} <unfinished ...>
[pid 26885] <... futex resumed> )       = 0
[pid 26885] futex(0x7f0218ce9f74, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 26887] <... futex resumed> )       = 0
[pid 26885] <... futex resumed> )       = 1
[pid 26887] futex(0x17bafc4, FUTEX_WAIT_PRIVATE, 2147483664, NULL <unfinished ...>
[pid 26885] futex(0x17bafc4, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 26887] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 26885] <... futex resumed> )       = 0
[pid 26887] sched_yield( <unfinished ...>
[pid 26885] sched_yield( <unfinished ...>
[pid 26887] <... sched_yield resumed> ) = 0
[pid 26887] sched_yield( <unfinished ...>
[pid 26885] <... sched_yield resumed> ) = 0
[pid 26887] <... sched_yield resumed> ) = 0
[pid 26885] sched_yield( <unfinished ...>
[pid 26887] sched_yield( <unfinished ...>
[pid 26885] <... sched_yield resumed> ) = 0
[pid 26887] <... sched_yield resumed> ) = 0
[pid 26885] sched_yield( <unfinished ...>
[pid 26887] futex(0x7f0218ce9f74, FUTEX_WAIT_PRIVATE, 2, {19, 138772029} <unfinished ...>
[pid 26885] <... sched_yield resumed> ) = 0
[pid 26885] futex(0x7f02190f2024, FUTEX_WAIT_PRIVATE, 2, NULL^C
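The tight sendto loop above returns EAGAIN because the kernel send buffer for that socket is full (the peer has stopped reading). A minimal, self-contained Rust sketch of that condition (illustrative only, not from the original report): a nonblocking write to a local peer that never reads eventually fills the kernel buffers and starts returning `WouldBlock`, which is EAGAIN at the syscall level. At that point a well-behaved event loop must wait for writability (EPOLLOUT) rather than retrying in a busy loop like the one traced above.

```rust
use std::io::{ErrorKind, Write};
use std::net::{TcpListener, TcpStream};

// Returns the number of bytes the kernel buffered before write() started
// returning WouldBlock (EAGAIN), because the peer never reads.
fn fill_send_buffer() -> usize {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let mut sender = TcpStream::connect(addr).unwrap();
    let (_receiver, _) = listener.accept().unwrap(); // kept open, never read

    sender.set_nonblocking(true).unwrap();
    let chunk = [0u8; 8192];
    let mut written = 0;
    loop {
        match sender.write(&chunk) {
            Ok(n) => written += n,
            // EAGAIN: buffers are full. The correct next step is to register
            // interest in writability and go to sleep, not to retry in a
            // tight loop, which just burns CPU as in the strace above.
            Err(e) if e.kind() == ErrorKind::WouldBlock => break,
            Err(e) => panic!("unexpected error: {}", e),
        }
    }
    written
}

fn main() {
    let written = fill_send_buffer();
    assert!(written > 0);
    println!("kernel buffered {} bytes before WouldBlock", written);
}
```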
chastabor commented 5 years ago

So the application doesn't always cause an OOM; sometimes it just becomes unresponsive, even on health checks, which don't send files. This seems more like a hyper async issue.

netstat -tnn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:35896       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:40544       CLOSE_WAIT 
tcp      207      0 172.17.0.2:8443         <search-ip----------->:41318       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:54688       CLOSE_WAIT 
tcp      149      0 172.17.0.2:8443         <remote-client-ru1-ip>:59279       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:33500       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:42302       CLOSE_WAIT 
tcp      149      0 172.17.0.2:8443         <remote-client-ru1-ip>:52959       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:45622       CLOSE_WAIT 
tcp      149      0 172.17.0.2:8443         <remote-client-ru1-ip>:50905       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:50168       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:45468       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:53728       CLOSE_WAIT 
tcp      207      0 172.17.0.2:8443         <search-ip----------->:40490       CLOSE_WAIT 
tcp        0      0 172.17.0.2:8443         <local-client-1-ip--->:58551       ESTABLISHED
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:50436       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:43196       CLOSE_WAIT 
... (total of 126 CLOSE_WAIT sockets with the majority from health-check-ip)
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:43020       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:44078       CLOSE_WAIT 
tcp        1      0 172.17.0.2:8443         <remote-client-de1-ip>:44776       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:43616       CLOSE_WAIT 
tcp      283      0 172.17.0.2:8443         <remote-client-cn3-ip>:50685       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:35900       CLOSE_WAIT 
tcp      244      0 172.17.0.2:8443         <health-check-ip----->:48652       CLOSE_WAIT 
tcp      149      0 172.17.0.2:8443         <remote-client-ru1-ip>:45451       CLOSE_WAIT 
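The CLOSE_WAIT buildup shown above can also be counted directly from `/proc/net/tcp`, which is where netstat gets its data (Linux-specific; a hypothetical monitoring sketch, not part of the original report). Socket state `08` in that file is CLOSE_WAIT.

```rust
use std::fs;

// Counts CLOSE_WAIT sockets in the text of /proc/net/tcp. Each data line's
// 4th whitespace-separated field is the hex socket state; 08 == CLOSE_WAIT.
fn count_close_wait(proc_net_tcp: &str) -> usize {
    proc_net_tcp
        .lines()
        .skip(1) // skip the header line
        .filter(|line| line.split_whitespace().nth(3) == Some("08"))
        .count()
}

fn main() {
    // IPv4 only here; /proc/net/tcp6 has the same layout for IPv6 sockets.
    let data = fs::read_to_string("/proc/net/tcp").unwrap_or_default();
    println!("CLOSE_WAIT sockets: {}", count_close_wait(&data));
}
```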

Closing out this issue, as this doesn't seem to be the place to post it. I just bumped all the libraries to match what tower-web's Cargo manifest uses and set up the build process to use the latest Rust 2018 environment. We'll see if these errors persist; if so, I will post the issue to the hyper repo.