facebookarchive / rocks-strata


strata AWS connections hang sometimes #22

Open tredman opened 7 years ago

tredman commented 7 years ago

I haven't had time to look into this, but I'm logging it here for future reference. In the last week we have had about a dozen cases where the strata binary hung for over 24 hours (a normal run takes under 5 minutes on average). In the example below the hang occurred during garbage collection, but I have also seen it happen during backups.

$ ps -ef | grep strata
root      24099  19667  0 Aug22 ?        00:02:21 /usr/bin/strata --bucket=<our bucket> --region=us-east-1 --bucket-prefix=mongo-rocks gc --replica-id=<replica id>

lsof shows one active connection to AWS:

$ sudo lsof -p 24099
strata  24099 root  cwd    DIR    202,1     4096    16386 /root
strata  24099 root  rtd    DIR    202,1     4096        2 /
strata  24099 root  txt    REG    202,1  7663776    34085 /usr/bin/strata
strata  24099 root  mem    REG    202,1  1807032   395277 /lib/x86_64-linux-gnu/libc-2.15.so
strata  24099 root  mem    REG    202,1   135366   395280 /lib/x86_64-linux-gnu/libpthread-2.15.so
strata  24099 root  mem    REG    202,1   149280   395283 /lib/x86_64-linux-gnu/ld-2.15.so
strata  24099 root    0r  FIFO      0,8      0t0 17782110 pipe
strata  24099 root    1w  FIFO      0,8      0t0 17794429 pipe
strata  24099 root    2w  FIFO      0,8      0t0 17794429 pipe
strata  24099 root    3r   CHR      1,9      0t0     3081 /dev/urandom
strata  24099 root    4u   CHR      1,3      0t0     3076 /dev/null
strata  24099 root    5u  IPv4 17821048      0t0      TCP ip-10-252-0-135.ec2.internal:51681->s3-1.amazonaws.com:https (ESTABLISHED)
strata  24099 root    6u  0000      0,9        0    11491 anon_inode
strata  24099 root    9w  FIFO      0,8      0t0 17780115 pipe
strata  24099 root   11r   REG    202,1        0     6639 /tmp/mtools_backup.lock
strata  24099 root   63w  FIFO      0,8      0t0 17780115 pipe

netstat confirms this:

$ netstat -ant | grep 51681
tcp        0      0        ESTABLISHED

But tcpdump shows no traffic on this port:

$ sudo tcpdump -i eth0 -n tcp src port 51681
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
0 packets captured
0 packets received by filter
0 packets dropped by kernel

The connection is clearly dead, but neither the process nor the kernel has noticed. In any case, the workaround is to kick the process (kill -15) and let the next backup run as scheduled. Any missed files will get picked up on the next run.
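
One possible way to automate that workaround, as a rough sketch rather than a tested fix: wrap the invocation in GNU coreutils `timeout` so a hung run receives SIGTERM on its own. The function name `run_with_deadline`, the 30-minute limit in the usage comment, and the 30-second grace period before SIGKILL are assumptions for illustration, not anything strata provides.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, sending SIGTERM (kill -15) if it
# exceeds the given time limit. Requires GNU coreutils `timeout`.
run_with_deadline() {
    local limit="$1"
    shift
    local status=0
    # --kill-after escalates to SIGKILL if the process ignores SIGTERM.
    timeout --signal=TERM --kill-after=30s "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        # 124 is the exit status `timeout` uses when the limit was reached.
        echo "run_with_deadline: '$1' exceeded ${limit}, sent SIGTERM" >&2
    fi
    return "$status"
}

# Usage (limit chosen arbitrarily, well above the normal < 5 min run):
#   run_with_deadline 30m /usr/bin/strata --bucket=<our bucket> \
#       --region=us-east-1 --bucket-prefix=mongo-rocks gc --replica-id=<replica id>
```

This only bounds how long a hung run can sit there. It does not explain why the connection goes dead, so the underlying bug would still need a timeout or keepalive on the S3 connection itself.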