Haven't had time to look into this but logging here for future reference. We have had about a dozen cases in the last week where we find the strata binary hung up for over 24h (normal time is < 5 min on average). In the below example this occurred while doing garbage collection, but I have also seen this happen with backups.
but tcpdump doesn't show any activity on this port.
$ sudo tcpdump -i eth0 -n tcp src port 51681
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel
The connection is clearly dead but the process or kernel haven't figured it out. In any case, the workaround is to kick (kill -15) and let the next backup run as scheduled. Any missed files will get picked up on the next run.
Haven't had time to look into this but logging here for future reference. We have had about a dozen cases in the last week where we find the strata binary hung up for over 24h (normal time is < 5 min on average). In the below example this occurred while doing garbage collection, but I have also seen this happen with backups.
lsof shows one active connection to AWS
netstat confirms this:
but tcpdump doesn't show any activity on this port.
The connection is clearly dead but the process or kernel haven't figured it out. In any case, the workaround is to kick (kill -15) and let the next backup run as scheduled. Any missed files will get picked up on the next run.