chime / terraform-aws-alternat

High availability implementation of AWS NAT instances.

Debugging stuck connections #62

Closed Halama closed 1 year ago

Halama commented 1 year ago

Hello, we are trying to use Alternat as a replacement for our Managed NAT Gateway. In our use case, data from various sources is uploaded to and downloaded from the internet through NAT.

We are facing random stuck connections when downloading data from MySQL via Alternat. We do not see these errors happening with the same configurations when using the Managed NAT Gateway. It's important to note that there were no Alternat failovers (route changes) when the connection got stuck. We have confirmed that the issue is definitely related to routing via Alternat. Unfortunately, we are not able to simulate this issue in our testing environment in isolation.

We have implemented the monitoring provided by ENA, and we do not see any limits being hit at the times of the errors; we should not even be close to them. Our instance is currently an m6g.4xlarge (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html for its bandwidth limits).
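
For reference, the ENA allowance counters can also be read directly on the instance with ethtool; this is just a quick check (assuming eth0 is the NAT interface and a recent ENA driver that exposes the allowance counters):

# Non-zero values indicate packets queued or dropped because an instance-level limit was exceeded
ethtool -S eth0 | grep allowance_exceeded
# Counters of interest: bw_in_allowance_exceeded, bw_out_allowance_exceeded,
# pps_allowance_exceeded, conntrack_allowance_exceeded, linklocal_allowance_exceeded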

(screenshots of the ENA network performance metrics)

When a connection got stuck, we performed the following checks. The conntrack statistics (conntrack -S) showed:

cpu=0           found=0 invalid=166 ignore=59199 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=115
cpu=1           found=0 invalid=160 ignore=58152 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=125
cpu=2           found=0 invalid=161 ignore=60999 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=108
cpu=3           found=1 invalid=187 ignore=89662 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=113
cpu=4           found=0 invalid=170 ignore=73766 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=91
cpu=5           found=0 invalid=186 ignore=75964 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=108
cpu=6           found=0 invalid=172 ignore=77936 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=116
cpu=7           found=0 invalid=165 ignore=96004 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=104
cpu=8           found=0 invalid=0 ignore=64714 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=7
cpu=9           found=0 invalid=174 ignore=48375 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=84
cpu=10          found=0 invalid=161 ignore=83150 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=126
cpu=11          found=0 invalid=391 ignore=58921 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=239
cpu=12          found=0 invalid=176 ignore=77967 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=108
cpu=13          found=0 invalid=150 ignore=122890 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=128
cpu=14          found=0 invalid=154 ignore=62822 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=89
cpu=15          found=0 invalid=168 ignore=65315 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=113

We have seen established connections in conntrack -L on both the source nodes and the Alternat nodes.

And only limited activity for these connections in tcpdump:

tcpdump host x.x.x.x
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
05:54:00.294341 IP ip-10-10-29-139.ec2.internal.56482 > x.x.x.x.6011: Flags [F.], seq 3003739363, ack 1787942545, win 1463, options [nop,nop,TS val 2024721780 ecr 933756998], length 0
05:54:00.294345 IP ip-10-10-29-139.ec2.internal.56482 > x.x.x.x.6011: Flags [F.], seq 0, ack 1, win 1463, options [nop,nop,TS val 2024721780 ecr 933756998], length 0
05:54:02.002371 IP ip-10-10-29-139.ec2.internal.60756 > x.x.x.x.6011: Flags [F.], seq 1433822988, ack 3418363755, win 1365, options [nop,nop,TS val 3357797591 ecr 933756998], length 0
05:54:02.002377 IP ip-10-10-29-139.ec2.internal.60756 > x.x.x.x.6011: Flags [F.], seq 0, ack 1, win 1365, options [nop,nop,TS val 3357797591 ecr 933756998], length 0
05:54:07.178172 IP ip-10-10-29-139.ec2.internal.56084 > x.x.x.x.6012: Flags [F.], seq 3978364809, ack 1664524663, win 1210, options [nop,nop,TS val 16024769 ecr 933756998], length 0
05:54:07.178178 IP ip-10-10-29-139.ec2.internal.56084 > x.x.x.x.6012: Flags [F.], seq 0, ack 1, win 1210, options [nop,nop,TS val 16024769 ecr 933756998], length 0

I understand that it might be difficult to determine the root cause from the information provided. We would appreciate any ideas on where to look and what tools to use for debugging. Have we missed any other limits or useful metrics? In particular, where could the behavior differ from the Managed NAT Gateway?

One of the differences we noticed is the 350-second idle timeout of the Managed NAT Gateway. We have set sysctl net.netfilter.nf_conntrack_tcp_timeout_established=350 on the Alternat nodes to match it.
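
A minimal sketch of making that setting persistent on the Alternat instances (assuming an image that reads /etc/sysctl.d at boot; the file name is arbitrary):

# Drop-in file so the conntrack timeout survives reboots
cat <<'EOF' | sudo tee /etc/sysctl.d/90-alternat-conntrack.conf
net.netfilter.nf_conntrack_tcp_timeout_established = 350
EOF
sudo sysctl --system   # reload all sysctl configuration files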

bwhaley commented 1 year ago

Apologies for the delay, I've been out for a few weeks.

Can you collect some additional conntrack statistics on the alternat nodes?
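
For example (a sketch using standard conntrack-tools; any of these would help):

conntrack -S                           # per-CPU counters: found, invalid, drop, early_drop, ...
conntrack -C                           # current number of tracked connections
sysctl net.netfilter.nf_conntrack_max  # configured table size
cat /proc/net/stat/nf_conntrack        # raw per-CPU statistics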

bwhaley commented 1 year ago

random stuck connections when downloading data from MySQL

Is this only occurring when downloading data from MySQL? Have you observed stuck connections for anything else? And can you confirm precisely what you mean by stuck connections: is it that data being sent from an external MySQL server to an internal node sometimes stops before completing?

Halama commented 1 year ago

Hi @bwhaley, I apologize for the delay. We were still investigating the issue, and we have made progress in isolating and identifying the root cause.

During our investigation, we discovered that stuck connections were present even when we were using a NAT gateway. However, these stuck connections manifested themselves as 2-hour delays between retries. It appears that these stuck connections are likely caused by a race condition bug in ProxySQL, which is utilized in that environment.


Regarding the difference in timeout handling, the Managed NAT gateway and AlterNAT behave differently, as described in the documentation provided at https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-comparison.html. In the case of the Managed NAT gateway, stuck connections are closed after 2 hours, aligning with the TTL of the keep-alive timeout (net.ipv4.tcp_keepalive_time) on job worker nodes. On the other hand, with AlterNAT, stuck connections remain open indefinitely.

AWS Managed NAT gateway - When a connection times out, a NAT gateway returns an RST packet to any resources behind the NAT gateway that attempt to continue the connection (it does not send a FIN packet).

(packet capture showing the Managed NAT Gateway behavior)

AlterNAT - When a connection times out, a NAT instance sends a FIN packet to resources behind the NAT instance to close the connection.

(packet capture showing the AlterNAT behavior)
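
For context, the 2-hour figure matches the Linux default net.ipv4.tcp_keepalive_time of 7200 seconds. The effective values on the worker nodes can be confirmed with (a quick sanity check, nothing AlterNAT-specific):

sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes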

Is there any way to configure AlterNAT to match the timeout behavior of the AWS Managed NAT Gateway?

bwhaley commented 1 year ago

Thanks for the packet traces and additional debugging info. I am not yet sure what the exact issue is. Based on the packet captures you shared, it seems that the NAT Gateway behaves differently after the TCP keep-alive is sent: the NAT Gateway sends an RST, while the NAT instance acknowledges the keep-alive.

As a debugging step, would you be able to try some different keepalive settings using #69? Perhaps something like:

tcp_keepalive_time         = 600
tcp_keepalive_intvl        = 60
tcp_keepalive_probes       = 10

bwhaley commented 1 year ago

Actually, for troubleshooting, it might be faster for you to just manually set the values on the instance directly. When we find out what works we can update alternat to expose it as a configuration. If possible, could you experiment with a few sysctl settings and see if anything helps? A few settings you might want to try (one at a time!):

sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
sysctl -w net.netfilter.nf_conntrack_tcp_ignore_invalid_rst=1
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=300 # Default is 432000 (5 days)

Or other conntrack settings.
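
One untested idea for approximating the Managed NAT Gateway's RST-on-timeout behavior (a sketch only, not something alternat does today; it assumes nf_conntrack_tcp_loose is disabled so that packets belonging to expired flows are classified INVALID rather than being re-tracked):

# Do not pick up mid-stream connections whose conntrack entry has expired
sysctl -w net.netfilter.nf_conntrack_tcp_loose=0
# Answer such packets with a TCP RST instead of silently dropping them
iptables -A FORWARD -p tcp -m conntrack --ctstate INVALID -j REJECT --reject-with tcp-reset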

bwhaley commented 1 year ago

Closing this since it hasn't been active in a while. I'm keen to get to the bottom of it, though, so do feel free to re-engage if you're able to come back to it.