Closed: Halama closed this issue 1 year ago.
Apologies for the delay, I've been out for a few weeks.
Can you collect some additional conntrack statistics on the alternat nodes?
"random stuck connections when downloading data from MySQL"
Is this only occurring when downloading data from MySQL? Have you observed stuck connections for anything else? And can you confirm precisely what you mean by "stuck connections"? Is it that data being sent from an external MySQL server to some internal node sometimes stops before completing?
Hi @bwhaley, I apologize for the delay. We were still investigating the issue, and we have made progress in isolating and identifying the root cause.
During our investigation, we discovered that stuck connections were present even when we were using a NAT gateway. However, these stuck connections manifested themselves as 2-hour delays between retries. It appears that these stuck connections are likely caused by a race condition bug in ProxySQL, which is utilized in that environment.
Regarding the difference in timeout handling, the Managed NAT gateway and AlterNAT behave differently, as described in the documentation at https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-comparison.html. With the Managed NAT gateway, stuck connections are closed after 2 hours, which aligns with the keep-alive timeout (net.ipv4.tcp_keepalive_time, whose default is 7200 seconds, i.e. 2 hours) on our job worker nodes. With AlterNAT, stuck connections remain open indefinitely.
AWS Managed NAT gateway - When a connection times out, a NAT gateway returns an RST packet to any resources behind the NAT gateway that attempt to continue the connection (it does not send a FIN packet).
AlterNAT - When a connection times out, a NAT instance sends a FIN packet to resources behind the NAT instance to close the connection.
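Since the difference hinges on how idle connections and keep-alive probes are handled, one client-side mitigation (independent of AlterNAT itself) is to enable aggressive per-socket TCP keepalives so the application detects a dead NAT mapping on its own. A minimal sketch, assuming a Linux client; the option names are Linux-specific and the timeout values are illustrative, not recommendations:

```python
import socket

# Create a TCP socket and enable keepalive probes so a dead NAT
# mapping is detected instead of the connection hanging forever.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Per-socket overrides of the system-wide sysctl defaults
# (net.ipv4.tcp_keepalive_time / _intvl / _probes).
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)   # idle seconds before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)   # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)      # failed probes before giving up

print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))  # → 300 on Linux
```

With these values the socket would give up on an unresponsive peer after roughly 300 + 5 * 60 = 600 seconds of silence, rather than relying on the NAT device to signal anything.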
Is there any way to configure AlterNAT to match the timeout behavior of the AWS Managed NAT Gateway?
Thanks for the packet traces and additional debugging info. I am not yet sure what the exact issue is. Based on the packet captures you shared, the NAT Gateway and the NAT instance behave differently after the TCP keep-alive is sent: the NAT Gateway sends an RST, while the NAT instance acknowledges the keep-alive.
As a debugging step, would you be able to try some different keepalive settings using #69? Perhaps something like:
tcp_keepalive_time = 600
tcp_keepalive_intvl = 60
tcp_keepalive_probes = 10
Actually, for troubleshooting, it might be faster for you to just manually set the values on the instance directly. When we find out what works we can update alternat to expose it as a configuration. If possible, could you experiment with a few sysctl settings and see if anything helps? A few settings you might want to try (one at a time!):
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
sysctl -w net.netfilter.nf_conntrack_tcp_ignore_invalid_rst=1
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=300 # Default is 5 days (432000 seconds)
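If one of these settings turns out to help, it can be made persistent across reboots with a sysctl.d drop-in. A sketch, assuming a systemd-based AMI; the file name and the key/value shown are illustrative, so substitute whichever setting actually proved effective:

```shell
# Persist the winning setting across reboots via a sysctl.d drop-in.
# File name and value are illustrative -- use the key that actually helped.
cat <<'EOF' | sudo tee /etc/sysctl.d/90-alternat-tuning.conf
net.ipv4.tcp_keepalive_time = 600
EOF

# Reload all sysctl configuration files so the drop-in takes effect now.
sudo sysctl --system
```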
Closing this since it hasn't been active in a while. I'm keen to get to the bottom of it, though, so do feel free to re-engage if you're able to come back to it.
Hello, we are trying to use Alternat as a replacement for our Managed NAT Gateway. In our use case, data from various sources is uploaded/downloaded from the internet through NAT.
We are facing random stuck connections when downloading data from MySQL via Alternat. We do not see these errors happening with the same configurations when using the Managed NAT Gateway. It's important to note that there were no Alternat failovers (route changes) when the connection got stuck. We have confirmed that the issue is definitely related to routing via Alternat. Unfortunately, we are not able to simulate this issue in our testing environment in isolation.
We have implemented the monitoring provided by ENA, and we do not see any limits being hit at the times of the errors, nor should we be close to them. Our instance is currently an m6g.4xlarge (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html for its bandwidth limits).
When a connection gets stuck, we performed the following checks:
We see established entries in conntrack -L for these connections, on both the source nodes and the alternat nodes.
We see only limited activity for these connections in tcpdump.
I understand that it might be difficult to determine the root cause from the information provided. We would appreciate any ideas on where to look and what tools to use for debugging. Have we missed any other limits or useful metrics? In particular, where can the behavior differ from the Managed NAT gateway?
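The checks described above can be sketched as the following commands, run on the alternat node while a transfer is hung. Port 3306 assumes the stuck flow is MySQL; substitute the real endpoint, and note these need root privileges:

```shell
# List established conntrack entries for the suspect flow
# (port 3306 assumes MySQL; adjust to the actual destination).
sudo conntrack -L -p tcp --dport 3306 --state ESTABLISHED

# Watch for (lack of) traffic on the same flow.
# -n skips DNS lookups, -c 100 stops after 100 packets.
sudo tcpdump -ni any 'tcp port 3306' -c 100
```

A conntrack entry that stays ESTABLISHED while tcpdump shows no packets in either direction is a good signature of a stuck flow.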
One of the differences we noticed is the 350-second idle timeout of the Managed NAT gateway. To match it, we have set
sysctl net.netfilter.nf_conntrack_tcp_timeout_established=350
on the Alternat nodes.