chime / terraform-aws-alternat

High availability implementation of AWS NAT instances.
MIT License

AlterNAT at scale Questions #108

Open thedoomcrewinc opened 1 month ago

thedoomcrewinc commented 1 month ago

Our Dev/Staging environments, quite frankly, see very little traffic, and in testing we encountered no issues at all (which is good).

When we tried to simulate a production environment with AlterNAT, we noticed the instance would start to drop traffic once we reached higher Lambda execution volumes: we hit the instance's PPS limit and packets were dropped. Increasing the NAT instance class seemed to help.

Our Production environment is a different beast.
The vast majority of our NAT traffic comes from Lambda executions, occasionally bursting past 300,000 executions per minute.

I'm concerned about hitting a PPS limit and seeing drops in production.

Since you stated you send several PB of traffic, I'm going to guess your volume is a lot higher than ours (which would make sense).

Our short-lived Lambdas (SSO, DynamoDB lookups, small API requests) are all quick in and out, but our long-running Lambdas can run upwards of 6 minutes (data continues to flow to/from the browser during this, so the PPS does not stop).

Without going into specifics:

All of our Lambdas use a single subnet with a NAT Gateway in that subnet. Unfortunately I cannot change that, as re-engineering the architecture is not feasible until winter 2025.

(This has been transferred from an email with @bwhaley for public visibility and comments)

bwhaley commented 1 month ago

Thanks for the question.

We have observed non-zero values in the ENA metrics, namely bw_{in,out}_allowance_exceeded and pps_allowance_exceeded. The values have been so low relative to the total bandwidth/PPS that we've just ignored them and counted on TCP to sort it out. Some of those packets are probably queued/delayed, but some are dropped and result in a retransmit.
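
If you want to check these on your own NAT instance, here's a rough sketch that scrapes the counters from `ethtool -S` (assuming the ENA interface is `ens5`; on some AMIs/instance types it's `eth0`):

```python
#!/usr/bin/env python3
"""Rough sketch: read the ENA *_allowance_exceeded counters on a NAT instance.

Assumes the ENA driver exposes them via `ethtool -S` and that the primary
interface is ens5 (adjust if your AMI uses eth0).
"""
import re
import subprocess

INTERFACE = "ens5"  # assumption: check `ip link` on your instance


def ena_allowance_counters(interface: str = INTERFACE) -> dict:
    """Return all *_allowance_exceeded counters reported by the ENA driver."""
    out = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True
    ).stdout
    counters = {}
    for line in out.splitlines():
        match = re.match(r"\s*(\w*allowance_exceeded):\s*(\d+)", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


if __name__ == "__main__":
    # Non-zero pps_allowance_exceeded / bw_{in,out}_allowance_exceeded means the
    # instance hit its PPS or bandwidth limit and packets were queued or dropped.
    for name, value in ena_allowance_counters().items():
        print(f"{name}: {value}")
```

Keep in mind these counters are cumulative, so you'd diff successive readings to see whether they're still growing.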

These most likely happen because of microbursts in traffic. It's pretty hard to avoid. You can keep upsizing instances until you get to the max to see if that resolves it. It will definitely help, as you observed and as AWS states, such as in this article. If it doesn't, fixing microbursts can be challenging. This article mentions some advanced strategies for mitigating bursts if you cannot scale horizontally (e.g. in your case since you have the constraint of a single subnet/route table), but those approaches are not going to work with Lambdas.

If you haven't already seen it, this article discusses how you can measure pps limits if you want to test different instance types. You may also be able to set up packet captures and look for retransmits to see how widespread the problem actually is.
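
As a rough way to quantify retransmits from a capture taken on the NAT instance (e.g. with tcpdump), something like the sketch below could work. It's only a heuristic that counts data segments whose flow/sequence number has already been seen, and the `nat.pcap` filename is a placeholder; Wireshark/tshark's `tcp.analysis.retransmission` filter is more thorough.

```python
"""Rough sketch: estimate the TCP retransmission ratio in a packet capture."""
from scapy.all import IP, TCP, PcapReader  # pip install scapy


def retransmit_ratio(pcap_path: str = "nat.pcap") -> float:
    """Fraction of data-carrying TCP segments that repeat an already-seen
    (src, dst, sport, dport, seq) tuple, i.e. likely retransmissions."""
    seen = set()
    total = retransmits = 0
    with PcapReader(pcap_path) as packets:
        for pkt in packets:
            if not (IP in pkt and TCP in pkt):
                continue
            if len(bytes(pkt[TCP].payload)) == 0:
                continue  # ignore pure ACKs
            key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport, pkt[TCP].seq)
            total += 1
            if key in seen:
                retransmits += 1
            else:
                seen.add(key)
    return retransmits / total if total else 0.0


if __name__ == "__main__":
    print(f"retransmit ratio: {retransmit_ratio():.4%}")
```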

I opened #107 to make it easier to expose these metrics with Alternat.
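
In the meantime, if you want to alarm on those counters yourself, a minimal sketch along these lines could push them to CloudWatch from a cron job or systemd timer on the instance. The `Alternat/ENA` namespace and metric layout here are my own placeholders, not necessarily what #107 will implement.

```python
"""Rough sketch: publish ENA allowance-exceeded counters to CloudWatch.

Meant to be run periodically on the NAT instance, reusing
ena_allowance_counters() from the earlier sketch. The namespace and
dimensions are arbitrary assumptions.
"""
import boto3


def publish_counters(counters: dict, instance_id: str) -> None:
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="Alternat/ENA",  # assumed namespace; pick your own
        MetricData=[
            {
                "MetricName": name,
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Value": float(value),
                "Unit": "Count",
            }
            for name, value in counters.items()
        ],
    )


if __name__ == "__main__":
    # Example with dummy values; in practice pass ena_allowance_counters()
    # and the real instance ID (e.g. from instance metadata).
    publish_counters(
        {"pps_allowance_exceeded": 0, "bw_out_allowance_exceeded": 0},
        instance_id="i-0123456789abcdef0",  # placeholder
    )
```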

bwhaley commented 1 month ago

@thedoomcrewinc Does this help at all? Are you going to try some larger instances or anything as a next step?

thedoomcrewinc commented 1 month ago

@bwhaley Apologies for the delay in update.

We're in a push for our Back to School effort, and I won't be able to test this until after Aug 1st. I'll update shortly thereafter.

thedoomcrewinc commented 1 week ago

Follow up as promised:

After testing various instance classes and size combinations up to a c7gn.16xlarge, we determined that the sweet spot was a c6gn.8xlarge instance.

We too observed non-zero values, but as you observed and commented, we can safely say we're not worried about the impact.

Microbursts do occur, but in general we don't worry about them.

I'll report back in September, after the vast majority of schools are back in session and our traffic levels have stabilized at the new "normal".

bwhaley commented 1 day ago

Thanks, I appreciate that you're following up here!