RaJiska opened this issue 1 year ago
If you're seeing the kind of volume that would require these kernel tweaks, you're likely at a point where either fck-nat can't sustain you or NAT Gateway would be more reasonable. Here's my logic on that:
Instances with fewer than 32 vCPUs are limited to 5 Gbps of internet egress bandwidth[1]. I think it is highly unlikely you would hit the kernel's connection-tracking limits in an environment pushing less than 5 Gbps.
Instances with over 32 vCPUs get 50% of the advertised bandwidth for internet egress[1]. The cheapest network-optimized instance with 32 vCPUs is a c6gn.8xlarge, which maxes out at 25 Gbps and costs ~$980 more per month to operate than NAT Gateway. You'd need roughly 21 TB of monthly egress before that extra cost breaks even with NAT Gateway's data processing charges. So this optimization really only serves people in that boat, and if you're in that boat, you likely want the availability and bandwidth (up to 100 Gbps) guarantees that NAT Gateway provides.
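(For a rough sense of that break-even, assuming NAT Gateway's $0.045/GB data processing rate in us-east-1 at the time of writing: $980 / $0.045 per GB ≈ 21,800 GB, or roughly 21 TB of egress per month.)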
I'm not saying I wouldn't accept contributions for this, just wanted to add some color as to why I haven't pursued this already.
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html
The two issues are unrelated: tuning those variables has nothing to do with how much bandwidth you can push.
I could have a million idle TCP connections, or one connection that is maxing out my bandwidth.
Tuning these values, adjusting the TCP timeout from 12 hours to something more reasonable, and increasing the default number of connections the kernel will track are all constrained by memory, not network speed.
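For reference, a minimal sketch of the knobs being discussed here, assuming a reasonably recent Linux kernel; the values are illustrative, not recommendations:

```sh
# Inspect the current connection-tracking ceiling and usage
sysctl net.netfilter.nf_conntrack_max
cat /proc/sys/net/netfilter/nf_conntrack_count

# Raise the number of connections the kernel will track
# (each conntrack entry costs a few hundred bytes of kernel memory)
sudo sysctl -w net.netfilter.nf_conntrack_max=262144

# Expire idle established TCP flows sooner than the default (value in seconds)
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600
```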
I understand they're unrelated, but I'm talking about likely use cases and how I've prioritized work. If you're utilizing a high number of connections, you're likely also utilizing higher bandwidth. Again, I'm not saying I wouldn't accept contributions or tackle this work; I'm just giving my reasoning for why it hasn't been done already, with a disclaimer that if you're worried about a large number of connections, you should consider this bandwidth information as well.
> Instances with fewer than 32 vCPUs are limited to 5 Gbps of internet egress bandwidth[1]. I think it is highly unlikely you would hit the kernel's connection-tracking limits in an environment pushing less than 5 Gbps.
Thank you for the additional context. I was actually not aware of this 5 Gbps per-instance limitation for internet-gateway-bound traffic; it's really sneaky of them.
That said, I have encountered a case where a single one of my instances (in a public subnet) had its conntrack table entirely filled and was dropping new connections, while being nowhere near the 5 Gbps limit. In this scenario, a fck-nat instance without kernel tuning would not have been able to sustain the load, even less so if it had been serving additional instances.
In this case kernel tuning would really help, but it would also require more resources, especially memory: probably at least a t4g.medium, or even an r7g.medium. Either would have an hourly rate similar to NAT Gateway's (excluding savings plans), but without the per-GB processing fee, which in this scenario might be the bulk of the bill.
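As a side note, a quick way to check how close an instance is to this failure mode (assuming the conntrack module is loaded):

```sh
# How full is the conntrack table?
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
echo "conntrack: ${count}/${max} entries ($((100 * count / max))% full)"

# Once the table is full, the kernel logs this as it drops new flows:
sudo dmesg | grep 'nf_conntrack: table full'
```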
The intention behind this issue is more to open a discussion on the matter, and perhaps to establish a comprehensive list of settings that would cover this case where fck-nat needs to handle a large number of connections without necessarily reaching its bandwidth limit.
You can avoid the 5 Gbps limit by sharding the public internet IP prefixes via CIDR deaggregation, i.e. multiple fck-nat instances for a single VPC via route table manipulation.
@philipg To put it simply: creating smaller private subnets, each with their own NAT instance? This would work, but unfortunately it requires changes to the networking layer just to accommodate this technical constraint, which is not ideal.
@RaJiska The other way around: sharding the public internet with multiple routes. Instead of a single 0.0.0.0/0 route, you split the internet address space across several routes, each pointing at a different NAT instance. See the sketch below.
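For illustration, a minimal sketch of this with the AWS CLI, using hypothetical resource IDs: 0.0.0.0/0 is replaced by two /1 routes so each half of the IPv4 space egresses through a different fck-nat ENI.

```sh
# Placeholder IDs; substitute your own route table and fck-nat ENIs
RTB=rtb-0123456789abcdef0

# First half of the IPv4 space through NAT instance A
aws ec2 create-route --route-table-id "$RTB" \
  --destination-cidr-block 0.0.0.0/1 \
  --network-interface-id eni-aaaaaaaaaaaaaaaaa

# Second half through NAT instance B
aws ec2 create-route --route-table-id "$RTB" \
  --destination-cidr-block 128.0.0.0/1 \
  --network-interface-id eni-bbbbbbbbbbbbbbbbb
```

Finer splits (/2, /3, ...) would scale the same idea across more instances.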
This is a clever trick. Thanks for sharing this idea.
Hi,
I'd like to open a discussion regarding fck-nat under production-grade load. The way it's currently configured might not be enough for such a load, as I could not see any kernel-tuning configuration in the scripts. Unfortunately I am no expert in kernel tuning and am not aware of all the settings that might be necessary, but here are a few that I can think of:
- `conntrack`: the conntrack table, once filled, will drop new connections:
  - `nf_conntrack_max`, which governs the maximum number of tracked connections (and optionally `nf_conntrack_buckets` for performance)
  - `nf_conntrack_tcp_timeout_*`, set to a lower value than the default perhaps?
- `tcp_wmem`, `tcp_rmem`, `udp_wmem`, `udp_rmem`, which should probably be increased to support a higher load
- `tcp_max_syn_backlog`
- `fs.file-max`, whose limit could be overflowed if there are too many connections

Perhaps some more could be added, but it would be interesting to have different profiles available depending on the intended usage of fck-nat; a rough sketch of one such profile follows below.
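To make that concrete, here is a sketch of what one such profile could look like as a drop-in sysctl file. The values are assumptions for illustration, not tested recommendations, and would need to be sized against the instance's available memory; note the UDP buffer knobs are exposed as `udp_rmem_min`/`udp_wmem_min` rather than `udp_rmem`/`udp_wmem`.

```sh
# Hypothetical "high connection count" profile; all values illustrative only
sudo tee /etc/sysctl.d/90-fck-nat-high-conn.conf >/dev/null <<'EOF'
# Track more concurrent connections (each entry costs kernel memory)
net.netfilter.nf_conntrack_max = 262144
# Expire idle established TCP flows sooner than the default (seconds)
net.netfilter.nf_conntrack_tcp_timeout_established = 3600
# TCP socket buffer sizes: min, default, max (bytes)
net.ipv4.tcp_rmem = 4096 131072 16777216
net.ipv4.tcp_wmem = 4096 131072 16777216
# Minimum guaranteed UDP buffer sizes (bytes)
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
# Deeper backlog to absorb SYN bursts
net.ipv4.tcp_max_syn_backlog = 8192
# System-wide file descriptor ceiling
fs.file-max = 1048576
EOF

# Apply the new settings
sudo sysctl --system
```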