AndrewGuenther / fck-nat

Feasible cost konfigurable NAT: An AWS NAT Instance AMI
https://fck-nat.dev
MIT License

ASG Refresh response time? #80

Open DarkJoney opened 7 months ago

DarkJoney commented 7 months ago

Dear Andrew,

I have a question regarding the startup time of fck-nat on the instance refresh event.

We are currently testing whether we can use your solution to replace the NAT Gateway in front of our EKS cluster. My colleague asked how fast the failover response time is and how large the network interruption is when we update fck-nat, etc.

I have built a lab where the whole fck-nat setup, with an EC2 instance behind it, runs in Frankfurt, and the test host is in Ireland. fck-nat has the EIP assignment. I was running just a single ping command from host to host with the default 1-second interval. I have tried both ASG Instance Refresh strategies, terminate-then-launch and launch-before-terminate; both result in 50-100 lost ping attempts between the hosts.
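For reference, the probe was essentially along these lines (a rough sketch rather than the exact command I ran; the target address is a placeholder):

```python
# Rough sketch of the probe described above (placeholder target address):
# send one ping per second for 300 seconds and count how many are lost,
# as an estimate of the NAT outage window during the instance refresh.
import subprocess
import time

TARGET = "203.0.113.10"  # placeholder for the test host in Ireland

lost = 0
for _ in range(300):
    ok = subprocess.run(
        ["ping", "-c", "1", "-W", "1", TARGET],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0
    if not ok:
        lost += 1
    time.sleep(1)

print(f"lost {lost} of 300 probes")
```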

I have had a look at the behaviour: while the new instance is not yet up and the old one is still present, the EIP gets reattached to the new instance, which is still offline. This causes a longer wait, which is not seamless at all. [screenshot attached]

I went through your code; would changing the order of the script logic have any effect? I mean performing the system configuration first, and only then performing the EIP assignment to the new instance. What do you think? To summarise, it seems more correct to perform the EIP attachment after the instance is configured to act as a NAT gateway, because right now we detach the IP from the still-alive instance and cause a longer network disruption than necessary.

We have to use the EIP so that the address stays fixed behind the DNS entry.

Please let me know what you think, thank you!

DarkJoney commented 7 months ago

I did one more test with a UDP run using iperf and got the following result: [screenshot attached]

The test run was 300 seconds, so the uptime was about 244-246 seconds with terminate-then-launch, and launch-before-terminate gives even worse results.

RaJiska commented 6 months ago

Hey @DarkJoney ,

As a word of warning, please note it might not be appropriate to run fck-nat in a production environment that needs high availability or fast recovery. Furthermore, fck-nat (or any other EC2 instance acting as NAT, for the record) might not be suited for a production environment, as per this.

Now for your question, switching the EIP is only one part of it and only happens if you configured a static EIP. The change actually happens when the ENI is swapped from the old instance to the new one, which is currently only done once the previous instance is terminated, which is why you are facing a longer recovery time with the launch-before-terminate strategy.

One could speed up the process by explicitly detaching the ENI from the old instance and re-attaching it to the new one. I remember mentioning this before, but I believe the idea was eventually dropped as it added too much complexity for now.

Finally, you can have a look at https://github.com/AndrewGuenther/fck-nat/issues/71 which, if proven to work, and once implemented, would provide true HA to fck-nat.

AndrewGuenther commented 6 months ago

The change actually happens when the ENI is swapped from the old instance to the new one, which is currently only done once the previous instance is terminated, which is why you are facing a longer recovery time with the launch-before-terminate strategy.

@RaJiska this is not correct. The EIP move is the very first thing that happens and in the case that the instance being replaced is healthy this takes NAT out of service until all the remaining steps complete. If EIP reassociation is moved to after the ENI attachment, it should make the launch-after-terminate and launch-before-terminate times roughly equal.
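For illustration, the proposed ordering would look roughly like this (a boto3 sketch, not the actual fck-nat startup script; all IDs are placeholders):

```python
# Illustrative boto3 sketch of the proposed ordering (not the actual fck-nat
# startup script; all IDs are placeholders): attach the static ENI and finish
# NAT configuration first, and re-associate the EIP as the last step, so the
# old instance keeps serving traffic for as long as possible.
import boto3

ec2 = boto3.client("ec2")

ENI_ID = "eni-0123456789abcdef0"              # static NAT ENI (placeholder)
NEW_INSTANCE_ID = "i-0123456789abcdef0"       # replacement instance (placeholder)
ALLOCATION_ID = "eipalloc-0123456789abcdef0"  # static EIP allocation (placeholder)

# 1. Attach the static ENI to the new instance.
ec2.attach_network_interface(
    NetworkInterfaceId=ENI_ID,
    InstanceId=NEW_INSTANCE_ID,
    DeviceIndex=1,
)

# 2. System configuration happens here (IP forwarding, masquerade rules, ...).

# 3. Only now take over the EIP from the old instance.
ec2.associate_address(
    AllocationId=ALLOCATION_ID,
    NetworkInterfaceId=ENI_ID,
    AllowReassociation=True,
)
```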

One could speed up the process by explicitly detaching the ENI from the old instance and re-attaching it to the new one. I remember mentioning this before, but I believe the idea was eventually dropped as it added too much complexity for now.

The issue here is that it's going to take at least two API calls to do this: one to find the current attachment and another to perform the detach. I was initially concerned about the latency these calls might introduce, given that they don't always need to be performed and relying on the termination of the previous instance was a relatively reliable indicator, but I'm open to re-evaluating. In the case of an instance becoming unhealthy and needing replacement, the current implementation is faster. In the case of an explicit replacement where a new instance is brought up before termination of the old instance, running the explicit detach is faster. I think the latency introduced by a single additional API call to check for attachments is likely worth the improvement of performing the explicit detach. I'll test this out and measure the impact.
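Sketched out, the explicit detach would be something like this (illustrative boto3, not production code; IDs are placeholders and a proper wait between detach and attach is omitted):

```python
# Illustrative boto3 sketch of the explicit detach being discussed (IDs are
# placeholders; a proper wait/poll between detach and re-attach is elided).
import boto3

ec2 = boto3.client("ec2")

ENI_ID = "eni-0123456789abcdef0"         # static NAT ENI (placeholder)
NEW_INSTANCE_ID = "i-0fedcba987654321"   # replacement instance (placeholder)

# Call 1: look up the ENI's current attachment, if any.
eni = ec2.describe_network_interfaces(NetworkInterfaceIds=[ENI_ID])["NetworkInterfaces"][0]
attachment = eni.get("Attachment")

# Call 2: detach only if the ENI is still attached to the old instance.
if attachment and attachment.get("InstanceId") != NEW_INSTANCE_ID:
    ec2.detach_network_interface(AttachmentId=attachment["AttachmentId"], Force=True)
    # ...wait for the ENI to report "available" before re-attaching...

ec2.attach_network_interface(
    NetworkInterfaceId=ENI_ID,
    InstanceId=NEW_INSTANCE_ID,
    DeviceIndex=1,
)
```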

I also think it wouldn't be unreasonable to remove the automatic disabling of src/dest checks. This should be done ahead of time and doing this call on every startup just slows things down. Or do it after all other steps so it's out of the critical path.

RaJiska commented 6 months ago

The EIP move is the very first thing that happens and in the case that the instance being replaced is healthy this takes NAT out of service until all the remaining steps complete.

Indeed, the EIP is the first thing that is moved; however, NAT instances are allocated a random EIP upon boot to be able to interact with the AWS API. When reassociating, only a brief interruption would be expected, as the old NAT instance would revert to its original EIP and subsequently resume its NAT activity with the original instance IP instead. Now thinking about this, it actually might be an issue in cases where someone has no tolerance for using an IP other than the static EIP defined.

I also think it wouldn't be unreasonable to remove the automatic disabling of src/dest checks.

This is not possible as this API call targets the dynamic ENI, as in the one that is automatically created by the instance. As far as I know it is not possible to configure the ENI with this feature in the launch template. That said, I am not sure if this setting needs to be turned on for both the dynamic and static ENI, or just the static one.
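For reference, the call in question looks roughly like this (an illustrative boto3 sketch; the ENI ID is a placeholder that would in practice be looked up from instance metadata at boot):

```python
# Illustrative boto3 sketch of the src/dest check call in question; the ENI ID
# is a placeholder that would normally be resolved from instance metadata at
# boot, since the call targets the dynamic (instance-created) ENI.
import boto3

ec2 = boto3.client("ec2")
ec2.modify_network_interface_attribute(
    NetworkInterfaceId="eni-0aaaaaaaaaaaaaaaa",  # dynamic ENI (placeholder)
    SourceDestCheck={"Value": False},
)
```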

AndrewGuenther commented 6 months ago

Now thinking about this, it actually might be an issue in cases where someone has no tolerance for using an IP other than the static EIP defined.

I believe this is the exact situation that the reporter is describing.

This is not possible as this API call targets the dynamic ENI, as in the one that is automatically created by the instance. As far as I know it is not possible to configure the ENI with this feature in the launch template. That said, I am not sure if this setting needs to be turned on for both the dynamic and static ENI, or just the static one.

Ahhhh, that's right. Good call.

DarkJoney commented 6 months ago

The EIP move is the very first thing that happens and in the case that the instance being replaced is healthy this takes NAT out of service until all the remaining steps complete.

Indeed, the EIP is the first thing that is moved; however, NAT instances are allocated a random EIP upon boot to be able to interact with the AWS API. When reassociating, only a brief interruption would be expected, as the old NAT instance would revert to its original EIP and subsequently resume its NAT activity with the original instance IP instead. Now thinking about this, it actually might be an issue in cases where someone has no tolerance for using an IP other than the static EIP defined.

I also think it wouldn't be unreasonable to remove the automatic disabling of src/dest checks.

This is not possible as this API call targets the dynamic ENI, as in the one that is automatically created by the instance. As far as I know it is not possible to configure the ENI with this feature in the launch template. That said, I am not sure if this setting needs to be turned on for both the dynamic and static ENI, or just the static one.

Indeed, the EC2 instance in the ASG gets a temporary IP, and when the first one detaches, it receives the one that was previously allocated. I am just curious whether changing the setup so the system configuration happens first, while the old instance is still in place, would give any benefit and win some time.

Regarding the production workload, it is mostly Kafka traffic that passes through the NAT Gateway. We did some raw benchmarks, and especially on a GN instance the values are not that bad. Would you expect any issues for this use case? We have a situation where the Fargate EKS cluster is CHEAPER than the traffic cost...

I also assume the actual AWS NAT Gateway may have a tweaked kernel and network stack, because the raw performance flatlines at a certain point and stops scaling with the instance size.

In the "Limitations" section of the wiki you have mentioned 5-second failover time, how can I be achieved?

AndrewGuenther commented 6 months ago

I am just curious whether changing the setup so the system configuration happens first, while the old instance is still in place, would give any benefit and win some time.

Moving the EIP allocation will bring launch-after-terminate and launch-before-terminate times more in line for your use case, and the explicit detach will bring it down a bit more, but that's all we can do for now.

Regarding the production workload, it is mostly Kafka traffic that passes through the NAT Gateway. We did some raw benchmarks, and especially on a GN instance the values are not that bad. Would you expect any issues for this use case? We have a situation where the Fargate EKS cluster is CHEAPER than the traffic cost...

It's all about your tolerances to NAT downtime. I've deployed fck-nat to tons of production workloads, but there are some cases where I don't. All comes down to your requirements.

In the "Limitations" section of the wiki you have mentioned 5-second failover time, how can I be achieved?

The limitations section does not mention a 5-second failover time. It says failover can take up to 5 minutes in cases where you fully rely on ASG replacement of an unhealthy host, and that launching a second instance while the first is still healthy would result in "a few seconds" of downtime, which is admittedly more ambiguous than it should be.

As part of the improvements we'll make here (reordering EIP attachment, adding an explicit detach) I'll do some benchmarking on replacement times and make those docs more objective.