chime / terraform-aws-alternat

High availability implementation of AWS NAT instances.
MIT License

"Reset" to NAT instances after failover #90

Open dan-greene-brivo opened 7 months ago

dan-greene-brivo commented 7 months ago

I'm putting this here to see if there's any interest in adding the ability to "fall back" to the NAT instances after a failover due to a curl failure. Or am I missing something that will set it back automatically?

I'm working on the code anyway, so I'm happy to make a PR if you think it's useful.

Right now, my first thought is to update the connection check Lambdas so that on each run they check the route table first; if it's set to a NAT Gateway, switch it to the NAT instance just before the connectivity check, so that if the instance is still down, the route is immediately flipped back. Effective, but it will cause a connectivity blip every minute while failed over to the NAT Gateway.
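Roughly, the pre-check could look something like this (a sketch only, with boto3; `ROUTE_TABLE_ID` and `NAT_INSTANCE_ENI_ID` are hypothetical environment variables standing in for wherever alternat already keeps that configuration):

```python
import os

import boto3

ec2 = boto3.client("ec2")

ROUTE_TABLE_ID = os.environ["ROUTE_TABLE_ID"]
NAT_INSTANCE_ENI_ID = os.environ["NAT_INSTANCE_ENI_ID"]


def flip_back_to_nat_instance_if_failed_over():
    """If the default route points at a NAT Gateway, point it back at the NAT instance ENI."""
    route_table = ec2.describe_route_tables(RouteTableIds=[ROUTE_TABLE_ID])["RouteTables"][0]
    default_route = next(
        (r for r in route_table["Routes"] if r.get("DestinationCidrBlock") == "0.0.0.0/0"),
        None,
    )
    if default_route and "NatGatewayId" in default_route:
        # Currently failed over; optimistically switch back just before the connectivity
        # check, so a still-broken instance gets flipped back within one check interval.
        ec2.replace_route(
            RouteTableId=ROUTE_TABLE_ID,
            DestinationCidrBlock="0.0.0.0/0",
            NetworkInterfaceId=NAT_INSTANCE_ENI_ID,
        )
```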

Option 2 is to have a separate Lambda on its own schedule (maybe every 15 minutes by default, or only on demand?) that, if the route tables are using NAT Gateways, runs an "Instance Refresh" on the ASG, forcing it to re-create the instances. In theory, we could just terminate the instances and let the ASG do its thing as well.
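As a sketch of that second option (not a finished implementation; `NAT_ASG_NAME` and `ROUTE_TABLE_IDS` are hypothetical environment variables):

```python
import os

import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

ASG_NAME = os.environ["NAT_ASG_NAME"]
ROUTE_TABLE_IDS = os.environ["ROUTE_TABLE_IDS"].split(",")


def handler(event, context):
    # Only act if at least one route table is currently failed over to a NAT Gateway.
    failed_over = False
    for rtb_id in ROUTE_TABLE_IDS:
        routes = ec2.describe_route_tables(RouteTableIds=[rtb_id])["RouteTables"][0]["Routes"]
        default_route = next(
            (r for r in routes if r.get("DestinationCidrBlock") == "0.0.0.0/0"), None
        )
        if default_route and "NatGatewayId" in default_route:
            failed_over = True
            break

    if failed_over:
        # Recreate the NAT instance(s); on boot they claim the route again as usual.
        autoscaling.start_instance_refresh(
            AutoScalingGroupName=ASG_NAME,
            Preferences={"MinHealthyPercentage": 0},
        )
```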

Thoughts?

bwhaley commented 7 months ago

If I understand correctly, what you're proposing is the following:

  1. NAT instance fails connectivity checks for some reason.
  2. Connectivity checker Lambda notices the failure and replaces the route to go through the NAT gateway.
  3. Now the NAT instance is sitting around doing nothing.
  4. Some time later, the NAT instance is able to connect again.
  5. There should be a process to automatically switch back to the NAT instance.

Did I understand correctly?

If so, the first challenge is how to know that the NAT instance has connectivity again. The route table now points to the NAT gateway. You'd need either:

  1. Another, separate route table that points to the NAT instance, plus a Lambda in a subnet that uses this route table, checking connectivity. If connectivity succeeds, update the original route to point to the instance again.
  2. Or, have the NAT instance itself check its connection and update the route once connectivity is working (see the sketch after this list).
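For illustration only, option (2) might be a small script run on the NAT instance itself (via cron or similar). This is a sketch, not alternat's actual connectivity check; the route table ID, the instance's ENI ID, and the check URLs are all assumptions here:

```python
import os
import urllib.request

import boto3

ec2 = boto3.client("ec2")

ROUTE_TABLE_ID = os.environ["ROUTE_TABLE_ID"]
ENI_ID = os.environ["NAT_INSTANCE_ENI_ID"]
CHECK_URLS = ["https://www.example.com", "https://www.amazon.com"]


def has_connectivity() -> bool:
    """Return True if any check URL is reachable from this instance."""
    for url in CHECK_URLS:
        try:
            urllib.request.urlopen(url, timeout=5)
            return True
        except Exception:
            continue
    return False


def reclaim_route_if_healthy():
    if not has_connectivity():
        return
    routes = ec2.describe_route_tables(RouteTableIds=[ROUTE_TABLE_ID])["RouteTables"][0]["Routes"]
    default_route = next(
        (r for r in routes if r.get("DestinationCidrBlock") == "0.0.0.0/0"), None
    )
    # Only flip the route back if it is currently pointing at a NAT Gateway.
    if default_route and "NatGatewayId" in default_route:
        ec2.replace_route(
            RouteTableId=ROUTE_TABLE_ID,
            DestinationCidrBlock="0.0.0.0/0",
            NetworkInterfaceId=ENI_ID,
        )
```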

I don't think we can use a solution like your first proposal because we do not want a "connectivity blip" - remaining connected is our highest priority. Remember that the connectivity checker runs every minute (by default) so you'd be interrupting the connection quite a lot, potentially, if the NAT instance is still broken.

Option (2) could have sorta the same problem. It could trigger an instance replacement, and the new instances would automatically claim the route at boot, as usual. But if the new instance still can't connect because the problem is somewhere else (e.g. the connectivity failure is not due to the NAT instance itself, but to some AWS networking issue), then you'd end up in a loop where the new Lambda runs again, finds the NAT gateway as the route, terminates the instance, rinse & repeat.

I like the idea of a self-healing NAT instance, just need to find a practical approach.

dan-greene-brivo commented 6 months ago

I’ll start with just a lambda that resets the system while we figure out the least impactful time/mechanism to call it.
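One possible shape for that reset Lambda, purely as a sketch: terminate the current NAT instances in the ASG so replacements boot and reclaim their routes as usual. `NAT_ASG_NAME` is a hypothetical environment variable, and when to invoke this is exactly the open question above:

```python
import os

import boto3

autoscaling = boto3.client("autoscaling")

ASG_NAME = os.environ["NAT_ASG_NAME"]


def handler(event, context):
    asg = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])[
        "AutoScalingGroups"
    ][0]
    for instance in asg["Instances"]:
        # The ASG launches a replacement, which claims the route on boot as usual.
        autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=instance["InstanceId"],
            ShouldDecrementDesiredCapacity=False,
        )
```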