Open dan-greene-brivo opened 7 months ago
If I understand correctly, what you're proposing is the following:
Did I understand correctly?
If so, the first challenge is how to know that the NAT instance has connectivity again. The route table now points to the NAT gateway. You'd need either:
I don't think we can use a solution like your first proposal because we do not want a "connectivity blip" - remaining connected is our highest priority. Remember that the connectivity checker runs every minute (by default) so you'd be interrupting the connection quite a lot, potentially, if the NAT instance is still broken.
Option (2) could have sorta the same problem. It could trigger an instance replacement, and the new instances would automatically claim the route at boot, as usual. But if it can't connect because the problem is somewhere else (e.g. the connectivity failure is not due to the NAT instance itself, but some AWS networking issue), then you'd end up in a loop where the new Lambda runs again, finds the NAT gateway as the route, terminates the instance, rinse & repeat.
I like the idea of a self-healing NAT instance, just need to find a practical approach.
I’ll start with just a lambda that resets the system while we figure out the least impactful time/mechanism to call it.
I'm putting this here to see if there's any interest in adding in the ability to "fall back" to the NAT instances after a failover due to curl failure. Or am I missing something that will set it back automatically?
I'm working on the code anyway, so I'm happy to make a PR if you think it's useful.
Right now, my first thought is to update the connectivity check Lambdas so that, the first time through, they check the route table and, if it points to a NAT Gateway, switch it back to the NAT instance just before the first check. That way, if the instance is still down, the route is immediately changed back. Effective, but it will cause a connectivity blip every minute while failed over to the NAT Gateway.
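Roughly, the route flip described above might look like the sketch below. This is only an illustration, not the project's actual code: the function names and the `nat_instance_eni_id` parameter are made up, and it assumes the default route is `0.0.0.0/0` and that `ec2` is a boto3 EC2 client.

```python
DEFAULT_CIDR = "0.0.0.0/0"  # assumption: the failover route is the IPv4 default route


def default_route_uses_nat_gateway(route_table: dict) -> bool:
    """Return True if the default route targets a NAT Gateway.

    `route_table` is one entry from ec2.describe_route_tables()["RouteTables"].
    """
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") == DEFAULT_CIDR:
            return "NatGatewayId" in route
    return False


def restore_nat_instance_route(ec2, route_table: dict, nat_instance_eni_id: str) -> bool:
    """Point the default route back at the NAT instance's ENI (hypothetical helper).

    Returns True if the route was changed, False if it was left alone.
    """
    if not default_route_uses_nat_gateway(route_table):
        return False
    ec2.replace_route(
        RouteTableId=route_table["RouteTableId"],
        DestinationCidrBlock=DEFAULT_CIDR,
        NetworkInterfaceId=nat_instance_eni_id,
    )
    return True
```

The existing connectivity check would then run as usual right after `restore_nat_instance_route`, which is where the blip comes from if the instance is still broken.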
Option 2 is to have a separate Lambda on a separate schedule (maybe every 15 minutes by default, or only on demand?). If the route tables are using NAT Gateways, it would run an "Instance Refresh" on the ASG, forcing it to re-create the instances. In theory, we could also just terminate the instances, and the ASG would do its thing.
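A minimal sketch of what that separate Lambda could do, assuming `ec2` and `autoscaling` are boto3 clients and that the route table IDs and ASG name are passed in (the function names and parameters here are hypothetical, not something that exists in the repo today):

```python
from typing import List, Optional


def routes_via_nat_gateway(route_table: dict) -> bool:
    """True if any route in this table targets a NAT Gateway."""
    return any("NatGatewayId" in route for route in route_table.get("Routes", []))


def refresh_if_failed_over(
    ec2, autoscaling, route_table_ids: List[str], asg_name: str
) -> Optional[str]:
    """Start an ASG Instance Refresh if any watched route table is on a NAT Gateway.

    Returns the InstanceRefreshId, or None if nothing was failed over.
    """
    response = ec2.describe_route_tables(RouteTableIds=route_table_ids)
    if not any(routes_via_nat_gateway(rt) for rt in response["RouteTables"]):
        return None  # still routing through the NAT instances; nothing to do
    refresh = autoscaling.start_instance_refresh(AutoScalingGroupName=asg_name)
    return refresh["InstanceRefreshId"]
```

This only replaces the instances; it relies on the new instances claiming the route at boot, and it does nothing on its own to avoid refreshing repeatedly if the underlying problem is elsewhere.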
Thoughts?