Open choksi81 opened 10 years ago
Testing with "flapping private IP": My node is connected to a NAT forwarder when the interface goes down for 7 seconds. 127.0.0.1 becomes the new perceived node IP, and looking up Afffix-enable fails (obviously). When the interface comes back up, the old private IP is restored. Retrying to contact the NAT forwarder results in "Error: There is a duplicate connection which conflicts with the request!". Apparently, the old socket to the forwarder was never closed when the node IP changed, and we retry with the same source port number (cf. #1397). This error turns up multiple times as nmmain keeps retrying until it gives up.
At the same time, since the old socket is still there, the connection to the NAT forwarder appears never to have gone down! The NAT forwarder didn't notice either: it received no FINs or RSTs, and the node was disconnected from the network only briefly, so the forwarder's TCP stack didn't time out. Consequently, I can still seash into the node once it is up again, even though no new (post-flap) connection to the NAT forwarder could be established.
Unless we code up a mobility affix that takes care of such situations, my proposal is to treat any change in IP address as a fatal, unrecoverable discontinuity for all current connections, which must then be torn down and set up anew. This causes chatter in the case of very frequent flapping (which I would consider unlikely, even given the connectivity patterns I exhibit), but makes the additional required logic stateless and thus rather simple.
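To make the proposal concrete, here is a minimal sketch of the stateless rule "any IP change tears everything down". It is not nodemanager code: `handle_ip_observation`, `is_usable`, and the `reconnect` callback are hypothetical names, and the sketch assumes connections expose a `close()` method and that a loopback address means "no connectivity".

```python
import ipaddress


def is_usable(ip):
    """True if ip is neither None nor a loopback address (assumed to mean 'offline')."""
    return ip is not None and not ipaddress.ip_address(ip).is_loopback


def handle_ip_observation(last_ip, current_ip, connections, reconnect):
    """Apply the 'any IP change is fatal' rule to one observation.

    last_ip / current_ip: previously and currently perceived node IP (strings).
    connections: objects with a close() method (hypothetical stand-ins for
        the sockets to the NAT forwarder and others).
    reconnect: hypothetical callback taking the new IP and returning a list
        of freshly established connections.
    Returns the pair (ip_to_remember, current_connections).
    """
    if current_ip == last_ip:
        # Nothing changed, so keep the existing connections.
        return last_ip, connections

    # The address changed: close every old connection so that no stale
    # socket keeps holding the old source port.
    for conn in connections:
        try:
            conn.close()
        except OSError:
            pass  # the connection may already be dead; ignore and move on

    # Only set up new connections if the new address is usable at all.
    fresh = reconnect(current_ip) if is_usable(current_ip) else []
    return current_ip, fresh
```

Called once per polling interval, this needs no memory beyond the last observed IP. In the flap described above, the change to 127.0.0.1 closes the forwarder socket, and the change back to the private IP triggers a fresh connection.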
Thanks to the recently added Affix support, the Seattle nodemanager is now contactable even when the node is behind a NAT. We now must ensure that transitions between NAT and connectivity states always result in the nodemanager being contactable.
Previously (without NAT traversal), the nodemanager would detect either that its public node IP had changed to a different public node IP, or that connectivity had been interrupted altogether. In the latter case, the nodemanager would retry repeatedly until it discovered that connectivity was restored. If, on the other hand, the IP address changed, it would stop its current advertise thread and start a new one that would advertise the node's new address and port.
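The stop-and-restart behaviour of the advertise thread can be sketched roughly as follows. This is an illustration only, not the actual nodemanager implementation: the `AdvertiseLoop` class, its `announce` callback, and the `interval` parameter are all hypothetical.

```python
import threading


class AdvertiseLoop:
    """Hypothetical wrapper that runs at most one advertise thread at a time."""

    def __init__(self, announce, interval=1.0):
        self._announce = announce    # callback taking (ip, port); an assumption
        self._interval = interval    # seconds between announcements
        self._stop_event = None
        self._thread = None

    def start(self, ip, port):
        """Stop any running advertise thread, then advertise (ip, port)."""
        self.stop()
        stop_event = threading.Event()

        def run():
            # Announce repeatedly until this particular thread is told to stop.
            while not stop_event.is_set():
                self._announce(ip, port)
                stop_event.wait(self._interval)

        self._stop_event = stop_event
        self._thread = threading.Thread(target=run, daemon=True)
        self._thread.start()

    def stop(self):
        """Signal the current thread to stop and wait until it has exited."""
        if self._thread is not None:
            self._stop_event.set()
            self._thread.join()
            self._thread = None
```

Because `start` joins the old thread before launching the new one, the old address is guaranteed not to be announced after the switch. In this sketch, reacting to an IP change amounts to calling `start` again with the new address and port.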
With support for NAT traversal for the nodemanager, we have a lot of additional states and transitions between states that need to be considered: Private-to-new-private, public-to-private, private-to-public, and also "flapping" (on--off--on) connectivity with no IP address changes. We might notice the lack of connectivity or change of address in parts of the Affix stack before the main nodemanager logic triggers. This makes the problem a little more difficult.
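As a rough illustration of the transitions listed above, the following sketch labels a pair of observed addresses. The function names and label strings are made up for this example. It also assumes that a private-range address means "behind a NAT" and that a loopback or missing address means "no connectivity". Both are heuristics, not something the nodemanager is known to rely on.

```python
import ipaddress


def address_kind(ip):
    """Return 'offline', 'private', or 'public' for an address string or None."""
    if ip is None:
        return "offline"
    addr = ipaddress.ip_address(ip)
    # Check loopback first: Python also reports loopback addresses as private.
    if addr.is_loopback:
        return "offline"
    return "private" if addr.is_private else "public"


def classify_transition(old_ip, new_ip):
    """Label the change from old_ip (last address seen with connectivity,
    or None at startup) to new_ip (the address observed now)."""
    old_kind, new_kind = address_kind(old_ip), address_kind(new_ip)
    if new_kind == "offline":
        return "offline"
    if old_kind == "offline":
        return "offline-to-" + new_kind
    if old_ip == new_ip:
        # Same address as before, e.g. after an on--off--on flap.
        return "flap-or-unchanged"
    if old_kind == new_kind:
        return old_kind + "-to-new-" + new_kind
    return old_kind + "-to-" + new_kind
```

For example, `classify_transition("192.168.1.5", "10.0.0.7")` yields `"private-to-new-private"`, and `classify_transition("192.168.1.5", "127.0.0.1")` yields `"offline"`.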
Task: For all of the scenarios (those involving Affixes and those that don't), ensure that the nodemanager remains contactable after a few (tens of) seconds of reconfiguration. Also, make sure that the old advertised values are no longer advertised, and that appropriate new values start (and then continue) to be advertised.