Nodemanager crashes due to race condition in CoordinationAffix

Sam Burnett found that his Raspberry Pi Seattle node would spawn lots of threads and crash after a few minutes. I debugged his device and installation remotely. The problem appears when the CoordinationAffix tries to look up the node's Zenodotus name (HASH.zenodotus.poly.edu) before the advertise thread has advertised that name. This creates multiple Affix stacks, some of which feature nested NatPunchAffixes, spawns hundreds of threads, and exhausts the available memory after 5-10 minutes, crashing the nodemanager.
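
The ordering problem can be illustrated with a minimal Python 3 sketch. This is not Seattle code: `advertised_names`, `advertise_done`, `advertise_thread`, and `lookup_own_name` are hypothetical names, a plain dict stands in for the advertise service, and the address is a made-up placeholder. The lookup waits (with a timeout) for the advertise thread to signal that the name has been published, instead of racing it.

```python
import threading

# Hypothetical stand-ins; none of these names come from the Seattle codebase.
advertised_names = {}                # plays the role of the advertise service
advertise_done = threading.Event()   # set once the node name is published

def advertise_thread(name, address):
    """Publish the node name, then signal any waiting lookups."""
    advertised_names[name] = address
    advertise_done.set()

def lookup_own_name(name, timeout=5.0):
    """Wait for the advertisement before looking up the name."""
    if not advertise_done.wait(timeout):
        raise RuntimeError(
            "name %r was not advertised within %.1f s" % (name, timeout))
    return advertised_names[name]

# Placeholder address, for illustration only.
worker = threading.Thread(target=advertise_thread,
                          args=("HASH.zenodotus.poly.edu", "10.0.0.1"))
worker.start()
address = lookup_own_name("HASH.zenodotus.poly.edu")
worker.join()
```

Whether the real fix should block on the advertisement or handle the missing name differently is an open question; the sketch only shows the ordering that the lookup currently does not enforce.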

If, on the other hand, the name is already advertised (e.g. because you just restarted the nodemanager after it hung or crashed), the problem does not occur. To trigger it again, you need to wait until the node name expires from the advertise services; check dig +short thenodename to see whether Zenodotus still resolves it.
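
For scripted reproduction attempts, the same check can be done from Python 3 instead of dig. The following is a rough sketch, not part of Seattle: `name_resolves` is a hypothetical helper, and the `resolver` parameter exists only so the function can be exercised without network access.

```python
import socket

def name_resolves(hostname, resolver=socket.gethostbyname):
    """Return True if hostname currently resolves, False otherwise.

    `resolver` is any callable that returns an address or raises
    OSError (socket.gaierror is a subclass) when resolution fails.
    """
    try:
        resolver(hostname)
    except OSError:
        return False
    return True
```

Calling `name_resolves("HASH.zenodotus.poly.edu")` with the node's actual name in place of HASH should return False once the name has expired, which is the state in which the problem can be triggered again.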

The nodemanager log looks like this after such a crash:

1394836840.12:PID-17871:[INFO]:platform.python_version(): "2.7.3"
1394836840.12:PID-17871:[INFO]:platform.platform(): "Linux-3.6.11+-armv6l-with-debian-7.1"
1394836840.13:PID-17871:[INFO]:platform.uname(): "('Linux', 'raspberrypi', '3.6.11+', '#538 PREEMPT Fri Aug 30 20:42:08 BST 2013', 'armv6l', '')"
1394836840.13:PID-17871:[INFO]:Loading config
1394836851.16:PID-17871:[INFO]: Current advertised Affix string: (NatDeciderAffix)

We've seen similar logs on (inaccessible) Android devices too. Detailed debug information from the CoordinationAffix is available from the attached file.

We are still researching the exact mechanics of the problem to make sure the CoordinationAffix "fails correctly" (instead of continuing despite an internal error and running in circles).
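
One possible shape of "failing correctly" is a lookup that gives up after a fixed number of attempts and raises, rather than retrying indefinitely. The Python 3 sketch below is an assumption about what that could look like, not the actual CoordinationAffix logic; `NameNotAdvertisedError` and `lookup_with_limit` are hypothetical names, and `lookup` is any callable that raises LookupError on failure.

```python
class NameNotAdvertisedError(Exception):
    """Raised when the node name cannot be resolved after all attempts."""

def lookup_with_limit(lookup, name, max_attempts=3):
    """Call lookup(name) at most max_attempts times, then give up loudly."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return lookup(name)
        except LookupError as error:
            # Remember the failure and retry; no new state is built up here.
            last_error = error
    raise NameNotAdvertisedError(
        "%s not resolvable after %d attempts: %s"
        % (name, max_attempts, last_error))
```

A caller that catches `NameNotAdvertisedError` can then stop or fall back explicitly, instead of continuing despite the internal error.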
