Sam Burnett found that his Raspberry Pi Seattle node would spawn lots of threads and crash after a few minutes. I debugged his device and installation remotely. The problem appears when the !CoordinationAffix tries to look up the node's Zenodotus name (HASH.zenodotus.poly.edu) before the advertise thread has advertised that name. This creates multiple Affix stacks, some of which feature nested !NatPunchAffixes, spawns hundreds of threads, and exhausts the available memory after 5-10 minutes, crashing the nodemanager.
If, on the other hand, the name is already advertised, e.g. because you just restarted the nodemanager after it hung or crashed, the problem does not occur. To trigger it again, you need to wait until the node name expires from the advertise services. Run dig +short thenodename to check whether Zenodotus still resolves it.
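The resolution check can also be scripted when reproducing the problem repeatedly. The following is only a minimal sketch, assuming that an ordinary DNS lookup through Python's standard socket module is an acceptable stand-in for dig. The function name name_is_advertised and its resolver parameter are hypothetical and not part of Seattle or Affix.

```python
import socket


def name_is_advertised(hostname, resolver=socket.gethostbyname):
    """Return True if hostname currently resolves, False otherwise.

    `resolver` is a hypothetical hook so the check can be exercised
    without network access; by default it performs a normal DNS lookup.
    """
    try:
        resolver(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when the
        # name cannot be resolved.
        return False
    return True
```

Calling name_is_advertised with the node's full Zenodotus name should return False once the name has expired, at which point the problem can be triggered again. Note that a local resolver cache may make this answer lag behind what dig reports.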
The nodemanager log looks like this after such a crash:
We've seen similar logs on (inaccessible) Android devices too. Detailed debug information from the !CoordinationAffix is available from the attached file.
We are still researching the exact mechanics of the problem to make sure the !CoordinationAffix "fails correctly" (instead of continuing despite an internal error and running in circles).