Hey there!
Let me see if I got it right: the scenario is you have a large cluster and after a restart many members started spitting these lines in their logs, accusing these two particular nodes of being down.
Sounds like reuse of an identity that was previously declared down. Here's what I think is happening:
The .bump field is initialized with a random value and, when renewing, it gets incremented by one (wrapping around on overflow). When you restarted, these two nodes ended up with a .bump they had used before, and the process of finding a bump they could use was extremely noisy: increment by one, announce to the whole cluster, hear back that the new identity is also down, repeat.
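As a rough sketch of what I mean (the struct shape and field types below are just an example, not foca's actual Identity trait):

```rust
use std::net::SocketAddr;
use rand::Rng;

// Illustrative identity: a fixed address plus a disambiguating bump.
#[derive(Clone, PartialEq, Eq, Debug)]
struct MyId {
    addr: SocketAddr,
    bump: u16,
}

impl MyId {
    fn new(addr: SocketAddr) -> Self {
        // Fresh processes start with a random bump, so a restart usually
        // doesn't collide with a previous incarnation of the same address.
        Self { addr, bump: rand::thread_rng().gen() }
    }

    // Increment-based renew: same address, bump + 1, wrapping on overflow.
    // After a restart that lands near previously-used values, every renew
    // can hit yet another bump the cluster still remembers as down.
    fn renew(&self) -> Self {
        Self { addr: self.addr, bump: self.bump.wrapping_add(1) }
    }
}
```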
If that's what's happening, all you need to do is guarantee that you don't reuse a bump while the cluster remembers seeing you with it in the past. A couple of suggestions:
- In Identity::renew, use a random number every time instead of just incrementing (sketch after this list). This drastically reduces the noise in case of a bump collision: if a restart lands on a down identity, it's much less likely that the renewed one collides again.
- Make .bump only ever grow. You could, for example, derive it from a timestamp.
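For the first suggestion, an illustrative renew that picks a fresh random bump (same example identity shape as above, not a prescribed implementation) could look like:

```rust
use std::net::SocketAddr;
use rand::Rng;

#[derive(Clone, PartialEq, Eq, Debug)]
struct MyId {
    addr: SocketAddr,
    bump: u16,
}

impl MyId {
    // Random-based renew: on a collision with a bump the cluster remembers
    // as down, one renewal jumps to an unrelated value instead of walking
    // through a range of neighbouring (and likely also remembered) bumps.
    fn renew(&self) -> Self {
        Self { addr: self.addr, bump: rand::thread_rng().gen() }
    }
}
```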
Always using random is simpler and, in fact, I should've done that for the examples too: it's way less problematic than random+increment in case of conflicts. I'll fix that soon.
This is a second project being tested on a much smaller scale. There were only 2 instances at the time this happened! Does that still fit your theory?
Thanks for the tips! A timestamp might work; I'm trying to figure out what precision I need for both size and uniqueness considerations.
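For instance, a minimal sketch of a timestamp-derived bump, assuming a u16 field and 1-second precision as placeholders for the size-vs-uniqueness trade-off:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Derive the bump from wall-clock time. With a u16 and 1-second precision it
// wraps roughly every 18 hours, so it only "always grows" within that window;
// a u32 of seconds wouldn't wrap for ~136 years, at the cost of two extra
// bytes on every identity sent over the wire.
fn timestamp_bump() -> u16 {
    let secs = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the unix epoch")
        .as_secs();
    (secs % u64::from(u16::MAX)) as u16
}
```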
Hah, with just 2 nodes it's a lot less likely that the rand call led to repeated numbers. Of course my head went to the hard, obscure, unlikely problem instead of the obvious one:
When you leave_cluster(), foca sends a gossip to some peers so they learn about it; when a receiver processes that message, it simply checks whether the sender is active and, since it isn't anymore, it sends the TurnUndead message.
Here: https://github.com/caio/foca/blob/42b7946ad67220ec63f56007370699041c92873e/src/lib.rs#L794-L816
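In rough, simplified pseudocode (stand-in types so it compiles on its own, not the actual implementation):

```rust
use std::collections::HashSet;

// Minimal stand-in types for the sketch; these are not foca's real types.
#[derive(Clone, PartialEq, Eq, Hash)]
struct Member(String);

enum Message {
    TurnUndead,
    // ...other variants elided
}

struct Cluster {
    down: HashSet<Member>,
    outbox: Vec<(Member, Message)>,
}

impl Cluster {
    // A sender that is remembered as down gets a TurnUndead notification,
    // and its piggybacked updates (e.g. the "I'm leaving" state) are only
    // looked at after this check.
    fn on_message(&mut self, sender: Member) {
        if self.down.contains(&sender) {
            self.outbox.push((sender, Message::TurnUndead));
            return;
        }
        // ...normal processing, including applying the sender's updates,
        // happens here.
    }
}
```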
So, it's a bad interaction between leaving the cluster and the newish feature to notify down members. It's very noisy, but harmless. I'll address it this weekend. Thanks a lot!
Wait, the actual updates (with the "I'm leaving the cluster" data) are only applied after this logic (right after, in fact), so the above isn't actually happening. Or shouldn't be.
If I had found an actual bug you would be able to consistently reproduce the issue by simply restarting one node. That isn't what's happening, right?
I'll give it proper thought, but I'm back to thinking this is an unlucky rand. Maybe an accidental debug build using an RNG with a fixed seed?
I have a bit more info: restarting nodes didn't seem to trigger the issue.
However, this morning there was a network issue with a region where one of the two nodes was running and then this happened:
Header { src: Actor { id: ActorId(d4bd6777-1a50-463a-a1b9-03ae5ed09863), addr: [fdaa:0:3b99:a7b:113:9dbf:5a1e:2]:8787, bump: 41981 }, src_incarnation: 0, dst: Actor { id: ActorId(53075418-8b09-4e83-9bca-dce06c48ac78), addr: [fdaa:0:3b99:a7b:139:98b9:d1f7:2]:8787, bump: 28611 }, message: TurnUndead }}: foca: Sender is considered down, sending notification message
Header { src: Actor { id: ActorId(d4bd6777-1a50-463a-a1b9-03ae5ed09863), addr: [fdaa:0:3b99:a7b:113:9dbf:5a1e:2]:8787, bump: 41981 }, src_incarnation: 0, dst: Actor { id: ActorId(53075418-8b09-4e83-9bca-dce06c48ac78), addr: [fdaa:0:3b99:a7b:139:98b9:d1f7:2]:8787, bump: 28611 }, message: TurnUndead }}: foca: Sender is considered down, sending notification message
It does keep the same bump, instead of doing the wrapping_add, when renewing the identity. Should each node renew its identity once they detect themselves as idle?
Ah, of course!
Thanks a lot for the extra info: when the nodes consider each other down they end up never recovering:
> It does keep the same bump, instead of doing the wrapping_add, when renewing the identity. Should each node renew its identity once they detect themselves as idle?
Idle here simply means that the other members are not active anymore; this case was caused by an outage, but it could've been a normal exit, or the other node genuinely going down.
Interestingly though, if foca did that you wouldn't be having this problem :grin: But only because the cluster size is 2. For K=4, a 2-partition would still lead to never-ending TurnUndead spam.
I think the correct fix is to simply check whether the message we're rejecting is a TurnUndead and, if so, renew; pretty easy to do. I'll sleep on it and ship a fix in the morning if I can't find a problem with this.
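Conceptually, something along these lines (an illustrative sketch; names don't match foca's internals or the actual patch linked below):

```rust
enum Message {
    TurnUndead,
    Other,
}

// Called when a message arrives from a sender we currently consider down.
fn on_message_from_down_sender(
    msg: &Message,
    renew_own_identity: impl FnOnce(),
    send_turn_undead_back: impl FnOnce(),
) {
    match msg {
        // The peer thinks *we* are down: renew our identity (new bump) so we
        // rejoin, instead of answering with the same identity and letting the
        // two nodes bounce TurnUndead at each other forever.
        Message::TurnUndead => renew_own_identity(),
        // Anything else keeps the existing behaviour: notify the sender that
        // we consider it down.
        _ => send_turn_undead_back(),
    }
}
```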
The fix is at https://github.com/caio/foca/commit/8f8b0bd5250e1f56fe4e8ea7a230e45e618aa728
I'm baking v0.12 with it and will auto-close this ticket when released. Feel free to reopen if it doesn't actually fix what you're seeing.
I'm not sure how this situation happened, but with just 2 nodes, it appears that at some point after a few restarts something gets loopy and both nodes keep writing messages like:
This is a new project using foca, and I've re-used much of the code I had for my first project. The main difference is that I'm now leaving the cluster when exiting. I'm on v0.11. Any ideas?