caio / foca

mirror of https://caio.co/de/foca/
Mozilla Public License 2.0

Looping on "Sender is considered down, sending notification message" #25

Closed. jeromegn closed this issue 1 year ago.

jeromegn commented 1 year ago

I'm not sure how this situation happened, but with just 2 nodes it appears that, at some point after a few restarts, something gets loopy and both nodes keep writing messages like:

INFO handle_data{len=82 header=Header { src: Actor { id: ActorId(d402fd8c-68fe-4ef8-9a72-5929e346c4d4), addr: [fdaa:0:3b99:a7b:139:f465:d5c8:2]:8787, bump: 25953 }, src_incarnation: 0, dst: Actor { id: ActorId(ee386b44-3d9d-4dec-af91-596a5d7b6323), addr: [fdaa:0:3b99:a7b:f8:dc04:c68a:2]:8787, bump: 60930 }, message: TurnUndead }}: foca: Sender is considered down, sending notification message

This is a new project using foca, and I've re-used much of the code I had for my first project. The main difference is I'm now leaving the cluster when exiting.

I'm on v0.11.

Any ideas?

caio commented 1 year ago

Hey there!

Let me see if I got it right: the scenario is you have a large cluster and after a restart many members started spitting these lines in their logs, accusing these two particular nodes of being down.

Sounds like reuse of a previously declared-down identity. Here's what I think is happening:

When you restarted, these two nodes ended up with a .bump they had used before, and the process of finding a bump they could use was extremely noisy: increment it by one, announce to the whole cluster, hear back that the new identity is also considered down, repeat.

If that's what's happening, all you need to do is guarantee that you don't reuse a bump while the cluster remembers seeing you with it in the past. A couple of suggestions:

Always using random is simpler and, in fact, I should've done that for the examples too: it's way less problematic than random+increment in case of conflicts. I'll fix that soon.
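For illustration, here is a minimal sketch of the two approaches discussed in this thread: a fully random bump versus a timestamp-derived one. This is not foca's actual `Identity` API; `MyId` and its constructors are hypothetical names, and the `rand` crate is assumed as a dependency.

```rust
use std::net::SocketAddr;
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical identity type: an address plus a bump that distinguishes
// successive incarnations of the same node. Not foca's actual Identity trait.
#[derive(Clone, Debug, PartialEq, Eq)]
struct MyId {
    addr: SocketAddr,
    bump: u16,
}

impl MyId {
    // Option A: pick a fully random bump on every (re)start, so a restarted
    // node is very unlikely to show up with a bump the cluster still
    // remembers as down.
    fn with_random_bump(addr: SocketAddr) -> Self {
        Self { addr, bump: rand::random::<u16>() }
    }

    // Option B: derive the bump from the clock, so each restart naturally
    // picks a different value than the previous one. Truncated to 16 bits
    // here only to keep the identity small; widen it if you need more
    // headroom between restarts.
    fn with_timestamp_bump(addr: SocketAddr) -> Self {
        let secs = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock is before the unix epoch")
            .as_secs();
        Self { addr, bump: (secs & 0xFFFF) as u16 }
    }
}
```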

jeromegn commented 1 year ago

This is a second project being tested on a much smaller scale. There were only 2 instances at the time this happened! Does that still fit your theory?

Thanks for the tips! A timestamp might work; I'm trying to figure out what precision I need, weighing both size and uniqueness.

caio commented 1 year ago

Hah, with just 2 nodes it's a lot less likely that the rand call led to repeated numbers. Of course my head went to the hard, obscure, and unlikely problem instead of the obvious one:

When you leave_cluster(), foca sends a gossip message to some peers so they learn about it; when the receiver processes that message, it simply checks whether the sender is active and, since it isn't anymore, sends the TurnUndead message.

Here: https://github.com/caio/foca/blob/42b7946ad67220ec63f56007370699041c92873e/src/lib.rs#L794-L816

So, it's a bad interaction between leaving the cluster and the newish feature to notify down members. It's very noisy, but harmless. I'll address it this weekend. Thanks a lot!

caio commented 1 year ago

Wait, the actual updates (with the "I'm leaving the cluster" data) are only applied after this logic (right after, in fact), so the above isn't actually happening. Or shouldn't be.

If I had found an actual bug you would be able to consistently reproduce the issue by simply restarting one node. That isn't what's happening, right?

I'll give it proper thought, but I'm back to thinking this is an unlucky rand. Maybe an accidental debug build using an RNG with a fixed seed?

jeromegn commented 1 year ago

I have a bit more info: restarting nodes didn't seem to trigger the issue.

However, this morning there was a network issue in the region where one of the two nodes was running, and then this happened:

It does keep the same bump, instead of doing the wrapping_add when renewing the identity. Should each node renew its identity once it detects itself as idle?
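For reference, a minimal sketch of the increment-on-renew behaviour being referred to; `MyId` and `renew` are hypothetical names matching the earlier sketch, not foca's actual `Identity` trait.

```rust
use std::net::SocketAddr;

// Hypothetical identity: an address plus a bump, as in the earlier sketch.
#[derive(Clone, Debug, PartialEq, Eq)]
struct MyId {
    addr: SocketAddr,
    bump: u16,
}

impl MyId {
    // Increment-style renewal: keep the address, advance the bump by one
    // (wrapping on overflow). A node that never renews keeps the exact same
    // bump for its whole lifetime.
    fn renew(&self) -> Option<Self> {
        Some(Self {
            addr: self.addr,
            bump: self.bump.wrapping_add(1),
        })
    }
}
```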

caio commented 1 year ago

Ah, of course!

Thanks a lot for the extra info: when the nodes consider each other down they end up never recovering:

  1. Node A sends announce to B
  2. Node B thinks A is down, so sends TurnUndead and ignores whatever the message was
  3. Node A receives TurnUndead, but thinks B is down so sends TurnUndead and ignores whatever the message was
  4. Back to 2 :thisisnotfine:

> It does keep the same bump, instead of doing the wrapping_add when renewing the identity. Should each node renew its identity once it detects itself as idle?

Idle here simply means that other members are not active anymore; this case was because of an outage, but it could've been a normal exit or the other node genuinely going down.

Interestingly though, if foca did that you wouldn't be having this problem :grin: But only because the cluster size is 2. For K=4, a 2-partition would lead to never-ending TurnUndead spam.

I think the correct fix is to simply check whether the message we're rejecting is a TurnUndead and, if so, renew; pretty easy to do. I'll sleep on it and ship a fix in the morning if I can't find a problem with this.
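A rough sketch of that idea, under assumptions: the names `Message`, `handle_rejected_sender`, `renew_own_identity`, and `send_turn_undead` are hypothetical, and this is not the actual fix (which is linked in the next comment).

```rust
// Hypothetical message kind; real foca messages carry much more than this.
enum Message {
    TurnUndead,
    Other,
}

// Called when we are about to reject a sender we consider down.
fn handle_rejected_sender(incoming: &Message, notify_down_members: bool) {
    match incoming {
        // The sender thinks *we* are down too: instead of answering with yet
        // another TurnUndead (and looping forever), renew our own identity so
        // we can rejoin the cluster with a fresh one.
        Message::TurnUndead => renew_own_identity(),
        // Normal path: optionally tell the sender we consider it down.
        _ if notify_down_members => send_turn_undead(),
        _ => {}
    }
}

fn renew_own_identity() { /* e.g. pick a new bump and rejoin */ }
fn send_turn_undead() { /* enqueue a TurnUndead reply */ }
```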

caio commented 1 year ago

The fix is at https://github.com/caio/foca/commit/8f8b0bd5250e1f56fe4e8ea7a230e45e618aa728

I'm baking v0.12 with it and will auto-close this ticket when released. Feel free to reopen if it doesn't actually fix what you're seeing.