caio / foca

mirror of https://caio.co/de/foca/
Mozilla Public License 2.0

Looping on "Sender is considered down, sending notification message" #25

Closed. jeromegn closed this issue 1 year ago.

jeromegn commented 1 year ago

I'm not sure how this situation happened, but with just 2 nodes it appears that, at some point after a few restarts, something gets loopy and both nodes keep writing messages like:

INFO handle_data{len=82 header=Header { src: Actor { id: ActorId(d402fd8c-68fe-4ef8-9a72-5929e346c4d4), addr: [fdaa:0:3b99:a7b:139:f465:d5c8:2]:8787, bump: 25953 }, src_incarnation: 0, dst: Actor { id: ActorId(ee386b44-3d9d-4dec-af91-596a5d7b6323), addr: [fdaa:0:3b99:a7b:f8:dc04:c68a:2]:8787, bump: 60930 }, message: TurnUndead }}: foca: Sender is considered down, sending notification message

This is a new project using foca, and I've re-used much of the code I had for my first project. The main difference is I'm now leaving the cluster when exiting.

I'm on v0.11.

Any ideas?

caio commented 1 year ago

Hey there!

Let me see if I got it right: the scenario is you have a large cluster and after a restart many members started spitting these lines in their logs, accusing these two particular nodes of being down.

Sounds like reuse of a previously declared-down identity. Here's what I think is happening:

When you restarted, these two nodes ended up with a .bump they had used before, and the process of finding a bump they could use was extremely noisy: increment it by one, announce to the whole cluster, hear back that the new identity is also considered down, repeat.

If that's what's happening, all you need to do is guarantee that you don't reuse a bump while the cluster remembers seeing you with it in the past. A couple of suggestions:

Always using random is simpler and, in fact, I should've done that for the examples too: it's way less problematic than random+increment in case of conflicts. I'll fix that soon.
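For illustration, here is a minimal sketch of the two approaches discussed in this thread: a fully random bump versus a timestamp-derived one. This is not foca's actual `Identity` API; `MyId` and its constructors are hypothetical names, and the `rand` crate is assumed as a dependency.

```rust
use std::net::SocketAddr;
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical identity type: an address plus a bump that distinguishes
// successive incarnations of the same node. Not foca's actual Identity trait.
#[derive(Clone, Debug, PartialEq, Eq)]
struct MyId {
    addr: SocketAddr,
    bump: u16,
}

impl MyId {
    // Option A: pick a fully random bump on every (re)start, so a restarted
    // node is very unlikely to show up with a bump the cluster still
    // remembers as down.
    fn with_random_bump(addr: SocketAddr) -> Self {
        Self { addr, bump: rand::random::<u16>() }
    }

    // Option B: derive the bump from the clock, so each restart naturally
    // picks a different value than the previous one. Truncated to 16 bits
    // here only to keep the identity small; widen it if you need more
    // headroom between restarts.
    fn with_timestamp_bump(addr: SocketAddr) -> Self {
        let secs = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock is before the unix epoch")
            .as_secs();
        Self { addr, bump: (secs & 0xFFFF) as u16 }
    }
}
```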

jeromegn commented 1 year ago

This is a second project being tested on a much smaller scale. There were only 2 instances at the time this happened! Does that still fit your theory?

Thanks for the tips! A timestamp might work; I'm trying to figure out what precision I need, weighing both size and uniqueness.

caio commented 1 year ago

Hah, with just 2 nodes it's a lot less likely that the rand call led to repeated numbers. Of course my head went to the hard, obscure, and unlikely problem instead of the obvious one:

When you leave_cluster(), foca sends a gossip message to some peers so they learn about it; when the receiver processes that message, it simply checks whether the sender is active and, since it isn't anymore, sends the TurnUndead message.

Here: https://github.com/caio/foca/blob/42b7946ad67220ec63f56007370699041c92873e/src/lib.rs#L794-L816

So, it's a bad interaction between leaving the cluster and the newish feature to notify down members. It's very noisy, but harmless. I'll address it this weekend. Thanks a lot!

caio commented 1 year ago

Wait, the actual updates (with the "I'm leaving the cluster" data) are only applied after this logic (right after, in fact), so the above isn't actually happening. Or shouldn't be.

If I had found an actual bug you would be able to consistently reproduce the issue by simply restarting one node. That isn't what's happening, right?

I'll give it proper thought, but I'm back to thinking this is an unlucky rand. Maybe an accidental debug build using an RNG with a fixed seed?

jeromegn commented 1 year ago

I have a bit more info: restarting nodes didn't seem to trigger the issue.

However, this morning there was a network issue in the region where one of the two nodes was running, and then this happened:

It does keep the same bump, instead of doing the wrapping_add when renewing the identity. Should each node renew its identity once it detects itself as idle?
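For reference, a minimal sketch of the increment-on-renew behaviour being referred to; `MyId` and `renew` are hypothetical names matching the earlier sketch, not foca's actual `Identity` trait.

```rust
use std::net::SocketAddr;

// Hypothetical identity: an address plus a bump, as in the earlier sketch.
#[derive(Clone, Debug, PartialEq, Eq)]
struct MyId {
    addr: SocketAddr,
    bump: u16,
}

impl MyId {
    // Increment-style renewal: keep the address, advance the bump by one
    // (wrapping on overflow). A node that never renews keeps the exact same
    // bump for its whole lifetime.
    fn renew(&self) -> Option<Self> {
        Some(Self {
            addr: self.addr,
            bump: self.bump.wrapping_add(1),
        })
    }
}
```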

caio commented 1 year ago

Ah, of course!

Thanks a lot for the extra info: when the nodes consider each other down they end up never recovering:

  1. Node A sends announce to B
  2. Node B thinks A is down, so sends TurnUndead and ignores whatever the message was
  3. Node A receives TurnUndead, but thinks B is down so sends TurnUndead and ignores whatever the message was
  4. Back to 2 :thisisnotfine:

> It does keep the same bump, instead of doing the wrapping_add when renewing the identity. Should each node renew its identity once it detects itself as idle?

Idle here simply means that other members are not active anymore; this case was because of an outage, but it could've been a normal exit or the other node genuinely going down.

Interestingly though, if foca did that you wouldn't be having this problem :grin: But only because the cluster size is 2. For K=4, a 2-partition would lead to never-ending TurnUndead spam.

I think the correct fix is to simply check whether the message we're rejecting is a TurnUndead and, if so, renew; pretty easy to do. I'll sleep on it and ship a fix in the morning if I can't find a problem with this.
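A rough sketch of that idea, under assumptions: the names `Message`, `handle_rejected_sender`, `renew_own_identity`, and `send_turn_undead` are hypothetical, and this is not the actual fix (which is linked in the next comment).

```rust
// Hypothetical message kind; real foca messages carry much more than this.
enum Message {
    TurnUndead,
    Other,
}

// Called when we are about to reject a sender we consider down.
fn handle_rejected_sender(incoming: &Message, notify_down_members: bool) {
    match incoming {
        // The sender thinks *we* are down too: instead of answering with yet
        // another TurnUndead (and looping forever), renew our own identity so
        // we can rejoin the cluster with a fresh one.
        Message::TurnUndead => renew_own_identity(),
        // Normal path: optionally tell the sender we consider it down.
        _ if notify_down_members => send_turn_undead(),
        _ => {}
    }
}

fn renew_own_identity() { /* e.g. pick a new bump and rejoin */ }
fn send_turn_undead() { /* enqueue a TurnUndead reply */ }
```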

caio commented 1 year ago

The fix is at https://github.com/caio/foca/commit/8f8b0bd5250e1f56fe4e8ea7a230e45e618aa728

I'm baking v0.12 with it and will auto-close this ticket when released. Feel free to reopen if it doesn't actually fix what you're seeing.