Short version: there is an edge case that was not handled well when a node very aggressively rejoins the cluster from the same host/port.
This can happen on Kubernetes when a pod gets restarted aggressively, or with command line apps where someone joins a cluster, "sends just one request and kills the app": in both cases the previous incarnation may still (correctly) be present in SWIM gossip as dead or suspect, and the cluster may misinterpret that information as being about "itself" when the new node joins.
More analysis:
steps:
- Node 7 joins node 8.
- 7 becomes the leader:
  - we require a minimum of 2 nodes before we elect one,
  - the lower address wins.
- 7 dies.
- 8 cannot declare 7 as down:
  - only the leader can do this,
  - this is ok, as designed; such systems are expected to get back to their node count and then recover.
- 7 reboots -- let's call it 77:
  - same host/port,
  - new UID.
- 77 handshakes with 8:
  - 8 accepts,
  - 77 gets the accept,
  - 8 declares the "previous 7" as down, since 77 is its replacement.
- 8 declaring 7 down is correct:
  - it means we now have a down 7 in the membership, which is also correct,
  - other nodes may not yet know about this, so we want to spread the down information that 8 first noticed,
  - so gossip still includes the old node 7 (okay).
- Node 77 receives gossip through SWIM, and it includes 7:
  - in other words, SWIM spreads information about both nodes since 7 is not confirmDead yet -- THIS IS OK. But continuing to act on the removed node's information is NOT ok.

Long story short: SWIM tells us that a node on our own address is dead, but we know we are not dead. Declaring ourselves down should only happen through the high-level gossip, when we see a .down about exactly us (including our UID), so we can safely ignore this information at the SWIM level; see the sketch below.

This should also get fixed in SWIM itself though, I'll follow up there.
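
To make the intended behavior concrete, here is a minimal sketch of the two checks involved, using simplified stand-in types rather than the actual library API (`Node`, `Member`, and the function names below are illustrative only): the replacement detection node 8 performs on handshake, and the fix on the rejoined node's side, ignoring SWIM-level dead/suspect information that matches our host/port but not our UID.

```swift
// Simplified stand-in for the cluster's node identity; the real type also carries
// a system name, but host/port/UID are what matter for this edge case.
struct Node: Hashable {
    var host: String
    var port: Int
    var uid: UInt64 // regenerated on every process start, so a restart yields a new UID
}

enum MemberStatus { case alive, suspect, dead, down }

struct Member {
    var node: Node
    var status: MemberStatus
}

/// Sketch of the check node 8 performs on handshake: if the joining node has the
/// same host/port as a known member but a different UID, that member is a previous
/// incarnation, and it is correct to mark it .down and accept the newcomer as its
/// replacement.
func previousIncarnation(of joining: Node, in members: [Member]) -> Member? {
    members.first { member in
        member.node.host == joining.host
            && member.node.port == joining.port
            && member.node.uid != joining.uid
    }
}

/// Sketch of the fix on the rejoined node's (77's) side: SWIM gossip may still carry
/// the old, suspect/dead member on our own host/port. That information is about the
/// previous incarnation, not about us, so we must not act on it at the SWIM level.
/// Only a high-level cluster .down about exactly our node (UID included) means
/// "we are down".
func shouldIgnoreAtSWIMLevel(gossiped member: Member, myself: Node) -> Bool {
    member.node.host == myself.host
        && member.node.port == myself.port
        && member.node.uid != myself.uid
}
```

The UID comparison is the important part: equal host/port alone cannot distinguish "information about a prior incarnation on our address" from "information about us".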
Resolves: https://github.com/apple/swift-distributed-actors/issues/1082
For reference, the trace log from the reproduction, in which the replacement node 77 receives a ping whose gossip payload still contains the old node 7:

    2022-11-01T13:08:56+0900 trace Client : actor/id=/user/swim actor/path=/user/swim cluster/node=sact://REPLACEMENT_77@127.0.0.1:7337 swim/incarnation=0 swim/members/all=["SWIM.Member(SWIMActor(id:sact://RemoteCluster:7602583950674506995@127.0.0.1:8337/user/swim, node:sact://RemoteCluster:7602583950674506995@127.0.0.1:8337, alive(incarnation: 0), protocolPeriod: 1)", "SWIM.Member(SWIMActor(id:/user/swim, node:sact://REPLACEMENT_77@127.0.0.1:7337, alive(incarnation: 0), protocolPeriod: 0)"] swim/members/count=2 swim/ping/origin=sact://RemoteCluster:7602583950674506995@127.0.0.1:8337/user/swim swim/ping/payload=membership([SWIM.Member(SWIMActor(id:sact://OLD_NODE_7@127.0.0.1:7337/user/swim, node:sact://OLD_NODE_7@127.0.0.1:7337, suspect(incarnation: 0, suspectedBy: Set([sact://sact@127.0.0.1:8337#7602583950674506995])), protocolPeriod: 56), SWIM.Member(SWIMActor(id:sact://RemoteCluster:7602583950674506995@127.0.0.1:8337/user/swim, node:sact://RemoteCluster:7602583950674506995@127.0.0.1:8337, alive(incarnation: 0), protocolPeriod: 0), SWIM.Member(SWIMActor(id:/user/swim, node:sact://REPLACEMENT_77@127.0.0.1:7337, alive(incarnation: 0), protocolPeriod: 56)]) swim/ping/seqNr=4 swim/protocolPeriod=1 swim/suspects/count=0 swim/timeoutSuspectsBeforePeriodMax=11 swim/timeoutSuspectsBeforePeriodMin=4 [DistributedCluster] Received ping@4

The ping payload, broken out for readability:

    swim/ping/payload=membership([
      SWIM.Member(SWIMActor(id:sact://OLD_NODE_7@127.0.0.1:7337/user/swim, node:sact://OLD_NODE_7@127.0.0.1:7337, suspect(incarnation: 0, suspectedBy: Set([sact://sact@127.0.0.1:8337#7602583950674506995])), protocolPeriod: 56),
      SWIM.Member(SWIMActor(id:sact://REPLACEMENT_77@127.0.0.1:7337/user/swim, node:sact://REPLACEMENT_77@127.0.0.1:7337, alive(incarnation: 0), protocolPeriod: 56),
      SWIM.Member(SWIMActor(id:sact://RemoteCluster:7602583950674506995@127.0.0.1:8337/user/swim, node:sact://RemoteCluster:7602583950674506995@127.0.0.1:8337, alive(incarnation: 0), protocolPeriod: 0)
    ])
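
Applying the check from the sketch above to this payload, the replacement node keeps acting on its own entry and on node 8, and ignores only the stale entry for the old incarnation on its own address. This is a hypothetical walk-through reusing the simplified types and `shouldIgnoreAtSWIMLevel(gossiped:myself:)` from the earlier sketch; the UID values are made-up placeholders, since the log shows redacted node names rather than raw UIDs.

```swift
let myself = Node(host: "127.0.0.1", port: 7337, uid: 2) // REPLACEMENT_77, i.e. "us"

let gossiped: [Member] = [
    Member(node: Node(host: "127.0.0.1", port: 7337, uid: 1), status: .suspect), // OLD_NODE_7, previous incarnation
    Member(node: Node(host: "127.0.0.1", port: 7337, uid: 2), status: .alive),   // REPLACEMENT_77, ourselves
    Member(node: Node(host: "127.0.0.1", port: 8337, uid: 3), status: .alive),   // node 8
]

let actedUpon = gossiped.filter { !shouldIgnoreAtSWIMLevel(gossiped: $0, myself: myself) }
// actedUpon keeps our own entry and node 8; the stale OLD_NODE_7 entry
// (same host/port as us, different UID) is ignored at the SWIM level.
```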