apple / swift-distributed-actors

Peer-to-peer cluster implementation for Swift Distributed Actors
https://apple.github.io/swift-distributed-actors/
Apache License 2.0
587 stars, 55 forks

=cluster handle aggressively rejoining/replacing nodes from same host/port pair #1083

Closed ktoso closed 1 year ago

ktoso commented 1 year ago

Resolves: https://github.com/apple/swift-distributed-actors/issues/1082


Short version: there is an edge case that was not handled well when very aggressively rejoining the cluster from the same host/port.

This can happen in k8s when a pod gets aggressively restarted, or with command-line apps when someone joins a cluster, sends just one request, and kills the app. In both cases the previous node may still be present in SWIM gossip (correctly) as dead, and the cluster may misinterpret that information as being about "itself" when the new node joins from the same host/port.


More analysis:

steps:

Long story short: SWIM tells us that a node on this address was dead, but we know we are not dead. Learning that we are down should only happen through high-level gossip, when we see a .down about us somewhere. So we can ignore this information at the SWIM level.
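To make the rule concrete, here is a minimal sketch of the decision described above. It is not the actual code from this PR: `Endpoint`, `UniqueNode`, `SWIMDeadReaction`, and `reactToSWIMDead` are hypothetical names invented for illustration, and the real cluster shell is more involved.

```swift
// Hypothetical types for illustration only; not the library's actual API.
struct Endpoint: Hashable {
    let host: String
    let port: Int
}

// A node is identified by its address plus an id unique to one process lifetime,
// so a restarted process on the same host/port is a different UniqueNode.
struct UniqueNode: Hashable {
    let endpoint: Endpoint
    let nid: UInt64
}

enum SWIMDeadReaction: Equatable {
    case ignore                // do nothing at the SWIM level
    case markDown(UniqueNode)  // escalate: treat the peer as down
}

// Decide how to react when SWIM reports `dead` as dead.
// Assumption: any SWIM "dead" about our own host/port is ignored here, because
// we are evidently alive; a real .down about us is expected to arrive only
// through the high-level cluster gossip.
func reactToSWIMDead(about dead: UniqueNode, selfNode: UniqueNode) -> SWIMDeadReaction {
    if dead.endpoint == selfNode.endpoint {
        return .ignore
    }
    return .markDown(dead)
}
```

With this shape, a "dead" report about the previous process on our own host/port is dropped, while a report about any other peer is still escalated as usual.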

This should also get fixed in SWIM itself, though; I'll follow up there.

ktoso commented 1 year ago

SWIM follow-up: https://github.com/apple/swift-cluster-membership/issues/91

ktoso commented 1 year ago

How I love a trailing space being the only failure :P