asynkron / protoactor-dotnet

Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin
http://proto.actor
Apache License 2.0

SeedNode provider conceptually broken #2060

Closed by rogeralsing 9 months ago

rogeralsing commented 1 year ago

I've found a bug in the seed node provider which is somewhat hard to fix with the current design.

The gossip implementation uses data from the member list to know which members exist. Whenever a gossip request arrives and the local gossip state is updated, it uses the member list to filter out gossip state for members that do not exist.

The issue is that this gossip data is the very data we are supposed to feed into the member list, so we end up in a catch-22.

e.g.

1. Nodes B and C connect to seed node A.
2. A knows about B and C and propagates this information to both B and C.
3. B does not yet know about C, and C does not yet know about B, so C is not in B's member list and vice versa.
4. When the gossip data arrives from A at B, B filters out the data for C because C is an unknown member.

The filtering is there to prevent bad gossip state, e.g. stale or invalid data, from propagating through the cluster.
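The scenario can be reproduced with a toy model. This is only a minimal sketch, not the Proto.Actor gossip code: the `Node` class, `receive_gossip`, and the dict-based gossip state are hypothetical names made up for illustration.

```python
class Node:
    """Toy cluster node (illustrative only, not the Proto.Actor implementation)."""

    def __init__(self, name):
        self.name = name
        self.members = {name}                    # local member list
        self.gossip = {name: {"heartbeat": 1}}   # gossip state keyed by member id

    def receive_gossip(self, remote_state):
        """Merge remote gossip, dropping entries for members we do not know."""
        dropped = []
        for member_id, state in remote_state.items():
            if member_id in self.members:
                self.gossip[member_id] = state
            else:
                dropped.append(member_id)
        return dropped


a, b, c = Node("A"), Node("B"), Node("C")

# B and C connect to seed node A: A learns about each joiner,
# and each joiner learns about A (but not about the other joiner).
for joiner in (b, c):
    a.members.add(joiner.name)
    a.gossip[joiner.name] = joiner.gossip[joiner.name]
    joiner.members.add(a.name)

# A gossips its full state to B. The entry for C is filtered out because
# C is not in B's member list, so B never learns that C exists.
dropped_by_b = b.receive_gossip(a.gossip)
print(dropped_by_b)
print(sorted(b.gossip))
```

In this model the only place B could learn about C is the gossip entry it just discarded, which is the catch-22 described above.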

This works fine for other providers, where the source of truth for members is elsewhere. (Now that I'm writing this, I realize it might be an issue for other cluster providers as well, if the gossip arrives before a node has received the member updates from the cluster provider. I need to verify whether this is the case.)

rogeralsing commented 1 year ago

Some updates here. I made it so that gossip from unknown members is ignored and the sender is notified that the request was rejected. If the target node is later notified by the cluster provider that the sender is indeed a member, the sender will re-send the gossip, and it will now be accepted by the target.

I do believe this was an actual bug for other providers as well: gossip state that was sent before the target knew about the sender as a member got dropped.
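The reject-and-resend behaviour described in this comment could look roughly like the sketch below. It is an assumption-laden toy model rather than the actual change: `Node`, `handle_gossip`, `send_gossip`, `retry_pending`, and the idea of retrying on the next gossip tick are all hypothetical.

```python
class Node:
    """Toy node; names and structure are illustrative assumptions."""

    def __init__(self, name):
        self.name = name
        self.members = {name}   # members reported by the cluster provider
        self.gossip = {}        # accepted gossip state, keyed by member id
        self.pending = []       # (target, state) pairs that were rejected

    def handle_gossip(self, sender_name, state):
        """Return True if accepted, False if the sender is not a known member."""
        if sender_name not in self.members:
            return False
        self.gossip.update(state)
        return True

    def send_gossip(self, target, state):
        """Send state; remember it for a later retry if the target rejects it."""
        accepted = target.handle_gossip(self.name, state)
        if not accepted:
            self.pending.append((target, state))
        return accepted

    def retry_pending(self):
        """Re-send previously rejected gossip, e.g. on the next gossip tick."""
        retry, self.pending = self.pending, []
        return [self.send_gossip(target, state) for target, state in retry]


sender, target = Node("B"), Node("C")

# C does not know B yet, so the first attempt is rejected and queued.
first = sender.send_gossip(target, {"B": {"heartbeat": 1}})

# The cluster provider now reports B as a member of the cluster to C.
target.members.add("B")

# The resend is accepted and the state is no longer lost.
second = sender.retry_pending()
print(first, second, target.gossip)
```

The point of the sketch is that a rejection is reported back to the sender instead of the state being silently dropped, so the state survives until the target's member list catches up.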

rogeralsing commented 1 year ago

For the seed node provider, I'm thinking that the gossip layer could raise a GossipRejected event for this scenario, and the seed node provider could subscribe to it and explicitly add the sender as a member to the member list.

The next time the sender tries to resend, the request will be accepted.
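A rough sketch of that idea follows. Only the `GossipRejected` event name comes from the proposal; its `sender_id` field, the tiny `EventStream` stand-in, and the `Gossiper` and `SeedNodeProvider` classes are hypothetical and do not mirror the real Proto.Actor types.

```python
from dataclasses import dataclass


@dataclass
class GossipRejected:
    """Event raised when gossip from an unknown sender is rejected."""
    sender_id: str  # hypothetical field


class EventStream:
    """Very small publish/subscribe bus (a stand-in for illustration)."""

    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def publish(self, event):
        for handler in self._handlers:
            handler(event)


class Gossiper:
    """Gossip layer: rejects unknown senders and publishes an event about it."""

    def __init__(self, members, events):
        self.members = members
        self.events = events

    def handle_gossip(self, sender_id):
        if sender_id not in self.members:
            self.events.publish(GossipRejected(sender_id))
            return False
        return True


class SeedNodeProvider:
    """Provider: reacts to rejections by adding the sender to the member list."""

    def __init__(self, members, events):
        self.members = members
        events.subscribe(self.on_event)

    def on_event(self, event):
        if isinstance(event, GossipRejected):
            self.members.add(event.sender_id)


members = {"A"}                 # shared member list of the local node
events = EventStream()
gossiper = Gossiper(members, events)
provider = SeedNodeProvider(members, events)

first = gossiper.handle_gossip("B")    # rejected, but B is added as a member
second = gossiper.handle_gossip("B")   # the resend is accepted
print(first, second, sorted(members))
```

Note that this sketch adds every rejected sender without any validation; whether the provider should check the sender first is left open here.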

AqlaSolutions commented 10 months ago

Any ETA on the fix release, please? That way we can decide whether to wait or use a different cluster provider type.

rogeralsing commented 9 months ago

We merged a new version of the seed node provider yesterday. There are still things to fix, but the base is there for local development.

Things that need to be completed

rogeralsing commented 9 months ago

Closing, as it is now at least working better than it did initially.