LeonHartley / Coerce-rs

Actor runtime and distributed systems framework for Rust

Infinite reconnect attempt loop, preventing shard rebalancing #24

Closed by RedKinda 1 year ago

RedKinda commented 1 year ago

Steps to reproduce:

Now, the system that is still running enters an infinite reconnect loop instead of firing a Forget so that the shards can be rebalanced. The loop alternates between the Connect and Disconnect events on the RemoteClient handle. I think it makes sense to introduce a retry limit in a config somewhere and Forget the node after n retries.
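To make the suggestion concrete, here is a minimal sketch of a retry limit that turns into a Forget after n consecutive failures. It is not based on Coerce's actual code: `RetryConfig`, `RetryTracker` and `ReconnectDecision` are hypothetical names made up for illustration.

```rust
/// What the client should do after a failed reconnect attempt.
/// (Hypothetical type, not part of Coerce.)
#[derive(Debug, PartialEq, Eq)]
pub enum ReconnectDecision {
    /// Try connecting again.
    Retry,
    /// Give up on the node so its shards can be rebalanced.
    Forget,
}

/// Hypothetical config: how many consecutive failures to tolerate.
#[derive(Debug, Clone, Copy)]
pub struct RetryConfig {
    pub max_attempts: u32,
}

/// Tracks consecutive failed reconnect attempts for one remote node.
#[derive(Debug)]
pub struct RetryTracker {
    config: RetryConfig,
    failed_attempts: u32,
}

impl RetryTracker {
    pub fn new(config: RetryConfig) -> Self {
        Self { config, failed_attempts: 0 }
    }

    /// Record a failed attempt and decide whether to keep retrying.
    pub fn on_failure(&mut self) -> ReconnectDecision {
        self.failed_attempts = self.failed_attempts.saturating_add(1);
        if self.failed_attempts >= self.config.max_attempts {
            ReconnectDecision::Forget
        } else {
            ReconnectDecision::Retry
        }
    }

    /// A successful connection resets the failure counter.
    pub fn on_success(&mut self) {
        self.failed_attempts = 0;
    }
}
```

With `max_attempts: 3`, the first two calls to `on_failure` return `Retry` and the third returns `Forget`; calling `on_success` in between starts the count over.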

Also, stopping the second node and then starting it again gets the system back into a correct state: the node picks up both shards from Redis, and starting a second node afterwards rebalances correctly as well.

Awesome library btw :) excited to see more

LeonHartley commented 1 year ago

Hey @RedKinda, thanks a lot for checking it out!

I've tried a few times to reproduce this locally but I'm still unable to. This scenario is also unit tested: https://github.com/LeonHartley/Coerce-rs/blob/master/coerce/tests/test_remote_sharding_rebalancing.rs#L188

Relevant logs: https://gist.github.com/LeonHartley/ecc2369614b060731a00ab5207d6ffaf

The RemoteClient actor will keep trying to re-establish the connection at every interval, but it should still emit the Forget message immediately. I will definitely look at making more things configurable, and perhaps also add a back-off / maximum number of connection attempts.
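For reference, a back-off could look roughly like the sketch below: an exponential delay with an upper cap. This is only an illustration of the idea, not how Coerce schedules reconnects; the `Backoff` struct and its fields are assumed names.

```rust
use std::time::Duration;

/// Hypothetical back-off settings; none of these names come from Coerce.
#[derive(Debug, Clone, Copy)]
pub struct Backoff {
    /// Delay before the first retry.
    pub base: Duration,
    /// Upper bound on the delay between retries.
    pub max: Duration,
}

impl Backoff {
    /// Delay before the given attempt (0-based): base * 2^attempt, capped at `max`.
    pub fn delay(&self, attempt: u32) -> Duration {
        // checked_pow / checked_mul avoid overflow for large attempt counts;
        // any overflow simply falls back to the cap.
        let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
        self.base
            .checked_mul(factor)
            .map_or(self.max, |d| d.min(self.max))
    }
}
```

With a 500 ms base and a 10 s cap, the delays grow 500 ms, 1 s, 2 s, 4 s, 8 s and then stay at 10 s. A maximum attempt count could be layered on top by stopping once `attempt` reaches a configured limit.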

I'm wondering why it's failing for you - what platform are you running this on, and if possible please could you provide me with the logs? I'd love to get to the bottom of this! :-)

Thanks again, Leon

LeonHartley commented 1 year ago

As soon as the client is disconnected, the client will change its state to Idle, then upon the next client-side ping tick, it should emit the Forget message, provided that the state is still Idle:

https://github.com/LeonHartley/Coerce-rs/blob/master/coerce/src/remote/net/client/connect.rs#L322

https://github.com/LeonHartley/Coerce-rs/blob/master/coerce/src/remote/net/client/ping.rs#L44
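The behaviour described above can be modelled as a small state machine. The sketch below is a simplified illustration and not the code behind the links: `ClientState`, `Event` and `Client` are stand-in names, and the `forgotten` flag (so that Forget is emitted only once per disconnect) is an assumption.

```rust
/// Simplified connection state (illustrative, not Coerce's actual type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ClientState {
    Connected,
    Idle,
}

/// Messages this sketch can emit to the rest of the system.
#[derive(Debug, PartialEq, Eq)]
pub enum Event {
    Forget,
}

/// Stand-in for the client actor's state.
#[derive(Debug)]
pub struct Client {
    state: ClientState,
    /// Assumption: Forget is emitted at most once per disconnect.
    forgotten: bool,
}

impl Client {
    pub fn connected() -> Self {
        Self { state: ClientState::Connected, forgotten: false }
    }

    /// Losing the connection moves the client to Idle.
    pub fn on_disconnect(&mut self) {
        self.state = ClientState::Idle;
    }

    /// A successful reconnect moves the client back to Connected.
    pub fn on_connect(&mut self) {
        self.state = ClientState::Connected;
        self.forgotten = false;
    }

    /// Called on every ping tick: emit Forget only if still Idle.
    pub fn on_ping_tick(&mut self) -> Option<Event> {
        if self.state == ClientState::Idle && !self.forgotten {
            self.forgotten = true;
            Some(Event::Forget)
        } else {
            None
        }
    }
}
```

In this model, a client that reconnects before the next ping tick never emits Forget, while one that is still Idle at the tick does.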

RedKinda commented 1 year ago

Hmmmmmmm, I can't seem to reproduce this anymore 🤔 I was running a slightly older version at the time, so possibly it was fixed in the meantime? I'll give it a spin later in the week, but it might have been just temporary.