wilsonianb opened this issue 4 years ago
A safer behavior would be to only communicate cluster members to other members that are locally configured, not ones that are "learned" from others. Maybe this is already the case; I haven't checked the code or run experiments. If it isn't, then this would probably be a fix: remove the cluster member from all nodes that have it configured, restart those, then restart all remaining nodes.
Thanks @wilsonianb. I agree this is a problematic design and we should consider how to fix it. To be honest, I don't quite like how clusters are done, nor do I care very much for the way we communicate cluster updates. Consider, instead, something UNL-like:
[cluster]
uri="http://example.com/cluster/foo.toml"
pubkey="5266556A586E327235753778214125442A472D4B6150645367566B5970337336"
The server can then retrieve the specified file, verify it's signed with the appropriate key and import the list of cluster nodes from that.
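For illustration, the retrieved file could look something like the sketch below. Nothing here pins down an actual layout, so the section names, placeholder keys, and signature field are all assumptions:

```
# cluster/foo.toml -- hypothetical layout, nothing here is specified yet
# One node public key per line, with an optional human-readable name,
# mirroring today's [cluster_nodes] entries.
[cluster_nodes]
<node-public-key-1> cluster-node-1
<node-public-key-2> cluster-node-2

# Detached signature over the list above, verifiable against the pubkey
# configured in the local [cluster] stanza.
[signature]
value = "<hex-encoded-signature>"
```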
I guess one complaint/issue might be that putting up a website is a problem/pain? rippled could push the updated list out to other cluster members over the peer gossip protocol too, similar to how #3072 has a server propagate its trusted UNLs over the peer gossip protocol.
I think it's more important to borrow the live-updating aspect of UNLs than the signed remote/forwarded list, since the operator should be in control of all clustered nodes and can use existing configuration management solutions (Ansible, etc.) to update local files.
Maybe a local cluster configuration file (with both [ips_fixed] and [cluster_nodes]) could be "included" in rippled.cfg (https://github.com/ripple/rippled/issues/2956)? But then that also doesn't necessarily help with live updating :thinking:
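For what it's worth, such an included file would probably only need the two existing stanzas. The addresses, ports, and key placeholders below are made up for illustration, so take the exact syntax with a grain of salt:

```
# cluster.cfg -- hypothetical file pulled into rippled.cfg via an include
# Fixed peer addresses of the other cluster members (IP and peer port).
[ips_fixed]
10.0.0.11 51235
10.0.0.12 51235

# Node public keys of the other cluster members, each with an optional name.
[cluster_nodes]
<node-public-key-1> cluster-node-1
<node-public-key-2> cluster-node-2
```

Configuration management (Ansible, etc.) could then template just this file and leave the rest of rippled.cfg alone.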
https://github.com/ripple/rippled/tree/develop/src/ripple/overlay#configuration
:point_up_2: This seems to make a rippled cluster operator unable to remove a cluster member without simultaneously restarting all rippleds still in the cluster.
Rolling restarts after removing the cluster member from [cluster_nodes] will cause the still-running rippled(s) to tell the restarted rippled that the removed member is still in the cluster.
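To make the failure mode concrete, here is a sketch of the relevant stanza on a three-node cluster (A, B, C) when trying to remove C; the keys and names are placeholders:

```
# rippled.cfg on node A, after deleting node C's entry and restarting A
[cluster_nodes]
<node-public-key-B> node-B

# Node B has not been restarted yet, so it still gossips node C's
# membership back to A, and A re-learns the member that was just removed.
```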