wilsonianb opened this issue 4 years ago
A safer behavior would be to only communicate cluster members to other members that are locally configured, not ones that are "learned" from others. Maybe this is already the case; I haven't checked the code or run experiments. If it isn't, then this would probably be a fix: remove the cluster member from all nodes that have it configured, restart those, then restart all remaining nodes.
Thanks @wilsonianb. I agree this is a problematic design and we should consider how to fix it. To be honest, I don't quite like how clusters are done, nor do I care very much for the way we communicate cluster updates. Consider, instead, something UNL-like:
[cluster]
uri="http://example.com/cluster/foo.toml"
pubkey="5266556A586E327235753778214125442A472D4B6150645367566B5970337336"
The server can then retrieve the specified file, verify it's signed with the appropriate key and import the list of cluster nodes from that.
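For illustration, the retrieved file could look something like the sketch below. Nothing here pins down an actual layout, so the section names, placeholder keys, and signature field are all assumptions:

```
# cluster/foo.toml -- hypothetical layout, nothing here is specified yet
# One node public key per line, with an optional human-readable name,
# mirroring today's [cluster_nodes] entries.
[cluster_nodes]
<node-public-key-1> cluster-node-1
<node-public-key-2> cluster-node-2

# Detached signature over the list above, verifiable against the pubkey
# configured in the local [cluster] stanza.
[signature]
value = "<hex-encoded-signature>"
```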
I guess one complaint/issue might be that putting up a website is a problem/pain? rippled could push the updated list out to other cluster members over the peer gossip protocol too, similar to how #3072 has a server propagate its trusted UNLs over the peer gossip protocol.
I think it's more important to borrow the live-updating aspect of UNLs than the signed remote/forwarded list, since the operator should be in control of all clustered nodes and can use existing configuration management solutions (Ansible, etc.) to update local files.
Maybe a local cluster configuration file (with both [ips_fixed] and [cluster_nodes]) could be "included" in rippled.cfg (https://github.com/ripple/rippled/issues/2956)? But then that also doesn't necessarily help with live updating :thinking:
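For what it's worth, such an included file would probably only need the two existing stanzas. The addresses, ports, and key placeholders below are made up for illustration, so take the exact syntax with a grain of salt:

```
# cluster.cfg -- hypothetical file pulled into rippled.cfg via an include
# Fixed peer addresses of the other cluster members (IP and peer port).
[ips_fixed]
10.0.0.11 51235
10.0.0.12 51235

# Node public keys of the other cluster members, each with an optional name.
[cluster_nodes]
<node-public-key-1> cluster-node-1
<node-public-key-2> cluster-node-2
```

Configuration management (Ansible, etc.) could then template just this file and leave the rest of rippled.cfg alone.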
https://github.com/ripple/rippled/tree/develop/src/ripple/overlay#configuration
:point_up_2: This seems to make a rippled cluster operator unable to remove a cluster member without simultaneously restarting all rippleds still in the cluster.
Rolling restarts after removing the cluster member from [cluster_nodes] will cause the still-running rippled(s) to tell the restarted rippled that the removed member is still in the cluster.
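To make the failure mode concrete, here is a sketch of the relevant stanza on a three-node cluster (A, B, C) when trying to remove C; the keys and names are placeholders:

```
# rippled.cfg on node A, after deleting node C's entry and restarting A
[cluster_nodes]
<node-public-key-B> node-B

# Node B has not been restarted yet, so it still gossips node C's
# membership back to A, and A re-learns the member that was just removed.
```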