gossip: revisit bootstrap address persistence #18056

Closed: tbg closed this issue 3 years ago

tbg commented 7 years ago

In https://github.com/cockroachdb/cockroach/pull/3711, we added functionality that lets gossip store its peers' addresses to disk so that they can be reused when the server is restarted (perhaps without --join flags).

But this mechanism is problematic. Some examples (all variations of more or less the same theme):

  1. it adds a degree of magic that can be confusing. "Why is my cluster trying to connect to some address I didn't specify?"
  2. it can be dangerous: say you remove a node from a cluster, update all your deployments, and then use that node in another cluster. Magically, the two clusters will still (at least try to) talk to one another. Even if we manage to handle that gracefully (I doubt we do at the moment), it seems bad.
  3. it can be dangerous #2: now that nodes can be running different versions, we need to be more careful.
  4. in parallel acceptance testing, nodes can end up talking to old ports that are now used by a different cluster.

As far as I know, we don't suggest to users that they can omit the --join flags; that is, we have avoided mentioning the magic. For that reason, my hunch is that we should simply rip out https://github.com/cockroachdb/cockroach/pull/3711. If I'm overlooking some reason why this must remain in the code, we should at least be able to disable it.

tbg commented 7 years ago

There could be a migration concern for existing clusters that only "work" because of this: in that case we want to leave in the code that reads the storage, but remove the code that writes it.

bdarnell commented 7 years ago

For more background see #2938.

We don't suggest that users omit the --join flags on restart, but I think we might suggest that they can be lazy about updating them. With persistent gossip bootstrapping, you only have to ensure that your join flags are up to date when you add a new node; without it you need to do so any time a node restarts. Removing this gossip persistence mechanism would probably result in more frequent node restarts to keep configs up to date in some deployment scenarios.

But I'm not sure that's a good enough reason to keep this bit of magic given the downsides you mention.

a-robinson commented 7 years ago

I'm not sold on this. Being able to just specify one node in your join flags is really convenient and simplifies setup, but if we don't maintain a persistent bootstrap list it creates a single point of failure during restarts.

To respond to your points:

  1. I haven't seen or heard this come up as an issue. Has it been common?
  2. Yup, we need to handle this more gracefully. But we could achieve the same purpose by removing addresses that are no longer associated with a live node from the bootstrap list (see the sketch after this list).
  3. I don't know that bootstrap addresses add much risk here. We already need to protect against such issues, which operator error could cause just as easily.
  4. This is annoying, but doesn't seem worth making things harder on users. There must be other ways of improving the test infrastructure.
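
To make point 2 concrete, here is a minimal Go sketch of pruning a persisted bootstrap list by liveness; the `BootstrapInfo` type and the `isLive` callback are stand-ins for illustration, not our actual gossip or liveness APIs.

```go
package gossip

// BootstrapInfo is a stand-in for the persisted bootstrap state: the peer
// addresses remembered from a previous run.
type BootstrapInfo struct {
	Addresses []string
}

// PruneDeadAddresses keeps only the addresses that still map to a live node,
// so that a persisted bootstrap list cannot keep pointing at decommissioned
// or repurposed machines indefinitely.
func PruneDeadAddresses(info *BootstrapInfo, isLive func(addr string) bool) {
	kept := info.Addresses[:0]
	for _, addr := range info.Addresses {
		if isLive(addr) {
			kept = append(kept, addr)
		}
	}
	info.Addresses = kept
}
```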
rjnn commented 7 years ago

I'm fine with this feature remaining, as long as it can be disabled on node startup. Consider the following scenario:

Nodes 1, 2, and 3 form a cluster (call it orig). We shut them down, clone the store directories, start them back up, and then start up nodes clone-1, clone-2, and clone-3. Why would I do this? Well, I already do this regularly with my postgres-based website because I want my dev database to be pretty close to the production database. I get that I should be using backups or dumps, but the former is an enterprise feature, and the latter is potentially much slower than a disk copy (which can be parallelized across the cluster, as long as you're willing to suffer some downtime).

However, when I start clone-1, I need to make sure it never talks to orig-2 and orig-3. If I mess up the network config and don't sandbox the clones, clone-1 is going to connect to the originals immediately on startup. I don't know what happens when two copies of a single node both join the gossip network, but I imagine it's not good, and potentially hard to recover from. At the very least, it would be nice to have a flag such as --cleanslate that tells the node on startup to never talk to anyone that's not explicitly listed in the --join flag.
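
To make the idea concrete, here is a minimal sketch of how such a hypothetical --cleanslate option could gate the use of persisted addresses at startup; the function and its arguments are made up for illustration, not actual cockroach code.

```go
package gossip

// bootstrapAddresses combines the explicit --join targets with, unless
// cleanSlate is set, the addresses a previous run persisted to disk.
// cleanSlate models the hypothetical --cleanslate flag suggested above;
// persisted would come from the store directory.
func bootstrapAddresses(joinFlags, persisted []string, cleanSlate bool) []string {
	addrs := append([]string(nil), joinFlags...)
	if cleanSlate {
		// Never fall back to remembered peers: only --join targets are used,
		// which keeps a cloned node from contacting the original cluster.
		return addrs
	}
	return append(addrs, persisted...)
}
```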

tbg commented 7 years ago

I'm not sold on this. Being able to just specify one node in your join flags is really convenient and simplifies setup, but if we don't maintain a persistent bootstrap list it creates a single point of failure during restarts.

That's a good point, but I think the problem is the gossip connection semantics themselves. What we want is to be able to say "just point the node anywhere in the cluster, and things will be fine", but that's not true (as you know). Below are some thoughts on how I think we might address this. The reason I post them here is that I think addressing the connection semantics properly removes the need for bootstrap persistence and avoids FUD around join flags. Curious what you think.


For anyone just popping in, the problem is that if you have, say, five nodes and you point the first three of them at each other and the remaining two at each other, then it obviously can't work out, because this graph (as an undirected graph) has two connected components. But even if you bridge the gap, it's not obvious that things will come together efficiently because of per-node connection limits: say everyone knows only node 1 and nobody else; node 1 may start to refuse connections. It does so by redirecting to other nodes, but you could imagine cooking up a scenario with two components connected only by two nodes that know each other. If those two are both full, they'll never allow other clients to bridge the divide. If they don't happen to be connected to each other, the gossip network is disconnected. This situation should resolve itself because only one side will have the up-to-date first-range gossip, and so the other side will try to reach out to other nodes (which eventually the "gateway node" should also do), but it could take a while.

This is the case that's made less likely by persistence: each node basically memorizes everyone it talked to when it was last running, so once the hypothetical divide has been crossed, it'll remain crossed across node restarts. That is, unless some network problem causes enough nodes to drop the crucial entries from their bootstrap info.

I still feel fairly strongly that the persistence just glosses over the underlying deficiency discussed above: it doesn't solve the problem, it merely masks it and makes it less straightforward to reason about.

Formally, I believe that what we're after is that the (undirected) graph formed by the join flags (where node X having node Y in its join flags adds an edge between X and Y) is connected (as in, for any pair of nodes, there's an undirected path connecting them). In fact, for resilience we want it to be k-connected for k as large as possible (maximally connected: #nodes-1-connected).

For example, the typical setup in which each node joins all nodes with smaller ids is maximally connected: in an n-node cluster, each node has n-1 edges (node i joins the i-1 smaller ids and is joined by the n-i larger ones), and you have to remove all of them to isolate the node.

1----2----3----4----5
|    |    |    |    |
+----+----+    |    |
+----+----+----+    |
+----+----+----+----+

Similarly, the case in which there's a round robin would also expand to "maximally connected" appropriately.

On the other hand, having all nodes join only the first node is a bad choice because it's 1-connected: remove a single edge and you have an isolated node. Even worse, imagine the first node is down; then nobody can connect to anyone else. So this pattern is clearly something to avoid, both in tutorials and production (note how the explicit init step in v1.1 can easily specify fully-connected join flags).

Once we're more principled about which join setups we suggest, we can also highlight bad configurations as a cluster problem in the admin UI. It's straightforward to compute whether a cluster is maximally connected (via its live nodes).
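
As a rough sketch of such a check (this tests plain connectivity of the undirected join graph over the live nodes; a full k-connectivity check would additionally need a vertex-connectivity computation), something along these lines would do. The node IDs and adjacency representation are illustrative, not actual cockroach code:

```go
package main

import "fmt"

// joinGraphConnected reports whether the undirected graph formed by the
// join flags of the given (live) nodes is connected. joins[x] lists the
// nodes that x names in its --join flags.
func joinGraphConnected(nodes []int, joins map[int][]int) bool {
	if len(nodes) == 0 {
		return true
	}
	// Build an undirected adjacency map: an edge exists if either side
	// names the other.
	adj := map[int][]int{}
	for x, ys := range joins {
		for _, y := range ys {
			adj[x] = append(adj[x], y)
			adj[y] = append(adj[y], x)
		}
	}
	// BFS from an arbitrary node and check that every node was reached.
	seen := map[int]bool{nodes[0]: true}
	queue := []int{nodes[0]}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, next := range adj[cur] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, next)
			}
		}
	}
	for _, n := range nodes {
		if !seen[n] {
			return false
		}
	}
	return true
}

func main() {
	// The five-node example from above: nodes 1-3 point at each other,
	// nodes 4-5 point at each other, and nothing bridges the two components.
	joins := map[int][]int{2: {1}, 3: {1, 2}, 5: {4}}
	fmt.Println(joinGraphConnected([]int{1, 2, 3, 4, 5}, joins)) // false
}
```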

NB: The graph is treated as undirected by operating under the assumption that if nodeA can reach out to nodeB, it will. This assumption isn't strictly true, but it can be made true (for the purposes of exchanging join flags, perhaps replacing the persistence mechanism, though DNS/load balancer addresses would have to be special-cased to avoid "infinite work"). Treating the graph as directed would be too restrictive a model and usually results in 1-connected graphs for generally highly functional setups.

See also https://github.com/cockroachdb/cockroach/issues/15538, which is perhaps somehow related to solving this.

bdarnell commented 7 years ago

Nodes 1,2, and 3 form a cluster (call it orig). We shut them down, clone the store directories, start them back up, and then start up nodes clone-1, clone-2, and clone-3.

Note that doing this without taking down all three nodes simultaneously can result in inconsistencies.

+1 on at least having an option to disregard persisted gossip addresses and use only the addresses given on the new command line.

For example, the typical setup in which the node with id N joins all smaller ids is maximally connected

But what does the first node do? When it crashes, the remaining N-1 nodes will reconnect among themselves to ensure that they have the desired number of connections, and then no one will try to reconnect to the first node until something changes. Either we need to proactively reconnect periodically to all of our join targets (ensuring that those edges are always active) or some nodes need to be given join targets with higher IDs. My general preference is to use the first K nodes as join targets for all nodes.
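
To make the "first K nodes" suggestion concrete, here is a tiny sketch that builds the same --join flag for every node (the addresses and port are illustrative, and it assumes --join accepts a comma-separated list):

```go
package main

import (
	"fmt"
	"strings"
)

// joinFlag builds a --join flag pointing at the first k nodes of the
// cluster. Every node, including the first k themselves, gets the same
// flag, so the join graph stays usable even while node 1 is down.
func joinFlag(nodeAddrs []string, k int) string {
	if k > len(nodeAddrs) {
		k = len(nodeAddrs)
	}
	return "--join=" + strings.Join(nodeAddrs[:k], ",")
}

func main() {
	addrs := []string{"node1:26257", "node2:26257", "node3:26257", "node4:26257", "node5:26257"}
	fmt.Println(joinFlag(addrs, 3)) // --join=node1:26257,node2:26257,node3:26257
}
```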

a-robinson commented 7 years ago

So if I'm reading it correctly, your suggestion is essentially to change our examples/instructions and to highlight bad configurations in the UI. Stated as a tradeoff, would it be fair to say that doing so would be adding to the cognitive burden of setting up and operating Cockroach in order to keep us from needing to harden the system against such problems? If we really don't think that we can harden the system, then that tradeoff may be worth making. I'm not yet convinced that we can't do so, though.

Formally, I believe that what we're after is that the (undirected) graph formed by the join flags (where node X having node Y in its join flags adds an edge between X and Y) is connected (as in, for any pair of nodes, there's an undirected path connecting them). In fact, for resilience we want it to be k-connected for k as large as possible (maximally connected: #nodes-1-connected).

That may be what we're after, but as I think you're aware, we currently need the directed graph formed by join flags to be connected (https://github.com/cockroachdb/cockroach/issues/18027, https://github.com/cockroachdb/cockroach/issues/13027), with the caveat that each node with a copy of the first range can be considered as having an edge pointing to all nodes that have an edge to it.

a-robinson commented 7 years ago

@arjunravinarayan that's a good scenario. A flag would be one way around it. Another would be to ignore previously stored bootstrap addresses if the node is running at a different address than the one it previously stored as its own.
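
A rough sketch of that second option, with an illustrative shape for the persisted state (not the actual bootstrap-info format):

```go
package gossip

// storedBootstrapInfo is an illustrative shape for what a node might persist:
// its own advertised address at the time of writing, plus the peers it knew.
type storedBootstrapInfo struct {
	OwnAddress string
	Peers      []string
}

// usableBootstrapAddresses returns the persisted peer list only if this node
// still advertises the address it stored for itself; otherwise the persisted
// state is presumed stale (for example, a cloned store directory started
// under a new address) and is ignored.
func usableBootstrapAddresses(persisted storedBootstrapInfo, currentAddr string) []string {
	if persisted.OwnAddress != currentAddr {
		return nil
	}
	return persisted.Peers
}
```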

tbg commented 7 years ago

So if I'm reading it correctly, your suggestion is essentially to change our examples/instructions and to highlight bad configurations in the UI.

No, I probably didn't express this clearly. I'd like to end up with something that's safe and easy to use. I think that will mean that we'll have to be a little more principled in how we tell users to set up their clusters, but hopefully not too restrictive. For instance, if you provision a mostly-fixed set of nodes, there's no good reason to not point "everyone at everyone". I just don't consider our current mechanism safe and so the leniency with which we instruct our users rubs me the wrong way.

If we really don't think that we can harden the system, then that tradeoff may be worth making. I'm not yet convinced that we can't do so, though.

My peeve with the persistence is that (imo) it isn't a hardening, it's a sweep-under-the-ruggening: it doesn't make these configurations safe, it just makes them seem safe (and, supposedly, rarely break). There are also pet peeves like spurious errors in dynamically provisioned environments, where nodes will routinely try to connect to old nodes, but I have less of a strong argument against those after https://github.com/cockroachdb/cockroach/issues/18058.

But what does the first node do? When it crashes, the remaining N-1 nodes will reconnect among themselves to ensure that they have the desired number of connections, and then no one will try to reconnect to the first node until something changes. Either we need to proactively reconnect periodically to all of our join targets (ensuring that those edges are always active) or some nodes need to be given join targets with higher IDs. My general preference is to use the first K nodes as join targets for all nodes.

Yeah, that's pretty unfortunate. I don't think it can be solved by periodic re-connecting because we can't make that process eager enough to avoid annoying delays when restarting a node. I like the suggestion to use the first K nodes (for a large enough K; the replication factor is probably a good heuristic?) because that's straightforward to explain and configure. Yes, it's a bit more to type/copy-paste for completely manual deployments. But in practice, anyone in their right mind will have an Ansible playbook to do this kind of thing (one that we provide?), and then it's really straightforward.

Perhaps we can make gossip persistence opt-in? Dynamically provisioned environments don't need it (since they point at some load balancer), and for statically provisioned ones, well, if there remains a good reason to want it, knock yourself out. Or we fold persistence into the join flag, like so: `--join=127.0.0.1:55812?persist=7` (persist this address, remove it after 7 successive failed connection attempts; can be set to -1), though that makes it unclear what happens to previously persisted addresses when the join flags change. (But then again, this is also a relevant question: if the join flags change, doesn't that somehow signal that old persisted addresses should be expired?)
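
For illustration only, parsing that hypothetical `--join=<addr>?persist=<n>` syntax could look roughly like this (nothing here exists in the code today):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// joinTarget is the result of parsing one hypothetical
// --join=<addr>?persist=<n> value.
type joinTarget struct {
	addr string
	// persist is the number of successive failed connection attempts after
	// which a persisted copy of this address would be removed, per the
	// proposal above; 0 means the option was not given.
	persist int
}

func parseJoinTarget(s string) (joinTarget, error) {
	addr, opts, found := strings.Cut(s, "?")
	t := joinTarget{addr: addr}
	if !found {
		return t, nil
	}
	for _, opt := range strings.Split(opts, "&") {
		k, v, _ := strings.Cut(opt, "=")
		if k == "persist" {
			n, err := strconv.Atoi(v)
			if err != nil {
				return t, fmt.Errorf("invalid persist value %q: %v", v, err)
			}
			t.persist = n
		}
	}
	return t, nil
}

func main() {
	t, _ := parseJoinTarget("127.0.0.1:55812?persist=7")
	fmt.Printf("%+v\n", t) // {addr:127.0.0.1:55812 persist:7}
}
```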

bdarnell commented 7 years ago

if the join flags change, doesn't that somehow signal that old persisted addresses should be expired

That sounds like a good heuristic to me, if we don't get rid of gossip persistence completely.