dotmesh-io / dotmesh

dotmesh (dm) is like git for your data volumes (databases, files etc) in Docker and Kubernetes
https://dotmesh.com
Apache License 2.0

Bootstrap etcd data from a special dot #359

Open prisamuel opened 6 years ago

prisamuel commented 6 years ago

Recovery options:

alaric-dotmesh commented 6 years ago

With further thought, I think the local storage should just be a ZFS filesystem in the pool and not a dot. There's no point in replicating it with dotmesh when all nodes see the same etcd and snapshot it; we don't need an infinite history of commits, and the concurrency limit of writing to a dot might be a problem.

We should store a "backup snapshot counter" in etcd. Every node will periodically dump all of etcd into a local ZFS filesystem (whenever a "dirty" flag has been set, which happens after operations that update state we really care about, or N minutes after the last dump), atomically get+increment the snapshot counter in etcd, and write the snapshot counter and a wall-clock time into ZFS. The writes into ZFS should be done atomically, by writing into a new file called snapshot.tmp and then renaming it to snapshot when it's finished. The file should have a simple header containing the snapshot counter and wall time, a delimiter, and then the raw etcd dump in JSON format.
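
A rough Go sketch of that write path (the etcd helpers and the ZFS mount path are hypothetical placeholders, not the real dotmesh code):

```go
package sketch

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// Hypothetical stand-ins for the real etcd client calls.
func getAndIncrementSnapshotCounter() (int64, error) { return 0, nil } // atomic get+increment in etcd
func dumpEtcdJSON() ([]byte, error)                  { return []byte("{}"), nil }

// writeSnapshot dumps etcd into the local ZFS filesystem atomically: write
// snapshot.tmp first, then rename it to snapshot once it is complete.
func writeSnapshot(zfsMount string) error {
	counter, err := getAndIncrementSnapshotCounter()
	if err != nil {
		return err
	}
	dump, err := dumpEtcdJSON()
	if err != nil {
		return err
	}
	// Simple header (snapshot counter + wall-clock time), a delimiter, then the raw JSON dump.
	header := fmt.Sprintf("counter: %d\ntime: %s\n---\n", counter, time.Now().UTC().Format(time.RFC3339))
	tmp := filepath.Join(zfsMount, "snapshot.tmp")
	if err := os.WriteFile(tmp, append([]byte(header), dump...), 0600); err != nil {
		return err
	}
	// Rename is atomic within the same filesystem, so a reader never sees a half-written snapshot.
	return os.Rename(tmp, filepath.Join(zfsMount, "snapshot"))
}
```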

On startup, the node should consider several cases.

Inputs:

  1. There is a snapshot file in local ZFS (because this zpool isn't new).
  2. There is a snapshot counter in etcd (because the etcd state already exists).

Cases:

  1. No snapshot counter in ZFS, nor etcd - new node on a new etcd - proceed as usual
  2. No snapshot counter in ZFS, one in etcd - new node added to existing cluster, proceed as usual
  3. Snapshot counter in ZFS, none in etcd - go into RECOVERY MODE
  4. Snapshot counter in ZFS and it's lower than the one in etcd - existing node returning to an existing cluster, proceed as usual
  5. Snapshot counter in ZFS and it's higher than the one in etcd - REFUSE TO START as etcd has been rolled back; require manual confirmation to start (which will cause the loss of our snapshot state in ZFS as the backup system later replaces it with an older one), or let the user nuke etcd and re-start recovery mode with this node present.
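
As a rough Go sketch, the case analysis boils down to something like this (a nil counter means "not found"):

```go
// Sketch of the startup decision. zfsCounter/etcdCounter are nil when no
// snapshot counter was found in that place.
func decideStartupMode(zfsCounter, etcdCounter *int64) string {
	switch {
	case zfsCounter == nil:
		// Cases 1 and 2: new zpool, with or without existing etcd state.
		return "normal"
	case etcdCounter == nil:
		// Case 3: we have a local snapshot but etcd has lost its state.
		return "recovery"
	case *zfsCounter <= *etcdCounter:
		// Case 4: our snapshot is not ahead of etcd - existing node returning to an existing cluster.
		return "normal"
	default:
		// Case 5: etcd has been rolled back behind our snapshot; refuse to start.
		return "refuse"
	}
}
```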

"RECOVERY MODE" is:

  1. Set a flag so that all RPCs (except the special recovery ones explained below), all docker plugin actions, etc. return a useful error message saying something like "This node is in recovery mode (link to docs about what to do)"
  2. Announce in etcd our snapshot counter and wall-clock time of last snapshot, filed under our server ID: /dotmesh-io/recovery/offers/SERVER ID
  3. Watch for /dotmesh-io/recovery/proceed to be created.
  4. If /dotmesh-io/recovery/proceed is set to our server ID, start a goroutine that:
    1. Reads our ZFS snapshot into etcd, except for anything under /dotmesh-io/recovery
    2. Sets /dotmesh-io/recovery/complete to something.
  5. Wait for /dotmesh-io/recovery/complete to be set to something.
  6. Unset the recovery flag, and proceed with normal operation.
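
A rough Go sketch of that flow; putKey, waitForKey, restoreSnapshotIntoEtcd and setRecoveryFlag are hypothetical stand-ins for the real etcd client and node state:

```go
package sketch

import (
	"fmt"
	"time"
)

// Hypothetical helpers standing in for the real etcd client and node state.
func setRecoveryFlag(on bool)               {}                 // step 1: gates RPCs and docker plugin actions
func putKey(key, value string) error        { return nil }     // write a key into etcd
func waitForKey(key string) (string, error) { return "", nil } // block until the key exists, return its value
func restoreSnapshotIntoEtcd() error        { return nil }     // restore everything except /dotmesh-io/recovery

func runRecoveryMode(serverID string, counter int64, lastSnapshot time.Time) error {
	setRecoveryFlag(true)
	defer setRecoveryFlag(false) // step 6: back to normal operation once recovery completes

	// Step 2: announce our snapshot counter and wall-clock time under our server ID.
	offer := fmt.Sprintf("%d %s", counter, lastSnapshot.UTC().Format(time.RFC3339))
	if err := putKey("/dotmesh-io/recovery/offers/"+serverID, offer); err != nil {
		return err
	}

	// Step 3: wait for an operator to pick a node via DotmeshRPC.InitiateRecovery.
	chosen, err := waitForKey("/dotmesh-io/recovery/proceed")
	if err != nil {
		return err
	}

	// Step 4: if we were chosen, restore our snapshot and signal completion.
	if chosen == serverID {
		go func() {
			if restoreSnapshotIntoEtcd() == nil {
				putKey("/dotmesh-io/recovery/complete", "done")
			}
		}()
	}

	// Step 5: everyone (including the chosen node) blocks until recovery is complete.
	_, err = waitForKey("/dotmesh-io/recovery/complete")
	return err
}
```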

The special recovery RPCs are just:

  1. DotmeshRPC.GetRecoveryStatus - returns the recovery flag, and everything under /dotmesh-io/recovery from etcd (but in a nice format).
  2. DotmeshRPC.InitiateRecovery - sets /dotmesh-io/recovery/proceed to a specified server ID.

The dm command line tool needs to have an interface that calls DotmeshRPC.GetRecoveryStatus and dumps the results. If we're waiting for proceed, it suggests the most recent (highest snapshot counter) node and lets the user choose to set proceed to it with DotmeshRPC.InitiateRecovery, to pick another node (for whatever reason) instead, or to do nothing and wait.
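
The suggestion logic itself could be as simple as this sketch (the offers map, server ID to snapshot counter, is just an illustrative shape for what GetRecoveryStatus might return):

```go
// Sketch: suggest the node offering the most recent (highest-counter) snapshot.
func suggestRecoveryNode(offers map[string]int64) (serverID string, counter int64) {
	for id, c := range offers {
		if serverID == "" || c > counter {
			serverID, counter = id, c
		}
	}
	return serverID, counter
}
```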

Have I missed anything?

lukemarsden commented 6 years ago

LGTM, make it so! :)

alaric-dotmesh commented 6 years ago

The snapshot counter in etcd should be kept somewhere OTHER than /dotmesh-io/recovery, so that it gets restored from the snapshot. /dotmesh-io/recovery is purely for the running state of a recovery operation. However, we must ensure that the snapshot counter is the last thing to be recovered from the snapshot, just before /dotmesh-io/recovery/complete is set, as the snapshot counter's presence is the marker used by newly-starting nodes to tell if they need to go into recovery or not.
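
A small sketch of that ordering (snapshotCounterKey and putKey are illustrative placeholders):

```go
package sketch

import (
	"strconv"
	"strings"
)

const snapshotCounterKey = "/dotmesh-io/snapshot-counter" // placeholder path

func putKey(key, value string) error { return nil } // hypothetical etcd write

// restoreSnapshot writes everything back into etcd, holding the snapshot counter
// until last: its presence is what tells newly-starting nodes that etcd state
// exists, so it must not appear before the rest of the data is in place.
func restoreSnapshot(snapshot map[string]string, counter int64) error {
	for key, value := range snapshot {
		if strings.HasPrefix(key, "/dotmesh-io/recovery") || key == snapshotCounterKey {
			continue // running recovery state is never restored; the counter goes last
		}
		if err := putKey(key, value); err != nil {
			return err
		}
	}
	if err := putKey(snapshotCounterKey, strconv.FormatInt(counter, 10)); err != nil {
		return err
	}
	return putKey("/dotmesh-io/recovery/complete", "done")
}
```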

Let's consider edge cases of nodes starting up while other nodes are at various points in the process...

  1. Any node that starts before recovery is completed won't find a snapshot counter in etcd, so it will go into recovery. If recovery is already proceeding, it'll write its details into the registry of available snapshots and carry on to block for proceed and then complete. That's OK.
  2. A node that starts just as recovery has ended, and sees the snapshot already there, starts up as usual.
  3. A node with a more recent snapshot than the one picked to proceed with recovery just shrugs and runs with the older state. This might be a mistake, so here's a modification to the snapshot algorithm: when writing a snapshot to local ZFS storage, if the previous snapshot there is NEWER (has a higher snapshot counter) than the one we're about to write, move it to a name that won't get overwritten (orphan-snapshot-at-COUNTER-TIMESTAMP) and log the fact so somebody knows it's there.
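
A sketch of that guard (readSnapshotCounter is a hypothetical helper that parses the counter out of the snapshot header):

```go
package sketch

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

func readSnapshotCounter(path string) (int64, error) { return 0, nil } // hypothetical header parser

// preserveNewerSnapshot is called just before overwriting the local snapshot: if the
// existing one is newer than what we're about to write, move it aside and log it.
func preserveNewerSnapshot(zfsMount string, newCounter int64) error {
	existing := filepath.Join(zfsMount, "snapshot")
	oldCounter, err := readSnapshotCounter(existing)
	if err != nil {
		return nil // no previous snapshot (or unreadable): nothing to preserve
	}
	if oldCounter > newCounter {
		orphan := filepath.Join(zfsMount,
			fmt.Sprintf("orphan-snapshot-at-%d-%d", oldCounter, time.Now().Unix()))
		log.Printf("local snapshot %d is newer than %d; preserving it as %s", oldCounter, newCounter, orphan)
		return os.Rename(existing, orphan)
	}
	return nil
}
```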

lukemarsden commented 6 years ago

Does case 3 in https://github.com/dotmesh-io/dotmesh/issues/359#issuecomment-376594581 interact badly with case 5 in your "Cases" list in https://github.com/dotmesh-io/dotmesh/issues/359#issuecomment-376583768?

lukemarsden commented 6 years ago

How do we coordinate unsetting the dirty flag if it's in etcd?

lukemarsden commented 6 years ago

One of the benefits of storing the backup in a dot is that it can be pushed to a backup cluster as part of a backup that just involves pushing all the dots in a cluster to another cluster.

Maybe we should back up etcd to a dot and to a per-node etcd-backup filesystem for recovery mode?

This covers: (a) loss of etcd but retaining (some of) the ZFS filesystems; (b) complete cluster loss and recovery from off-site backup.

Recovery mode and its UX should ideally be able to recover from a dot with some special marking if no etcd-backup exists?

alaric-dotmesh commented 6 years ago

The dirty flag is per-node in RAM. That node sets it when it knows it's written something fun to etcd; it has to be per-node as it's unset when that node has done a snapshot. The goroutine that does the snapshots should clear the flag as soon as it's noticed it's set and started the snapshot, so if another event puts something cool in etcd while the snapshot is happening, a new snapshot will happen right after to make sure it's saved.
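
A minimal sketch of that pattern using sync/atomic (simplified to a fixed ticker; the "N minutes since last time" case is left out):

```go
package sketch

import (
	"sync/atomic"
	"time"
)

// dirty is per-node, in RAM only: 1 means something interesting has been written
// to etcd since this node's last snapshot.
var dirty int32

// markDirty is called after any etcd write we want captured in the next snapshot.
func markDirty() { atomic.StoreInt32(&dirty, 1) }

// snapshotLoop clears the flag *before* taking the snapshot, so anything written
// to etcd while the snapshot is in progress sets it again and triggers another one.
func snapshotLoop(interval time.Duration, takeSnapshot func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		if atomic.CompareAndSwapInt32(&dirty, 1, 0) {
			if err := takeSnapshot(); err != nil {
				atomic.StoreInt32(&dirty, 1) // keep it dirty so the next tick retries
			}
		}
	}
}
```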

alaric-dotmesh commented 6 years ago

Yeah, I like @lukemarsden's idea of putting the snapshots into a dot as a "second generation" while still having the simplicity and immediacy of a local dump! Let's say that, if the current node is master for the snapshot dot, the snapshot goroutine snapshots into it (and just leaves a marker in its local snapshot ZFS filesystem pointing to the snapshot dot's filesystem, which the recovery process follows - perhaps even a literal symlink...)

lukemarsden commented 6 years ago

Alternative: https://github.com/kopeio/etcd-manager

Update: doesn't run on Kube, so not suitable for our use-case.

lukemarsden commented 6 years ago

Alternative: see if upgrading etcd operator solves our woes.

Update: it doesn't.

lukemarsden commented 6 years ago

We've done a lot of the recovery work in rpc.go.