prisamuel opened this issue 6 years ago
With further thought, I think the local storage should just be a ZFS filesystem in the pool, not a dot. There's no point in replicating it with dotmesh when all nodes see the same etcd and snapshot it; we don't need an infinite history of commits, and the concurrency limit of writing to a dot might be a problem.
We should store a "backup snapshot counter" in etcd. Every node will periodically (whenever a "dirty" flag has been set, which is set after operations that update state we really care about; or N minutes after the last time) dump all of etcd into a local ZFS filesystem, atomically get+increment the snapshot counter in etcd, and write the snapshot counter and a wall-clock time into ZFS. The writes into ZFS should be done atomically, by writing into a new file called snapshot.tmp and then renaming it to snapshot when it's finished. The file should have a simple header with the snapshot counter and wall time in, a delimiter, then the raw etcd dump in JSON format.
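For concreteness, here's a minimal sketch in Go of the write-then-rename step (not dotmesh's actual code; snapshotDir, counter and etcdDumpJSON are assumed inputs, and the atomic get+increment of the counter in etcd is left out):

```go
// Minimal sketch of the atomic snapshot write described above; not the actual
// dotmesh code. snapshotDir, counter and etcdDumpJSON are assumed inputs, and
// the atomic get+increment of the counter in etcd is not shown here.
package snapshot

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// writeSnapshot writes a header (snapshot counter + wall-clock time), a
// delimiter, then the raw etcd dump, into snapshot.tmp, and atomically
// renames it to snapshot once the write has finished.
func writeSnapshot(snapshotDir string, counter uint64, etcdDumpJSON []byte) error {
	tmpPath := filepath.Join(snapshotDir, "snapshot.tmp")
	finalPath := filepath.Join(snapshotDir, "snapshot")

	f, err := os.Create(tmpPath)
	if err != nil {
		return err
	}
	// Simple header: counter and wall time, then a delimiter, then the dump.
	if _, err := fmt.Fprintf(f, "counter=%d time=%s\n---\n",
		counter, time.Now().UTC().Format(time.RFC3339)); err != nil {
		f.Close()
		return err
	}
	if _, err := f.Write(etcdDumpJSON); err != nil {
		f.Close()
		return err
	}
	// Flush to disk before renaming so a crash can't leave a torn "snapshot".
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// rename(2) is atomic within a filesystem, so readers only ever see a
	// complete snapshot file.
	return os.Rename(tmpPath, finalPath)
}
```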
On startup, the node should consider several cases.
Inputs:
Cases:
"RECOVERY MODE" is:
- Write our latest snapshot counter under /dotmesh-io/recovery/offers/SERVER ID.
- Wait for /dotmesh-io/recovery/proceed to be created.
- If /dotmesh-io/recovery/proceed is set to our server ID, start a goroutine that:
  - restores the local etcd snapshot into etcd (the running recovery state under /dotmesh-io/recovery is not part of what gets restored), and
  - sets /dotmesh-io/recovery/complete to something.
- Wait for /dotmesh-io/recovery/complete to be set to something.

The special recovery RPCs are just:
- DotmeshRPC.GetRecoveryStatus - returns the recovery flag, and everything under /dotmesh-io/recovery from etcd (but in a nice format).
- DotmeshRPC.InitiateRecovery - sets /dotmesh-io/recovery/proceed to a specified server ID.

The dm command line tool needs to have an interface that calls DotmeshRPC.GetRecoveryStatus and dumps the results. If we're waiting for proceed then it suggests the most recent (highest snapshot counter) node, and lets the user choose to set proceed to it with DotmeshRPC.InitiateRecovery, or to pick another node (for some reason) for DotmeshRPC.InitiateRecovery, or to do nothing and wait.
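To make the shape of this concrete, here's a rough illustration in Go; the struct and field names are guesses rather than dotmesh's real RPC types, but it shows the data GetRecoveryStatus surfaces and the "suggest the highest snapshot counter" rule dm would apply:

```go
// Rough illustration only: these struct shapes and names are guesses, not
// dotmesh's actual RPC types. It shows the data GetRecoveryStatus would
// surface and the "suggest the highest snapshot counter" rule the dm CLI
// would apply before calling InitiateRecovery.
package recovery

// Offer is roughly what a node in recovery mode would write under
// /dotmesh-io/recovery/offers/<server id>.
type Offer struct {
	ServerID        string
	SnapshotCounter uint64
}

// RecoveryStatus mirrors what GetRecoveryStatus returns: the recovery flag
// plus everything under /dotmesh-io/recovery, in a nice format.
type RecoveryStatus struct {
	InRecovery bool
	Offers     []Offer
	Proceed    string // server ID that proceed has been set to, if any
	Complete   bool
}

// SuggestServer picks the offer with the highest snapshot counter, which dm
// would present as the default choice for InitiateRecovery.
func SuggestServer(status RecoveryStatus) (serverID string, ok bool) {
	var bestCounter uint64
	for _, o := range status.Offers {
		if !ok || o.SnapshotCounter > bestCounter {
			serverID, bestCounter, ok = o.ServerID, o.SnapshotCounter, true
		}
	}
	return serverID, ok
}
```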
Have I missed anything?
LGTM, make it so! :)
The snapshot counter in etcd should be kept somewhere OTHER than /dotmesh-io/recovery, so that it gets restored from the snapshot. /dotmesh-io/recovery is purely for the running state of a recovery operation. However, we must ensure that the snapshot counter is the last thing to be recovered from the snapshot, just before /dotmesh-io/recovery/complete is set, as the snapshot counter's presence is the marker used by newly-starting nodes to tell if they need to go into recovery or not.
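A minimal sketch of that startup check, assuming the etcd v3 Go client and a hypothetical key name for the counter (dotmesh may store it elsewhere and use a different client):

```go
// A minimal sketch of the startup check described above. The key name and the
// use of the etcd v3 Go client are assumptions for illustration.
package recovery

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

const snapshotCounterKey = "/dotmesh-io/snapshot-counter" // hypothetical key

// needsRecovery reports whether the snapshot counter is missing from etcd,
// which is the marker a newly-starting node uses to decide whether to enter
// recovery mode.
func needsRecovery(ctx context.Context, cli *clientv3.Client) (bool, error) {
	resp, err := cli.Get(ctx, snapshotCounterKey)
	if err != nil {
		return false, err
	}
	return len(resp.Kvs) == 0, nil
}
```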
Let's consider edge cases of nodes starting up while other nodes are at various points in the process...
A node that starts up while a recovery is already underway just sees proceed and then complete. That's OK.

A node whose local snapshot is newer than the one chosen to proceed with recovery just shrugs and runs with the older state. This might be a mistake, so here's a modification to the snapshot algorithm: when writing a snapshot to local ZFS storage, if the previous snapshot there is NEWER (higher snapshot counter) than the snapshot we're about to take, move it to a name that won't get overwritten (orphan-snapshot-at-COUNTER-TIMESTAMP) and log the fact so somebody knows it's there.

Does case 3 in https://github.com/dotmesh-io/dotmesh/issues/359#issuecomment-376594581 interact badly with case 5 in your "Cases" list in https://github.com/dotmesh-io/dotmesh/issues/359#issuecomment-376583768?
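To make the orphan-snapshot modification above concrete, here's a hedged sketch; the file layout and the readCounter helper (which parses the counter out of a snapshot file's header) are assumptions, not the real implementation:

```go
// Sketch of the orphan-snapshot modification described above; the file layout
// and the readCounter helper are assumptions, not the real implementation.
package snapshot

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

// preserveNewerSnapshot checks the existing local snapshot and, if its counter
// is higher than the one we are about to write, moves it to a name that won't
// get overwritten and logs the fact so somebody knows it's there.
func preserveNewerSnapshot(snapshotDir string, newCounter uint64,
	readCounter func(path string) (uint64, error)) error {
	existing := filepath.Join(snapshotDir, "snapshot")
	if _, err := os.Stat(existing); os.IsNotExist(err) {
		return nil // no previous snapshot, nothing to preserve
	}
	oldCounter, err := readCounter(existing)
	if err != nil {
		return err
	}
	if oldCounter <= newCounter {
		return nil // the snapshot we're about to write is at least as new
	}
	orphan := filepath.Join(snapshotDir, fmt.Sprintf("orphan-snapshot-at-%d-%s",
		oldCounter, time.Now().UTC().Format("20060102T150405Z")))
	if err := os.Rename(existing, orphan); err != nil {
		return err
	}
	log.Printf("kept newer local snapshot (counter %d) as %s", oldCounter, orphan)
	return nil
}
```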
How do we coordinate unsetting the dirty flag if it's in etcd?
One of the benefits of storing the backup in a dot is that it can be pushed to a backup cluster as part of a backup that just involves pushing all the dots in a cluster to another cluster.
Maybe we should back up etcd to a dot and to a per-node etcd-backup filesystem for recovery mode?

This covers: (a) loss of etcd but retaining (some of) the ZFS filesystems; (b) complete cluster loss and recovery from off-site backup.

Recovery mode and its UX should ideally be able to recover from a dot with some special marking if no etcd-backup exists?
The dirty flag is per-node in RAM. That node sets it when it knows it's written something fun to etcd; it has to be per-node as it's unset when that node has done a snapshot. The goroutine that does the snapshots should clear the flag as soon as it's noticed it's set and started the snapshot, so if another event puts something cool in etcd while the snapshot is happening, a new snapshot will happen right after to make sure it's saved.
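Something like this, as a hedged sketch (the Snapshotter type, takeSnapshot callback and polling loop are illustrative, and the "N minutes since the last snapshot" trigger is omitted):

```go
// Hedged sketch of the per-node, in-RAM dirty flag described above; the
// Snapshotter type, takeSnapshot callback and polling interval are
// illustrative, not dotmesh's actual structure.
package snapshot

import (
	"sync/atomic"
	"time"
)

type Snapshotter struct {
	dirty int32 // per-node, in RAM: 1 means etcd has changed since the last snapshot
}

// MarkDirty is called after operations that write state we really care about.
func (s *Snapshotter) MarkDirty() {
	atomic.StoreInt32(&s.dirty, 1)
}

// Run is the snapshot goroutine: it clears the flag *before* snapshotting, so
// anything interesting written to etcd while the snapshot is in progress sets
// the flag again and a new snapshot follows right after.
func (s *Snapshotter) Run(takeSnapshot func() error, poll time.Duration) {
	for {
		if atomic.CompareAndSwapInt32(&s.dirty, 1, 0) {
			if err := takeSnapshot(); err != nil {
				// Don't lose the dirtiness if the snapshot failed.
				atomic.StoreInt32(&s.dirty, 1)
			}
		}
		time.Sleep(poll)
	}
}
```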
Yeah, I like @lukemarsden's idea of putting the snapshots into a dot as a "second generation" while still having the simplicity and immediacy of a local dump! Let's say that the snapshot goroutine, if the current node is master for the snapshot dot, snapshots into it (and just leaves a marker in its snapshot ZFS filesystem pointing to the snapshot dot filesystem, which the recovery process follows - perhaps even a literal symlink...)
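A tiny sketch of the symlink-marker idea, with made-up paths:

```go
// Tiny sketch of the "literal symlink" marker idea; the paths and names here
// are made up for illustration.
package snapshot

import (
	"os"
	"path/filepath"
)

// leaveDotMarker points a symlink in the per-node snapshot filesystem at the
// snapshot dot's filesystem, so the recovery process can follow it.
func leaveDotMarker(snapshotDir, dotFilesystemPath string) error {
	marker := filepath.Join(snapshotDir, "snapshot-dot")
	_ = os.Remove(marker) // drop any stale marker first
	return os.Symlink(dotFilesystemPath, marker)
}
```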
Alternative: https://github.com/kopeio/etcd-manager
Update: doesn't run on Kube, so not suitable for our use-case.
Alternative: see if upgrading etcd operator solves our woes.
Update: it doesn't.
We've done a lot of the recovery work in rpc.go.
Recovery options: