In the event of a catastrophic etcd cluster failure, etcd should be able to restart itself and initialize into a previous known-good state.
Cluster failure happens when all of the nodes in an etcd cluster are terminated.
Currently, when a cluster fails, the first node to recover re-initializes the discovery process with the etcd-discovery service, but it does not recover the data.
What we want is for one or more nodes in a cluster to ship WAL logs (and perhaps full backups as well) to a known location at periodic intervals. Then, when a cluster fails, the recovering node should fetch the last successful backup and import the data from it. A sketch of what the shipping side could look like follows below.
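As a rough illustration of the shipping side, here is a minimal sketch that periodically invokes `etcdctl backup` and writes the result to a known backup location. The paths, the interval, and the timestamped directory layout are assumptions for the example, not existing tooling.

```go
// Sketch: periodically copy the snapshot and WAL out of the live data
// directory with `etcdctl backup` and ship the result to a known location.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	dataDir := "/var/lib/etcd"        // assumed live etcd data directory
	backupRoot := "/mnt/backups/etcd" // assumed known backup location

	ticker := time.NewTicker(15 * time.Minute) // interval is a placeholder
	defer ticker.Stop()

	for t := range ticker.C {
		// Write each backup into a timestamped directory so the newest
		// one can be found later (hypothetical layout).
		backupDir := fmt.Sprintf("%s/%d", backupRoot, t.Unix())
		cmd := exec.Command("etcdctl", "backup",
			"--data-dir", dataDir,
			"--backup-dir", backupDir)
		if out, err := cmd.CombinedOutput(); err != nil {
			log.Printf("backup failed: %v: %s", err, out)
			continue
		}
		log.Printf("backup written to %s", backupDir)
	}
}
```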
I believe that the best way to accomplish this is to use etcd's snapshot backup/restore system: https://github.com/coreos/etcd/blob/master/Documentation/04_to_2_snapshot_migration.md
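On the restore side, a recovering node could locate the newest backup and start etcd from it as a single-node cluster with `--force-new-cluster`, which discards the old peer membership. This is a minimal sketch assuming the timestamped layout from the previous example; `latestBackup` is a hypothetical helper, not part of etcd.

```go
// Sketch: recover a failed cluster from the most recent backup directory.
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"path/filepath"
	"sort"
)

// latestBackup picks the newest timestamped backup directory (hypothetical layout).
func latestBackup(root string) (string, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return "", err
	}
	var dirs []string
	for _, e := range entries {
		if e.IsDir() {
			dirs = append(dirs, e.Name())
		}
	}
	if len(dirs) == 0 {
		return "", fmt.Errorf("no backups under %s", root)
	}
	sort.Strings(dirs) // same-width Unix-timestamp names sort chronologically
	return filepath.Join(root, dirs[len(dirs)-1]), nil
}

func main() {
	backup, err := latestBackup("/mnt/backups/etcd")
	if err != nil {
		log.Fatal(err)
	}
	// Start etcd from the backup as a one-node cluster; --force-new-cluster
	// drops the old membership so the node does not wait for dead peers.
	cmd := exec.Command("etcd",
		"--data-dir", backup,
		"--force-new-cluster")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	log.Fatal(cmd.Run())
}
```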