In the event of a catastrophic etcd cluster failure, etcd should be able to restart itself and initialize into a previous known-good state.
Cluster failure happens when all of the nodes in an etcd cluster are terminated.
Currently, when a cluster fails, the first node to recover re-initializes the discovery process with the etcd-discovery service, but it does not recover the data.
What we want is for one or more nodes in a cluster to ship WAL logs (and perhaps full backups as well) to a known location at periodic intervals. Then, when a cluster fails, the recovering node should fetch the last successful backup and import the data from it. A sketch of what the shipping side could look like follows below.
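As a rough illustration of the shipping side, here is a minimal sketch that periodically invokes `etcdctl backup` and writes the result to a known backup location. The paths, the interval, and the timestamped directory layout are assumptions for the example, not existing tooling.

```go
// Sketch: periodically copy the snapshot and WAL out of the live data
// directory with `etcdctl backup` and ship the result to a known location.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	dataDir := "/var/lib/etcd"        // assumed live etcd data directory
	backupRoot := "/mnt/backups/etcd" // assumed known backup location

	ticker := time.NewTicker(15 * time.Minute) // interval is a placeholder
	defer ticker.Stop()

	for t := range ticker.C {
		// Write each backup into a timestamped directory so the newest
		// one can be found later (hypothetical layout).
		backupDir := fmt.Sprintf("%s/%d", backupRoot, t.Unix())
		cmd := exec.Command("etcdctl", "backup",
			"--data-dir", dataDir,
			"--backup-dir", backupDir)
		if out, err := cmd.CombinedOutput(); err != nil {
			log.Printf("backup failed: %v: %s", err, out)
			continue
		}
		log.Printf("backup written to %s", backupDir)
	}
}
```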
I believe that the best way to accomplish this is to use etcd's snapshot backup/restore system: https://github.com/coreos/etcd/blob/master/Documentation/04_to_2_snapshot_migration.md
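On the restore side, a recovering node could locate the newest backup and start etcd from it as a single-node cluster with `--force-new-cluster`, which discards the old peer membership. This is a minimal sketch assuming the timestamped layout from the previous example; `latestBackup` is a hypothetical helper, not part of etcd.

```go
// Sketch: recover a failed cluster from the most recent backup directory.
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"path/filepath"
	"sort"
)

// latestBackup picks the newest timestamped backup directory (hypothetical layout).
func latestBackup(root string) (string, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return "", err
	}
	var dirs []string
	for _, e := range entries {
		if e.IsDir() {
			dirs = append(dirs, e.Name())
		}
	}
	if len(dirs) == 0 {
		return "", fmt.Errorf("no backups under %s", root)
	}
	sort.Strings(dirs) // same-width Unix-timestamp names sort chronologically
	return filepath.Join(root, dirs[len(dirs)-1]), nil
}

func main() {
	backup, err := latestBackup("/mnt/backups/etcd")
	if err != nil {
		log.Fatal(err)
	}
	// Start etcd from the backup as a one-node cluster; --force-new-cluster
	// drops the old membership so the node does not wait for dead peers.
	cmd := exec.Command("etcd",
		"--data-dir", backup,
		"--force-new-cluster")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	log.Fatal(cmd.Run())
}
```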