etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Offline member addition/removal for restoration after quorum loss #17638

Open tjungblu opened 7 months ago

tjungblu commented 7 months ago

What would you like to be added?

We would like to reduce the MTTR on clusters that suffer from irreparable quorum loss. To give an example: two out of three members are gone for good, as in the classic story where somebody came to your datacenter with an axe, smashed the actual servers, and they are not recoverable by any means. Importantly, they are also not expected to come back anytime soon, i.e. this is not a temporary network partition scenario.

Mind you, lost quorum also means that etcdctl will not work anymore, so the restoration procedure in our docs is mostly moot. It's also unlikely that a backup snapshot exists that is more recent than the current state in the dataDir.

While --force-new-cluster is an option in a three-member cluster with one member left, it's a bit cumbersome to reconfigure when you run with static pods. This gets even more annoying with a five-member setup where three are gone: which of the remaining two do you choose to continue with? It's not easy to figure out which members are on the latest revision with the existing tooling.
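For reference, the current workaround on the last surviving member looks roughly like the following; everything apart from --force-new-cluster (names, URLs, paths) is illustrative and has to match the member's existing configuration:

# stop the etcd process (or static pod) on the surviving member first, then restart it with:
etcd --force-new-cluster \
  --name master-1 \
  --data-dir /var/lib/etcd \
  --initial-advertise-peer-urls https://master-1:2380 \
  --listen-peer-urls https://master-1:2380 \
  --advertise-client-urls https://master-1:2379 \
  --listen-client-urls https://master-1:2379
# once it is healthy again as a single-member cluster, new members can be re-added via etcdctl member add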

To aid recovery, I would like to propose the following new etcdutl commands for querying and manipulating the existing dataDir, in correspondence to what we currently have in etcdctl (a usage sketch follows the list):

  1. member list - dumps the current membership store in the supplied format (simple, table, json, yaml)
  2. member remove <member id(s)> - similar to force-new-cluster [1], this rewrites the current membership store by filtering out the supplied member id(s)
  3. member add <member name> --peer-urls <peer urls> - adds the given member to the current membership store
  4. member promote <member name> - promotes a learner (from https://github.com/etcd-io/etcd/discussions/17794)
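A rough sketch of what the invocations could look like, mirroring the etcdctl counterparts (member names, IDs, URLs and the --data-dir flag below are illustrative; the exact flags would be settled during implementation):

# all of these operate directly on the (stopped) member's data directory
etcdutl member list --data-dir /var/lib/etcd -w table
etcdutl member remove 8e9e05c52164694d --data-dir /var/lib/etcd
etcdutl member add master-4 --peer-urls=https://master-4:2380 --data-dir /var/lib/etcd
etcdutl member promote master-4 --data-dir /var/lib/etcd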

Reading revisions is implemented in etcdutl snapshot status already.

As with etcdutl defrag, those commands are only ever intended to be run when etcd is not running; remove and add should be considered unsafe. We need to consider the impact on the current membership storage migration, but the command line needs to be backward compatible with both stores anyway.

[1] https://github.com/etcd-io/etcd/blob/9359aef3e3dd39b7bbf57cab4b6899a238af3144/server/etcdserver/bootstrap.go#L568-L571

Why is this needed?

Currently there is no way to manipulate the cluster membership without a live cluster and quorum.

ahrtr commented 7 months ago

> It's not easy to figure out which members are on the latest revision with the existing tooling.

Does this meet your requirement?

$ ./bin/etcdutl snapshot status ~/box/open_source/etcd/data/k8s_1.21.5/db  -w table
+----------+----------+------------+------------+---------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE | VERSION |
+----------+----------+------------+------------+---------+
| 30089a77 |  2805430 |       1296 |     7.5 MB |         |
+----------+----------+------------+------------+---------+
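In the scenario above this would mean running the command against the db file of each surviving member and picking the one with the highest revision, e.g. with the default data dir layout:

$ ./bin/etcdutl snapshot status /var/lib/etcd/member/snap/db -w table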

> To aid recovery, I would like to propose new commands to etcdutl querying and manipulating the existing dataDir, in correspondence to what we currently have in etcdctl:

You can specify all the new members when executing the etcdutl snapshot restore command, so you don't necessarily have to add/remove members on an offline db file.
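A minimal sketch (member names, URLs, token and target directory are illustrative); it would be run once per new member with that member's own --name, --initial-advertise-peer-urls and --data-dir:

$ etcdutl snapshot restore /var/lib/etcd/member/snap/db \
    --skip-hash-check \
    --name master-1 \
    --initial-cluster master-1=https://master-1:2380,master-2=https://master-2:2380,master-3=https://master-3:2380 \
    --initial-cluster-token restored-cluster \
    --initial-advertise-peer-urls https://master-1:2380 \
    --data-dir /var/lib/etcd-restored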

tjungblu commented 7 months ago

good catch, let me try to restore my cluster with those commands. I'll update the docs then :)

tjungblu commented 7 months ago

@ahrtr do I understand the code in the etcdutl snapshot restore correctly that it will not attempt to replay the WAL?

Just a more concrete example:

[core@master-3 ~]$ sudo ls -l /var/lib/etcd/member/snap
total 92248
-rw-r--r--. 1 root root     35459 Mar 21 16:40 0000000000000012-00000000000629bf.snap
-rw-r--r--. 1 root root     37687 Mar 22 09:13 0000000000000014-00000000000650d0.snap
-rw-r--r--. 1 root root     37687 Mar 22 09:24 0000000000000014-00000000000677e1.snap
-rw-r--r--. 1 root root     37687 Mar 25 10:03 0000000000000015-0000000000083d6c.snap
-rw-r--r--. 1 root root     39918 Mar 25 10:09 0000000000000016-000000000008647d.snap
-rw-------. 1 root root 108752896 Mar 25 10:46 db

[core@master-3 ~]$ sudo ls -l /var/lib/etcd/member/wal
total 375044
-rw-------. 1 root root 64000000 Mar 25 10:46 0.tmp
-rw-------. 1 root root 64010344 Mar 22 09:13 000000000000001f-0000000000063002.wal
-rw-------. 1 root root 64002200 Mar 22 09:23 0000000000000020-0000000000064f0e.wal
-rw-------. 1 root root 64010672 Mar 22 09:35 0000000000000021-00000000000674b2.wal
-rw-------. 1 root root 64015728 Mar 25 10:11 0000000000000022-0000000000068f8b.wal
-rw-------. 1 root root 64000000 Mar 25 10:46 0000000000000023-0000000000086917.wal

If I attempt to restore using ./etcdutl snapshot restore /var/lib/etcd/member/snap/db --data-dir /tmp/restored_datadir --skip-hash-check , it will ignore all the data in the WAL directory?

ahrtr commented 7 months ago

> that it will not attempt to replay the WAL?

Correct, I believe so. The etcdutl snapshot command only reads the db or v3 snapshot file.

tjungblu commented 7 months ago

That's a bummer, because that's exactly what you would need when you just have one last member running.

I've updated the docs to give them a bit more structure: https://github.com/etcd-io/website/pull/818 - also explaining https://github.com/kubernetes/kubernetes/issues/118501 along the way.