etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

ETCD fails to start after performing alarm list operation and then power off/on #14382

Closed tobgu closed 2 years ago

tobgu commented 2 years ago

What happened?

With a cluster of three nodes set up using https://github.com/bitnami/charts/blob/master/bitnami/etcd/README.md, it was noticed that the ETCD nodes failed to start after a forced reboot of the underlying worker nodes. A graceful shutdown does not result in this issue.

The logs indicated a mismatch in the raft log index between the v2 *.snap files and the v3 db file where the index of the snap files was higher than that of the v3 db file causing ETCD to look for a snap.db file that did not exist (see logs).

The index of the snap file was derived from the file name (e.g. 0000000000000017-0000000000124f8c.snap), while the consistent_index of the v3 db was extracted using bbolt: `bbolt get db meta consistent_index | hexdump` => 0xb4903.

So far the issue looked very much like what is described in #11949. The "fix" described in that issue to get the cluster up and running again, removing/moving the *.snap files, also worked here.

Worth mentioning: this cluster had not had any writes to it for several weeks ahead of the reboot; the data in it is mostly read. Performing a proper write to the cluster sets the consistent_index of the v3 DB to an up-to-date value of the raft index.

After some investigation into why this index difference between the snapshots and the v3 store occurred, it was found that the health check executed regularly by Kubernetes was the reason for the drift.

The health and readiness check regularly executes etcdctl endpoint health to determine whether the cluster is healthy. In ETCD 3.4 this was a simple GET on the health key, but since https://github.com/etcd-io/etcd/pull/12150 it also checks that the alarm list is empty. For some reason, listing the alarms also triggers a write/apply (see attached logs), and this apply is only applied to the v2 store, not the v3 store. This causes the index in the v2 store to drift from the v3 store until a proper write is performed. I have not dug into why the write is performed or why it is missing from the v3 store.

The behaviour is only present in this form in 3.5 since the health check in 3.4 does not include listing the alarms.

The problem is easy to reproduce locally. See description.

What did you expect to happen?

I would always expect ETCD to be able to start properly regardless of how the shutdown was done.

How can we reproduce it (as minimally and precisely as possible)?

Locally:

```console
# Use a very low snapshot count to trigger the issue more quickly
$ ./etcd --snapshot-count=5 --log-level=debug

# In a different terminal, repeat six times to trigger creation of a snapshot file
# (you'll see a `triggering snapshot` log message in the server log when this happens)
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health

# Kill the etcd server process
$ kill -9 <etcd PID>

# Try to start ETCD again; this will fail with the attached log message
$ ./etcd --snapshot-count=5 --log-level=debug
```

Anything else we need to know?

No response

Etcd version (please run commands below)

```console
$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: darwin/amd64
$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5
```

Etcd configuration (command line flags or environment variables)

See steps to reproduce

Etcd debug information (please run commands below; feel free to obfuscate the IP address or FQDN in the output)

```console
See steps to reproduce
```

Relevant log output

# Log printed upon startup failure
{"level":"info","ts":"2022-07-23T18:21:35.360Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/bitnami/etcd/data/member/snap/db","took":"174.548µs"}
{"level":"info","ts":"2022-07-23T18:21:36.259Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":1200012,"snapshot-size":"25 kB"}
{"level":"warn","ts":"2022-07-23T18:21:36.260Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":1200012,"snapshot-file-path":"/bitnami/etcd/data/member/snap/0000000000124f8c.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-07-23T18:21:36.261Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000506300, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000010fa0, 0x1234726, 0x2a, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0xc0000460ca, 0xd, 0x0, 0x0, 0x0, 0x0, 0xc0005150e0, 0x2, 0x2, 0xc00053c480, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000104000, 0xc000104600, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000104000, 0x12089be, 0x6, 0xc0004b7201, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f

# Server debug logs when running `etcdctl endpoint health`
{"level":"debug","ts":"2022-08-25T15:21:32.425+0200","caller":"v3rpc/interceptor.go:182","msg":"request stats","start time":"2022-08-25T15:21:32.425+0200","time spent":"114.504µs","remote":"127.0.0.1:51728","response type":"/etcdserverpb.KV/Range","request count":0,"request size":8,"response count":0,"response size":28,"request content":"key:\"health\" "}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2122","msg":"Applying entries","num-entries":1}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2125","msg":"Applying entry","index":70,"term":5,"type":"EntryNormal"}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2184","msg":"apply entry normal","consistent-index":60,"entry-index":70,"should-applyV3":true}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2211","msg":"applyEntryNormal","raftReq":"header:<ID:7587864799024601626 > alarm:<> "}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"v3rpc/interceptor.go:182","msg":"request stats","start time":"2022-08-25T15:21:32.426+0200","time spent":"241.967µs","remote":"127.0.0.1:51728","response type":"/etcdserverpb.Maintenance/Alarm","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"info","ts":"2022-08-25T15:21:32.428+0200","caller":"zapgrpc/zapgrpc.go:174","msg":"[transport] transport: loopyWriter.run returning. connection error: desc = \"transport is closing\""}
serathius commented 2 years ago

Easily reproducible with this script:

```shell
#!/bin/bash
./bin/etcd --snapshot-count=5 &
sleep 1  # give etcd a moment to start serving
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
kill -9 %1
./bin/etcd --snapshot-count=5
```
serathius commented 2 years ago

Really great analysis and repro! @tobgu would you be willing to send PR with a fix?

tobgu commented 2 years ago

I would have to spend time trying to understand the V2/V3 sync logic for that and why it is failing in this case. If nobody else knows of a quick solution/fix to this I can do that but I was hoping that someone well versed in this area would pick it up. ;-)

serathius commented 2 years ago

Heh, if only I could say that I understand it better. From what I investigated, the problem is due to the consistent index (CI) not being saved when applying an Alarm entry. I managed to confirm that this behavior was introduced between v3.5.2 and v3.5.3, meaning this is a bug in the fix for the data inconsistency issue :(.

As a result, the snapshot index is greater than the CI, which during etcd bootstrap is interpreted as etcd having crashed while receiving a snapshot from the leader. Etcd panics because it assumes there should be a v3 snapshot received from the leader.

serathius commented 2 years ago

Ok, it looks like v3.5.3 assumes that performing applyV3 without an error means the consistency index was committed. However, that's not true in the case of Alarms: applying them doesn't touch the db, so no transactions are opened, and opening a transaction is what would trigger the backend hooks.

Backend hooks bad, again.

serathius commented 2 years ago

Some more details: the issue is triggered when etcd crashes after a snapshot that was followed only by Alarm entries. It causes a transitory bad state of the db that prevents etcd from recovering after a crash. It's transitory because the db will be fixed by any WAL entry that is not an Alarm.

serathius commented 2 years ago

Impact: Low as the issue should be extremely rare.

It requires:

Still, it brings up the issue of how incomprehensible the etcd apply code is. I think this will be the third time we have tried to fix just this part of the code in v3.5.X.

tobgu commented 2 years ago

I'm not sure I agree with the conclusion that this is an extremely rare case.

Given how the liveness and readiness probes are set up in the Bitnami Helm chart referred to above, all that is needed for this to happen is to let time pass without any writes to the DB. Once enough time has passed for a snapshot to have been written (a couple of days) and the machine restarts, ETCD will not come up again. The restart could happen because of power outages, OS updates, fat fingers, whatnot...

My current workaround is to not use the default probes in the chart but rather hit the /health HTTP endpoint instead (which doesn't seem to suffer from the same problem).

serathius commented 2 years ago

Given how the liveness and readiness probes are setup in the Bitnami Helm chart referred to above all that is needed for this to happen is to let time pass without any writes to the DB. Once enough time has passed for a snapshot to have been written (a couple of days) and the machine restarts, ETCD will not come up again. The restart could happen because of power outages, OS updates, fat fingers, what not...

Can you link to how Bitnami does healthcheck?

serathius commented 2 years ago

cc @ahrtr @ptabor for their opinion

tobgu commented 2 years ago

Given how the liveness and readiness probes are setup in the Bitnami Helm chart referred to above all that is needed for this to happen is to let time pass without any writes to the DB. Once enough time has passed for a snapshot to have been written (a couple of days) and the machine restarts, ETCD will not come up again. The restart could happen because of power outages, OS updates, fat fingers, what not...

Can you link to how Bitnami does healthcheck?

Sure!

This is how the default liveness and readiness probes are set up (the ones I've now replaced with HTTP probes against /health): https://github.com/bitnami/charts/blob/master/bitnami/etcd/templates/statefulset.yaml#L265-L289

And this is the shell script that is called by the above probes: https://github.com/bitnami/bitnami-docker-etcd/blob/master/3.5/debian-11/rootfs/opt/bitnami/scripts/etcd/healthcheck.sh

ahrtr commented 2 years ago

Thanks @tobgu for raising this issue ( a real issue)!

It turns out to be a regression introduced in 3.5.4 in https://github.com/etcd-io/etcd/pull/13854 (https://github.com/etcd-io/etcd/pull/13908). The alarm list is the only exception that doesn't move consistent_index forward. The reproduction steps are as simple as:

```shell
etcd --snapshot-count=5 &
for i in {1..6}; do etcdctl alarm list; done
kill -9 <etcd_pid>
etcd
```

Short term solution

Solution 1

Lock batch_tx in (*AlarmStore) Get, so that it calls the txPostLockInsideApplyHook and accordingly moves consistent_index forward.

We just need to add something like the code below to (*AlarmStore) Get. It's the simplest change, but it looks ugly, because it doesn't make sense for alarmList to acquire the batchTx lock at all.

```go
tx := s.be.BatchTx()
tx.LockInsideApply() // runs txPostLockInsideApplyHook, which moves consistent_index forward
defer tx.Unlock()
```

Solution 2

Change server.go#L1853-L1855 to something like the code below. I don't know why I did not do this previously. I will think about this more and get back if I recall something new.

```go
newIndex := s.consistIndex.ConsistentIndex()
if newIndex < e.Index {
    s.consistIndex.SetConsistentIndex(e.Index, e.Term)
}
```

Solution 3

Change alarmList to use the linearizableReadLoop, so that it doesn't go through the raft & apply workflow at all. Accordingly, it will not advance the apply index, and the snapshot index will not be advanced either.

Long term solution

Get rid of the OnPreCommitUnsafe added in 3.5.0 and the txPostLockInsideApplyHook/LockInsideApply/LockOutsideApply added in main & 3.5.3 & 3.5.4.

I will take care of the long-term solution.

For short-term, solution 2 above looks the best for now. Anyone feel free to deliver a PR.

ahrtr commented 2 years ago

Is anyone working on this issue?

tobgu commented 2 years ago

I'm not working on this. If no one else picks it up I might be able to find some time for it in a couple of weeks; right now my priorities do not allow it. I'm happy with the current workaround (a change in the health check used) as a short-term solution in terms of stability, but of course I would want this fixed properly in the long term so I don't have to worry about triggering this behaviour in some other way.

ahrtr commented 2 years ago

In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28

wcmjunior commented 2 years ago

In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28

This solution fixed our problem in a production etcd configuration. Just commenting here so more people can follow. Thanks!

kycfeel commented 1 year ago

In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28

@ahrtr You are a life saver!!!!! This solved my issue like magic. Thank you so much.