etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Corrupted WAL and snapshot restoring process #10219

Closed: brk0v closed this issue 4 years ago

brk0v commented 6 years ago

Issue

A node that was offline for more than max(SnapshotCount, DefaultSnapshotCatchUpEntries) entries corrupts its WAL with a bad HardState.Commit number if it is killed right after the HardState was saved to non-volatile storage (failpoint: raftBeforeSaveSnap).
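The failure ordering described above can be modelled as a toy sketch (an assumption for illustration, not etcd's actual code): the HardState carrying the snapshot's Commit index reaches stable storage first, and the raftBeforeSaveSnap failpoint kills the process before the snapshot itself is written.

```go
package main

import "fmt"

// Toy model of the persistence ordering the report describes (illustrative
// only, not etcd's actual types): HardState is fsynced to the WAL *before*
// the snapshot is saved, so a crash at raftBeforeSaveSnap leaves Commit
// pointing past any entry or snapshot actually on disk.
type disk struct {
	commit        uint64 // HardState.Commit persisted in the WAL
	snapshotIndex uint64 // last index covered by a durable snapshot
	lastWALIndex  uint64 // last raft entry present in the WAL
}

func (d *disk) restoreFromSnapshot(snapIndex uint64, crashBeforeSaveSnap bool) {
	d.commit = snapIndex // step 1: HardState hits stable storage
	if crashBeforeSaveSnap {
		return // power failure: the snapshot is never written
	}
	d.snapshotIndex = snapIndex // step 2: snapshot hits stable storage
}

func main() {
	d := &disk{lastWALIndex: 579}
	d.restoreFromSnapshot(961, true) // killed at raftBeforeSaveSnap
	// On restart, Commit exceeds everything durable: the WAL is corrupt.
	fmt.Println(d.commit > d.lastWALIndex && d.commit > d.snapshotIndex) // prints: true
}
```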

Specifics

Version: master
Environment: any (tested on Linux and macOS)

Steps to reproduce

  1. Patch the snapshot defaults to make the issue easier to trigger:
+++ b/etcdserver/server.go
@@ -65,14 +65,14 @@ import (
 )

 const (
-       DefaultSnapshotCount = 100000
+       DefaultSnapshotCount = 10

        // DefaultSnapshotCatchUpEntries is the number of entries for a slow follower
        // to catch-up after compacting the raft storage entries.
        // We expect the follower has a millisecond level latency with the leader.
        // The max throughput is around 10K. Keep a 5K entries is enough for helping
        // follower to catch up.
-       DefaultSnapshotCatchUpEntries uint64 = 5000
+       DefaultSnapshotCatchUpEntries uint64 = 10
  2. Compile with failpoints to emulate a power failure during snapshot restoration:
FAILPOINTS="true" make build
  3. Create a Procfile with the raftBeforeSaveSnap failpoint for the etcd2 node:
etcd1: bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug 
etcd2: GOFAIL_HTTP="127.0.0.1:1111" GOFAIL_FAILPOINTS='go.etcd.io/etcd/etcdserver/raftBeforeSaveSnap=panic("raftBeforeSaveSnap")' bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug 
etcd3: bin/etcd --name infra3 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug
  4. Start the cluster:

    goreman start
  5. Start a write loop:

    for i in {0..100000}; do ./bin/etcdctl put key$i value$i; echo $i; done
  6. Stop the etcd2 node for "maintenance":

    goreman run stop etcd2
  7. Start the etcd2 node after more than 10 entries have been written to the leader, to trigger a snapshot restore:

    sleep 10 && goreman run start etcd2
  8. You should get a failpoint panic, which emulates a power failure while restoring from a snapshot.

From now on, the WAL on the etcd2 node is corrupted: it was saved with a HardState entry whose Commit index comes from the snapshot, but the snapshot itself was never saved to the WAL or to disk.

  9. Run the etcd2 node manually without the failpoint; it fails with an error:
    bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug

Error:

panic: 91bc3c398fb3c146 state.commit 961 is out of range [572, 579]

goroutine 1 [running]:
go.etcd.io/etcd/vendor/go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001cc160, 0x0, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/vendor/go.uber.org/zap/zapcore/entry.go:229 +0x515
go.etcd.io/etcd/vendor/go.uber.org/zap.(*SugaredLogger).log(0xc0001ce118, 0x4, 0x1d063df, 0x2b, 0xc000643f40, 0x4, 0x4, 0x0, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/vendor/go.uber.org/zap/sugar.go:234 +0xf6
go.etcd.io/etcd/vendor/go.uber.org/zap.(*SugaredLogger).Panicf(0xc0001ce118, 0x1d063df, 0x2b, 0xc000643f40, 0x4, 0x4)
        /Users/brk0v/go/src/go.etcd.io/etcd/vendor/go.uber.org/zap/sugar.go:159 +0x79
go.etcd.io/etcd/pkg/logutil.(*zapRaftLogger).Panicf(0xc0000601f0, 0x1d063df, 0x2b, 0xc000643f40, 0x4, 0x4)
        /Users/brk0v/go/src/go.etcd.io/etcd/pkg/logutil/zap_raft.go:96 +0x61
go.etcd.io/etcd/raft.(*raft).loadState(0xc0000f8000, 0x2, 0x8211f1d0f64f3269, 0x3c1, 0x0, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/raft/raft.go:1459 +0x1b1
go.etcd.io/etcd/raft.newRaft(0xc000450360, 0x4)
        /Users/brk0v/go/src/go.etcd.io/etcd/raft/raft.go:368 +0xd52
go.etcd.io/etcd/raft.RestartNode(0xc000450360, 0x1e98e20, 0xc0000601f0)
        /Users/brk0v/go/src/go.etcd.io/etcd/raft/node.go:232 +0x43
go.etcd.io/etcd/etcdserver.restartNode(0x7fff5fbff840, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc0001aae80, 0x1, 0x1, 0xc0001ab180, ...)
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdserver/raft.go:548 +0x7ad
go.etcd.io/etcd/etcdserver.NewServer(0x7fff5fbff840, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc0001aae80, 0x1, 0x1, 0xc0001ab180, ...)
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdserver/server.go:464 +0x3038
go.etcd.io/etcd/embed.StartEtcd(0xc0002b4000, 0xc0002b4500, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/embed/etcd.go:203 +0x8ed
go.etcd.io/etcd/etcdmain.startEtcd(0xc0002b4000, 0x1cde6b6, 0x6, 0xc0001ab401, 0x2)
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdmain/etcd.go:304 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdmain/etcd.go:146 +0x2f37
go.etcd.io/etcd/etcdmain.Main()
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdmain/main.go:47 +0x37
main.main()
        /Users/brk0v/go/src/go.etcd.io/etcd/main.go:28 +0x20
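The panic above comes from a range check on the restored HardState. A simplified sketch of that check (an assumption based on the panic message; the real check lives in raft's loadState): the Commit index must fall inside the index range actually available in raft storage.

```go
package main

import "fmt"

// validateCommit sketches the consistency check that fires in the stack
// trace above (simplified; not the actual raft implementation): a restored
// HardState.Commit must lie within the [lo, hi] index range that raft
// storage can actually serve.
func validateCommit(commit, lo, hi uint64) error {
	if commit < lo || commit > hi {
		return fmt.Errorf("state.commit %d is out of range [%d, %d]", commit, lo, hi)
	}
	return nil
}

func main() {
	// With the snapshot lost, the WAL only holds entries [572, 579], yet the
	// persisted HardState claims Commit = 961 — exactly the panic above.
	fmt.Println(validateCommit(961, 572, 579))
}
```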
gyuho commented 6 years ago

Is this reproducible with current master?

brk0v commented 6 years ago

Yes, you can reproduce this with the latest commit 583763261f1c843e07c1bf7fea5fb4cfb684fe87.

brk0v commented 5 years ago

I'd appreciate it if someone could confirm this and share ideas on how it could be fixed. I tried rearranging the WAL data-saving steps with some code changes, but it seems we need to do all the saves in one transaction, as dgraph does: https://github.com/dgraph-io/dgraph/blob/master/raftwal/storage.go#L574
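The single-transaction idea mentioned above can be sketched with a toy store (hypothetical types, not etcd's or dgraph's API): the snapshot index and HardState are staged together and made durable in one atomic step, so a crash can never persist the HardState without the snapshot it references.

```go
package main

import "fmt"

// batch stages a snapshot index and the HardState.Commit that references it.
type batch struct {
	snapshotIndex uint64
	commit        uint64
}

// store models durable state; both fields must stay mutually consistent.
type store struct {
	snapshotIndex uint64
	commit        uint64
}

// apply commits the whole batch atomically: either every field lands on
// "disk" or none does, unlike the two-step save that caused the corruption.
func (s *store) apply(b batch, crash bool) {
	if crash {
		return // the staged batch is simply lost; s is untouched
	}
	s.snapshotIndex = b.snapshotIndex
	s.commit = b.commit
}

func main() {
	s := &store{commit: 579, snapshotIndex: 570}
	s.apply(batch{snapshotIndex: 961, commit: 961}, true) // crash mid-save
	// The old, mutually consistent state survives the crash.
	fmt.Println(s.commit, s.snapshotIndex) // prints: 579 570
}
```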

xiang90 commented 5 years ago

@brk0v

I am aware of this issue. Would you like to spend some time to get it fixed?

brk0v commented 5 years ago

Sure thing, but if you already have an idea to check or a direction to dig into, I could try that first.

brk0v commented 5 years ago

I've added code that passes my local failpoint tests. Could you please check whether this approach makes sense? Thank you!

P.S. Unit tests are broken because of interface changes in the etcdserver.Storage code.

brk0v commented 5 years ago

Tests now pass.

To summarise what was done:

xiang90 commented 5 years ago

@brk0v Thanks. I will give this a careful look over the next couple of weeks.

jingyih commented 5 years ago

cc @jpbetz

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

jpbetz commented 4 years ago

I'm working on rebasing #10356 on master and making a couple adjustments to it (retaining the commits from @brk0v). I'll send out a PR shortly.