@lyddragon Did you upgrade from a previous version, or is this a fresh cluster from v3.3.0-rc.0 where this happened? Haven't been able to reproduce yet.
No upgrade, but we used an old snapshot to restore.
The snapshot is from a v3.3.0-rc.0 server, and you downloaded it using v3.3.0-rc.0 etcdctl, right?
Finally reproduced. I will try to fix shortly. Auth was not related. Seems like snapshot restore has some bugs.
@lyddragon basically, we are doing
Just curious, what is your use case on this?
If you want to destroy member A completely, I would remove it and then add it back to the cluster.
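For reference, a minimal sketch of that remove-then-add flow using the clientv3 Cluster API (not from this issue; the endpoint, member ID, and peer URL below are placeholders):

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// connect to the surviving members (placeholder endpoint)
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// remove the dead member by its ID (placeholder value)
	if _, err := cli.MemberRemove(ctx, 0xca50e9357181d758); err != nil {
		log.Fatal(err)
	}

	// add it back with its peer URL (placeholder); the member is then started
	// with an empty data dir and --initial-cluster-state=existing
	if _, err := cli.MemberAdd(ctx, []string{"http://127.0.0.1:12380"}); err != nil {
		log.Fatal(err)
	}
}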
Add failed; see my other issue:
"v3.3.0-rc.0 endpoint health --cluster with auth requires password input twice #9094"
@xiang90 I double-checked our code and don't think it's a bug.
Reproducible e2e test case:
// TestCtlV3SnapshotRestoreMultiCluster ensures that restoring one member from snapshot
// does not panic when rejoining the cluster (fix https://github.com/coreos/etcd/issues/9096).
func TestCtlV3SnapshotRestoreMultiCluster(t *testing.T) {
	testCtl(t, snapshotRestoreMultiClusterTest, withCfg(configNoTLS), withQuorum())
}

func snapshotRestoreMultiClusterTest(cx ctlCtx) {
	if err := ctlV3Put(cx, "foo", "bar", ""); err != nil {
		cx.t.Fatalf("ctlV3Put error (%v)", err)
	}

	fpath := filepath.Join(os.TempDir(), "test.snapshot")
	defer os.RemoveAll(fpath)
	if err := ctlV3SnapshotSave(cx, fpath); err != nil {
		cx.t.Fatalf("snapshotTest ctlV3SnapshotSave error (%v)", err)
	}

	// shut down first member, restore, restart from snapshot
	if err := cx.epc.procs[0].Close(); err != nil {
		cx.t.Fatalf("failed to close (%v)", err)
	}
	ep, ok := cx.epc.procs[0].(*etcdServerProcess)
	if !ok {
		cx.t.Fatalf("expected *etcdServerProcess")
	}

	newDataDir := filepath.Join(os.TempDir(), "snap.etcd")
	os.RemoveAll(newDataDir)
	defer os.RemoveAll(newDataDir)
	ep.cfg.dataDirPath = newDataDir

	// restore the snapshot into the new data directory
	if err := spawnWithExpect(append(
		cx.PrefixArgs(),
		"snapshot", "restore", fpath,
		"--name", ep.cfg.name,
		"--initial-cluster", ep.cfg.initialCluster,
		"--initial-cluster-token", ep.cfg.initialToken,
		"--initial-advertise-peer-urls", ep.cfg.purl.String(),
		"--data-dir", newDataDir),
		"membership: added member"); err != nil {
		cx.t.Fatalf("failed to restore (%v)", err)
	}

	// point the member at the restored data directory and rejoin as "existing"
	for i := range ep.cfg.args {
		if ep.cfg.args[i] == "--data-dir" {
			ep.cfg.args[i+1] = newDataDir
			break
		}
	}
	ep.cfg.args = append(ep.cfg.args, "--initial-cluster-state", "existing")

	var err error
	ep.proc, err = spawnCmd(append([]string{ep.cfg.execPath}, ep.cfg.args...))
	if err != nil {
		cx.t.Fatalf("failed to spawn etcd (%v)", err)
	}

	// will error "read /dev/ptmx: input/output error" if process panicked
	if err = ep.waitReady(); err != nil {
		cx.t.Fatalf("failed to start from snapshot restore (%v)", err)
	}
}
../bin/etcd-4461: 2018-01-07 09:19:50.258428 I | raft: ca50e9357181d758 became follower at term 2
../bin/etcd-4461: 2018-01-07 09:19:50.258446 C | raft: tocommit(9) is out of range [lastIndex(3)]. Was the raft log corrupted, truncated, or lost?
../bin/etcd-4402: 2018-01-07 09:19:50.258503 I | rafthttp: established a TCP streaming connection with peer ca50e9357181d758 (stream MsgApp v2 reader)
../bin/etcd-4461: panic: tocommit(9) is out of range [lastIndex(3)]. Was the raft log corrupted, truncated, or lost?
../bin/etcd-4461:
../bin/etcd-4461: goroutine 116 [running]:
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420161fa0, 0x102290f, 0x5d, 0xc421070140, 0x2, 0x2)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x16d
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc4200e00e0, 0x9)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/log.go:191 +0x15c
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc4201fe200, 0x8, 0xca50e9357181d758, 0x5ac8aa22f1eb4c8f, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1195 +0x54
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.stepFollower(0xc4201fe200, 0x8, 0xca50e9357181d758, 0x5ac8aa22f1eb4c8f, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1141 +0x439
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc4201fe200, 0x8, 0xca50e9357181d758, 0x5ac8aa22f1eb4c8f, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:869 +0x1465
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*node).run(0xc4201e2540, 0xc4201fe200)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:323 +0x113e
../bin/etcd-4461: created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.RestartNode
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:223 +0x321
3.2, 3.3, and master all panic, and it's expected.
The restored member joins the cluster with commit index 3 (equal to the number of nodes in the cluster), because the snapshot file carries no information about the revision or any other raft fields from the previous cluster. So if the other peers have advanced their indexes and the newly joined member does not become the leader, it will be asked to commit a future index and thus panics.
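To make the failure mode concrete, here is a self-contained sketch of the invariant that fires, modeled on raftLog.commitTo in raft/log.go (the struct below is simplified, not the real raft type): the follower is told to commit index 9 while its restored log only reaches index 3.

package main

import "fmt"

// simplified stand-in for the real raftLog; only what the check needs
type raftLog struct {
	committed uint64
	lastIndex uint64
}

// commitTo mirrors the invariant in raft/log.go: a node may only be told to
// commit an index it actually has in its log.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex < tocommit {
			panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?", tocommit, l.lastIndex))
		}
		l.committed = tocommit
	}
}

func main() {
	l := &raftLog{committed: 3, lastIndex: 3} // restored member: only 3 entries
	l.commitTo(9)                             // leader heartbeat says commit 9 -> panic
}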
I think we just need to document this clearly: snapshot restore only supports creating a fresh cluster, not adding a new member to an existing cluster.
@lyddragon Closing because this is expected.
etcdctl snapshot restore creates new etcd data directories; all members should restore using the same snapshot. Restoring overwrites some snapshot metadata (specifically, the member ID and cluster ID); the member loses its former identity. This metadata overwrite prevents the new member from inadvertently joining an existing cluster. Therefore, in order to start a cluster from a snapshot, the restore must start a new logical cluster.
https://github.com/coreos/etcd/blob/master/Documentation/op-guide/recovery.md#restoring-a-cluster
Snapshot restore creates a fresh cluster, so a restored member cannot join the existing cluster unless all other members are restored from the same snapshot file.
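To illustrate, a hedged sketch (not from this issue) of the documented recovery flow where every member is restored from the same snapshot file; the member names, host URLs, token, and data directories are placeholders, and in practice each restore runs on that member's own machine:

package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	snapshot := "snapshot.db"
	initialCluster := "m1=http://host1:2380,m2=http://host2:2380,m3=http://host3:2380"
	peerURLs := map[string]string{
		"m1": "http://host1:2380",
		"m2": "http://host2:2380",
		"m3": "http://host3:2380",
	}
	for name, purl := range peerURLs {
		// every member must be restored from the same snapshot file
		cmd := exec.Command("etcdctl", "snapshot", "restore", snapshot,
			"--name", name,
			"--initial-cluster", initialCluster,
			"--initial-cluster-token", "etcd-cluster-1",
			"--initial-advertise-peer-urls", purl,
			"--data-dir", name+".etcd")
		cmd.Env = append(os.Environ(), "ETCDCTL_API=3")
		if out, err := cmd.CombinedOutput(); err != nil {
			log.Fatalf("restore %s failed: %v\n%s", name, err, out)
		}
	}
	// each member is then started against its restored data directory;
	// together they form a new logical cluster, not members of the old one.
}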
https://github.com/coreos/etcd/issues/9094#issuecomment-355221086 https://github.com/coreos/etcd/issues/9094#issuecomment-355222565 https://github.com/coreos/etcd/issues/9094#issuecomment-355224402