brk0v closed this issue 4 years ago
Is this reproducible with current master?
Yes, you can reproduce this with the latest commit 583763261f1c843e07c1bf7fea5fb4cfb684fe87.
I'd appreciate it if someone could confirm this and share ideas on how it could be fixed. I tried rearranging the data-saving steps to the WAL with some code changes, but it seems we need to do all saves in one transaction, as dgraph does: https://github.com/dgraph-io/dgraph/blob/master/raftwal/storage.go#L574
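For illustration, here is a minimal sketch of that single-transaction idea. The `walBatch` type and its methods are made-up names, not etcd's or dgraph's actual API, and the `go.etcd.io/etcd` import paths assume the post-rename tree; only the ordering is the point:

```go
package walsketch

import (
	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// walBatch (hypothetical) accumulates everything from one raft.Ready and
// persists it with a single fsync'd commit, so a crash can never persist
// the HardState without the snapshot it refers to.
type walBatch struct {
	snap raftpb.Snapshot
	ents []raftpb.Entry
	hs   raftpb.HardState
}

func (b *walBatch) SaveSnap(s raftpb.Snapshot)        { b.snap = s }
func (b *walBatch) SaveEntries(es []raftpb.Entry)     { b.ents = append(b.ents, es...) }
func (b *walBatch) SaveHardState(hs raftpb.HardState) { b.hs = hs }

// Commit would write and fsync the whole batch atomically (elided here).
func (b *walBatch) Commit() error { return nil }

func saveReady(b *walBatch, rd raft.Ready) error {
	if !raft.IsEmptySnap(rd.Snapshot) {
		b.SaveSnap(rd.Snapshot) // snapshot first, so Commit never outruns it
	}
	b.SaveEntries(rd.Entries)
	if !raft.IsEmptyHardState(rd.HardState) {
		b.SaveHardState(rd.HardState)
	}
	return b.Commit() // one transaction: all or nothing
}
```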
@brk0v
I am aware of this issue. Would you like to spend some time to get it fixed?
Sure thing, but if you already have an idea to check or a direction to dig in, I could try that.
I've added code that passes my local failpoints test. Could you please check whether this approach makes sense? Thank you!
P.S. Unit tests are broken because of interface changes in the `etcdserver.Storage` code.
Tests now pass.
To summarise what was done (a rough interface sketch follows this list):

- `snapshotter.LoadIndex()` to load an arbitrary snapshot;
- `etcdserver.storage.checkWALSnap()` to check that a snapshot can be used to load the raft state from the WAL;
- `etcdserver.storage.Release()` and `etcdserver.storage.Sync()` to provide a safe order of the saving operations;
- `raft.Ready` handling of a non-empty `rd.Snapshot`.
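Roughly, the reworked `etcdserver.Storage` surface implied by that summary could look like the sketch below; the signatures are my assumption, not necessarily the PR's exact code:

```go
package etcdserver

import "go.etcd.io/etcd/raft/raftpb"

// Storage sketch: the method set implied by the summary above.
type Storage interface {
	// Save persists the HardState and entries to the WAL.
	Save(st raftpb.HardState, ents []raftpb.Entry) error
	// SaveSnap persists the snapshot to disk and records it in the WAL.
	SaveSnap(snap raftpb.Snapshot) error
	// Release frees resources (e.g. stale snapshot files) once the given
	// snapshot is durably saved and reflected in the WAL.
	Release(snap raftpb.Snapshot) error
	// Sync flushes outstanding WAL writes to stable storage, pinning the
	// order of the saving operations around a snapshot.
	Sync() error
	Close() error
}
```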
@brk0v Thanks. I will give this a careful look over the next couple of weeks.
cc @jpbetz
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
I'm working on rebasing #10356 on master and making a couple of adjustments to it (retaining the commits from @brk0v). I'll send out a PR shortly.
Issue

A node that was offline for more than `max(SnapshotCount, DefaultSnapshotCatchUpEntries)` corrupts its WAL log with a bad `HardState.Commit` number if it is killed right after the `HardState` was saved to non-volatile storage (failpoint: `raftBeforeSaveSnap`).
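A minimal sketch of that failure window, assuming simplified names and pre-3.5 import paths; this is not the exact etcd code path, only the ordering the issue describes:

```go
package raftsketch

import (
	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/snap"
	"go.etcd.io/etcd/wal"
)

// saveWithWindow shows the ordering behind this issue: the HardState (whose
// Commit already points at the snapshot's index) reaches disk first, so a
// crash between the two writes leaves the WAL referring to data that was
// never persisted.
func saveWithWindow(w *wal.WAL, ss *snap.Snapshotter, rd raft.Ready) error {
	// 1. HardState.Commit = snapshot index is fsync'd to the WAL here.
	if err := w.Save(rd.HardState, rd.Entries); err != nil {
		return err
	}
	// 2. A kill here (failpoint raftBeforeSaveSnap) corrupts the node.
	if !raft.IsEmptySnap(rd.Snapshot) {
		// 3. Never reached after the crash: the snapshot is lost.
		if err := ss.SaveSnap(rd.Snapshot); err != nil {
			return err
		}
	}
	return nil
}
```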
Specific

Version: master
Environment: any (tested on Linux, Mac OS X)
Steps to reproduce

`Procfile` with a failpoint `raftBeforeSaveSnap` for the etcd2 node (see the gofail sketch after these steps):

1. Start the cluster
2. Start the write loop:
3. Stop the etcd2 node for "maintenance":
4. Start the etcd2 node after 10 entries have been written to the master to trigger a snapshot restore:

You should get a failpoint panic that emulates a power failure while restoring from a snapshot.
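For context, etcd failpoints like `raftBeforeSaveSnap` are declared as gofail comments in the source (github.com/etcd-io/gofail). The sketch below only illustrates the mechanism: the surrounding function is hypothetical, and only the comment syntax and the failpoint name come from the issue:

```go
package failsketch

import "fmt"

func saveSnapWithFailpoint() error {
	// gofail: var raftBeforeSaveSnap struct{}
	// The gofail preprocessor turns the comment above into a runtime hook.
	// Armed via the environment, it panics before the snapshot is saved,
	// emulating a power failure at exactly the wrong moment, e.g.:
	//   GOFAIL_FAILPOINTS='raftBeforeSaveSnap=panic("raftBeforeSaveSnap")'
	fmt.Println("saving snapshot")
	return nil
}
```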
From now on, the WAL on the etcd2 node is corrupted: it was saved with a `HardState` entry that contains the `Commit` number from the snapshot, but the snapshot itself was never saved to the WAL and disk.
Error: