Open Starblade42 opened 1 year ago
Right now, we also don't have much information about this scenario, as it seems to be a very rare case. I will check what the issue is here.
Hello! Are there any updates on this?
So, in our case, we used a workaround. We use the `copy` function of etcdbrctl to get the last full backup, and then we restore only from that full backup, which causes no issues. This isn't ideal, but it typically doesn't have a huge impact, especially since there are no real alternatives once etcd quorum is completely lost. We only use this method if the usual full snapshot restore plus all incremental restores fails.
This issue does not occur every time, but it most often occurs on our largest k8s clusters running the largest number of objects. With the settings:
- Full snapshot schedule: 30m
- Delta snapshot period: 1m

we end up with typical delta backups of ~10 MB.
Somewhere around restoring from a full backup with ~15 delta backups is when we see this issue, and it's usually off by about 1 etcd commit.
If we use more frequent settings, the issue doesn't seem to occur (for example, 20-minute full backups and 30-second incrementals).
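For context on what fails here, the sketch below illustrates the kind of strict revision check a restore can perform when replaying a delta snapshot. This is purely illustrative: the names (`DeltaSnapshot`, `apply_delta`) and structure are assumptions, not the actual etcd-backup-restore implementation (which is written in Go).

```python
# Hypothetical sketch of delta snapshot revision verification during restore.
# Names and structure are illustrative, NOT the real etcdbrctl code.

class RestoreError(Exception):
    pass

class DeltaSnapshot:
    def __init__(self, start_revision, last_revision, events):
        self.start_revision = start_revision  # first revision in the delta
        self.last_revision = last_revision    # revision expected after replay
        self.events = events                  # list of (revision, key, value)

def apply_delta(current_revision, snap):
    """Replay one delta snapshot and verify the resulting revision."""
    for rev, _key, _value in snap.events:
        current_revision = rev
    if current_revision != snap.last_revision:
        # A strict equality check like this is where an off-by-one would
        # surface as "snapshot revision verification failed".
        raise RestoreError(
            f"snapshot revision verification failed: expected "
            f"{snap.last_revision}, got {current_revision}"
        )
    return current_revision

# A delta whose replay ends one revision higher than its metadata claims:
snap = DeltaSnapshot(101, 110, [(r, "k", "v") for r in range(101, 112)])
try:
    apply_delta(100, snap)
except RestoreError as e:
    print(e)  # expected 110, got 111
```

With a check like this, even a single extra revision in the replayed events aborts the whole restore, which matches the off-by-one behavior described above.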
Etcdbrctl: snapshot revision verification failed for delta snapshot
Describe the bug: When doing a restore from a full backup and incremental backups, we encountered a failure mid-restore:
This is after restoring ~15 of ~30 incremental backups.
It actually occurred at different places across multiple attempts to restore.
- Full backup schedule: `*/30 * * * *`
- Incremental backup schedule: `60s`
- Storage retention policy: `Exponential`
Expected behavior:
I'd either expect the restoration to complete with the same event revision, or if it's okay that the restored revision is a smidge higher, that it could info-log or warn-log and continue.
Ultimately, I'd expect the restoration to work from these incremental backups.
The main question is: is it okay for the revision in the etcd process we're using for the restore to be slightly higher than the delta snapshot expects? Is that actually a fatal error, or is it expected when we've restored a 6.7 GB full backup and ~15 or so 10 MB incremental snapshots?
If it's expected, can we info/warn and continue, or is this actually fatal?
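If a slightly-higher revision really is benign, the strict check could in principle be relaxed along the lines of the sketch below. Everything here is an assumption for illustration (the tolerance value, the function name, and the warn-and-continue behavior are not etcdbrctl features):

```python
import logging

# Hypothetical tolerant variant of the revision check: warn and continue
# when the restored revision overshoots by a small margin, fail otherwise.
# REVISION_TOLERANCE is an assumption, not an etcdbrctl setting.
REVISION_TOLERANCE = 1

def verify_revision(expected, actual, tolerance=REVISION_TOLERANCE):
    """Return True if the restore may continue, raising only on large drift."""
    if actual == expected:
        return True
    if expected < actual <= expected + tolerance:
        logging.warning(
            "restored revision %d is %d higher than expected %d; continuing",
            actual, actual - expected, expected,
        )
        return True
    raise ValueError(
        f"revision drift too large: expected {expected}, got {actual}"
    )

verify_revision(110, 110)  # exact match: continue silently
verify_revision(110, 111)  # off by one: warn-log and continue
```

Whether such a relaxation is actually safe depends on the consistency guarantees the maintainers intend, which is exactly the question being asked here.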
How To Reproduce (as minimally and precisely as possible):
This is the first time I've seen this, but it's been on an etcd cluster with 3 members, ~750 namespaces, a typical full snapshot size of 6 GB, and incremental snapshots between 10 MB and 20 MB.
After taking etcd offline, deleting the member and WAL directories, and restoring, somewhere in the middle of those incremental backups we end up with an event revision that is 1 higher than expected.
Logs:
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
We are using etcdbrctl mostly as a backup agent, alongside a somewhat hand-rolled etcd quorum managed by systemd. We've been using it successfully for several years, but wanted to improve our mean time to recovery with a snapshotting and restore tool.
I really wanted to use it in Server mode, but that would have required completely refactoring or even rewriting our etcd bootstrap to harmonize with the Server. So it runs as a systemd unit that takes backups, and when we have to restore, we run the restore ourselves after our Prometheus alerts tell us the quorum is bogus.