gardener / etcd-backup-restore

Collection of components to backup and restore the etcd of a Kubernetes cluster.

[BUG] Etcdbrctl restore throws FATAL - results in event revision 1 higher than expected `snapshot revision verification failed for delta snapshot` #583

Open Starblade42 opened 1 year ago

Starblade42 commented 1 year ago

Etcdbrctl snapshot revision verification failed for delta snapshot

Describe the bug: When doing a restore from a full backup and incremental backups, we encountered a failure mid-restore:

FATA[0156] Failed to restore snapshot: snapshot revision verification failed for delta snapshot Incr-13002039293-13002040428-1675782244 : mismatched event revision while applying delta snapshot, expected 13002040428 but applied 13002040429

This is after restoring ~15 of ~30 incremental backups.

It actually occurred at different places across multiple attempts to restore:

FATA[0159] Failed to restore snapshot: snapshot revision verification failed for delta snapshot Incr-13002040429-13002041241-1675782247 : mismatched event revision while applying delta snapshot, expected 13002041241 but applied 13002041242
FATA[0157] Failed to restore snapshot: snapshot revision verification failed for delta snapshot Incr-13002041242-13002042199-1675782251 : mismatched event revision while applying delta snapshot, expected 13002042199 but applied 13002042200

Full backup schedule: */30 * * * *
Incremental backup schedule: 60s
Storage retention policy: Exponential
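
For reference, the snapshotter invocation looks roughly like the following. This is a sketch only: the storage provider, bucket name, and exact flag spellings are assumptions on my part and may differ between etcdbrctl versions.

# Illustrative only; provider, bucket, and flag spellings are approximate.
etcdbrctl snapshot \
  --endpoints=http://127.0.0.1:2379 \
  --storage-provider=S3 \
  --store-container=my-etcd-backups \
  --schedule="*/30 * * * *" \
  --delta-snapshot-period=60s \
  --garbage-collection-policy=Exponential
# --schedule: cron expression for full snapshots (every 30 minutes)
# --delta-snapshot-period: interval between incremental (delta) snapshots
# --garbage-collection-policy: Exponential retention of old snapshots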

Expected behavior:

I'd expect the restoration either to complete with the same event revision or, if a slightly higher restored revision is acceptable, to log it at info or warn level and continue.

Ultimately, I'd expect the restoration to work from these incremental backups.

The main question is: is it okay for the event revision in the etcd process used for the restore to end up slightly higher than the delta snapshot expects? Is that actually a fatal error, or something expected when restoring a 6.7 GB full backup plus ~15 incremental snapshots of ~10 MB each?

If it's expected, can we info/warn and continue, or is this actually fatal?

How To Reproduce (as minimally and precisely as possible):

This is the first time I've seen this. It occurred on an etcd cluster with 3 members, ~750 namespaces, a typical full snapshot size of 6 GB, and incremental snapshots between 10 MB and 20 MB.

After taking etcd offline, deleting the member and WAL directories, and restoring, one of the incremental backups somewhere in the middle ends up applying an event revision that is 1 higher than expected.
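
The restore step is roughly the following; the storage flags here are illustrative, not our literal command, and the data directory matches the path visible in the logs below.

# Illustrative restore sequence; storage flags are approximate.
systemctl stop etcd
# member/ also contains the WAL directory
rm -rf /mnt/ramdisk/etcd/member
etcdbrctl restore \
  --storage-provider=S3 \
  --store-container=my-etcd-backups \
  --data-dir=/mnt/ramdisk/etcd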

Logs:

INFO[0157] Applying delta snapshot Incr-13002037451-13002038085-1675782235  actor=restorer
INFO[0157] Applying delta snapshot Incr-13002038086-13002039292-1675782240  actor=restorer
INFO[0157] Applying delta snapshot Incr-13002039293-13002040428-1675782244  actor=restorer
INFO[0158] Applying delta snapshot Incr-13002040429-13002041241-1675782247  actor=restorer
INFO[0158] Cleanup complete                              actor=restorer
ERRO[0158] Restoration failed.                           actor=restorer
{"level":"info","ts":"2023-02-07T16:30:50.176Z","caller":"etcdserver/server.go:1485","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"8e9e05c52164694d","current-leader-member-id":"8e9e05c52164694d"}
{"level":"info","ts":"2023-02-07T16:30:50.316Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"default","data-dir":"/mnt/ramdisk/etcd","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
{"level":"info","ts":"2023-02-07T16:30:50.316Z","caller":"etcdserver/server.go:1485","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"8e9e05c52164694d","current-leader-member-id":"8e9e05c52164694d"}
{"level":"info","ts":"2023-02-07T16:30:50.316Z","caller":"embed/etcd.go:558","msg":"stopping serving peer traffic","address":"127.0.0.1:11233"}
{"level":"info","ts":"2023-02-07T16:30:50.316Z","caller":"embed/etcd.go:565","msg":"stopped serving peer traffic","address":"127.0.0.1:11233"}
{"level":"info","ts":"2023-02-07T16:30:50.316Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"default","data-dir":"/mnt/ramdisk/etcd","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
FATA[0159] Failed to restore snapshot: snapshot revision verification failed for delta snapshot Incr-13002040429-13002041241-1675782247 : mismatched event revision while applying delta snapshot, expected 13002041241 but applied 13002041242

Anything else we need to know?:

We are using etcdbrctl mostly as a backup agent alongside a hand-rolled etcd quorum managed via systemd. We've been using it successfully for several years, but wanted to improve our mean time to recovery with a snapshotting and restore tool.

I really wanted to use it in Server mode, but that would have required completely refactoring or even rewriting our etcd bootstrap to harmonize with the Server. So it runs as a systemd unit that takes backups, and when a restore is needed, we run the restore ourselves after our Prometheus alerts tell us the quorum is broken.
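
Conceptually, the backup side is just a unit along these lines (a sketch, not our real unit file; paths and flags are approximate):

[Unit]
Description=etcd backup agent via etcdbrctl (illustrative sketch)
After=etcd.service

[Service]
# Same snapshot flags as sketched earlier in the issue.
ExecStart=/usr/local/bin/etcdbrctl snapshot --endpoints=http://127.0.0.1:2379 --storage-provider=S3 --store-container=my-etcd-backups --schedule="*/30 * * * *" --delta-snapshot-period=60s --garbage-collection-policy=Exponential
Restart=always

[Install]
WantedBy=multi-user.target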

ishan16696 commented 1 year ago

Right now we don't have much information about this scenario either, as it seems to be a very rare case. I will look into what the issue is here.

danV111 commented 1 year ago

Hello! Are there any updates on this?

Starblade42 commented 1 year ago

So, in our case, we used a workaround: we use the copy function of etcdbrctl to fetch the last full backup and then restore from that full backup only, which causes no issues. This isn't ideal, but it typically doesn't cause much impact, especially since there is no real alternative once etcd quorum is completely lost. We only use this method when the usual restore of the full snapshot plus all incremental snapshots fails.
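
Roughly, the workaround looks like this. etcdbrctl copy is the subcommand we use, but the exact flag spellings and the Local target below are approximations and may differ by version.

# 1. Copy the backups from the object store to a local store
#    (flag spellings approximate).
etcdbrctl copy \
  --source-storage-provider=S3 \
  --source-store-container=my-etcd-backups \
  --storage-provider=Local \
  --store-container=/var/backups/etcd
# 2. Remove the Incr-* delta snapshots from the local copy so only the latest
#    Full-* snapshot remains.
# 3. Restore from the local store; with no deltas present, only the full
#    snapshot is applied and the delta revision check is never exercised.
etcdbrctl restore \
  --storage-provider=Local \
  --store-container=/var/backups/etcd \
  --data-dir=/mnt/ramdisk/etcd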

This issue does not occur every time, but it often occurs on our largest k8s clusters, the ones running the largest number of objects.

With the settings above, we end up with typical delta backups of ~10 MB.

We usually hit the issue somewhere while restoring a full backup plus roughly 15 delta backups, and the applied revision is usually off by about 1 etcd commit.

If we use more frequent settings, the issue doesn't seem to occur (for example, 20-minute full backups and 30-second incrementals).
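
Concretely, that more aggressive cadence differs from the earlier snapshot sketch only in two flags (same caveat that exact flag names are approximate):

etcdbrctl snapshot \
  --schedule="*/20 * * * *" \
  --delta-snapshot-period=30s
# (remaining flags as in the earlier sketch)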