Open Starblade42 opened 1 year ago
Right now, we also don't have much information about this scenario, as it seems to be a very rare case. I will check what the issue is here.
Hello! Are there any updates on this?
So, in our case, we used a workaround. We use the `copy` function of etcdbrctl to get the last full backup, and then we restore only from that full backup, which causes no issues. This isn't ideal, but it typically doesn't have a huge impact, especially since there are no real alternatives once etcd quorum is completely lost. We only use this method if the usual full snapshot restore plus all incremental restores fails.
This issue does not occur every time, but it most often occurs on our largest k8s clusters running the largest number of objects. With the settings:
- Full snapshot schedule: 30m
- Delta snapshot period: 1m

we end up with typical delta backups of ~10 MB.
Somewhere around restoring from a full backup with ~15 delta backups is when we see this issue, and it's usually off by about 1 etcd commit.
If we use more frequent settings, the issue doesn't seem to occur (for example, 20-minute full backups and 30-second incrementals).
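For context on what fails here, the sketch below illustrates the kind of strict revision check a restore can perform when replaying a delta snapshot. This is purely illustrative: the names (`DeltaSnapshot`, `apply_delta`) and structure are assumptions, not the actual etcd-backup-restore implementation (which is written in Go).

```python
# Hypothetical sketch of delta snapshot revision verification during restore.
# Names and structure are illustrative, NOT the real etcdbrctl code.

class RestoreError(Exception):
    pass

class DeltaSnapshot:
    def __init__(self, start_revision, last_revision, events):
        self.start_revision = start_revision  # first revision in the delta
        self.last_revision = last_revision    # revision expected after replay
        self.events = events                  # list of (revision, key, value)

def apply_delta(current_revision, snap):
    """Replay one delta snapshot and verify the resulting revision."""
    for rev, _key, _value in snap.events:
        current_revision = rev
    if current_revision != snap.last_revision:
        # A strict equality check like this is where an off-by-one would
        # surface as "snapshot revision verification failed".
        raise RestoreError(
            f"snapshot revision verification failed: expected "
            f"{snap.last_revision}, got {current_revision}"
        )
    return current_revision

# A delta whose replay ends one revision higher than its metadata claims:
snap = DeltaSnapshot(101, 110, [(r, "k", "v") for r in range(101, 112)])
try:
    apply_delta(100, snap)
except RestoreError as e:
    print(e)  # expected 110, got 111
```

With a check like this, even a single extra revision in the replayed events aborts the whole restore, which matches the off-by-one behavior described above.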
Etcdbrctl: snapshot revision verification failed for delta snapshot
Describe the bug: When doing a restore from a full backup and incremental backups, we encountered a failure mid-restore:
This is after restoring ~15 of ~30 incremental backups.
It actually occurred at different places across multiple attempts to restore.
- Full backup schedule: `*/30 * * * *`
- Incremental backup schedule: `60s`
- Storage retention policy: `Exponential`
Expected behavior:
I'd either expect the restoration to complete with the same event revision, or if it's okay that the restored revision is a smidge higher, that it could info-log or warn-log and continue.
Ultimately, I'd expect the restoration to work from these incremental backups.
The main question is: is it okay for the revision in the etcd process we're using for the restore to be slightly higher than the delta snapshot expects? Is that actually a fatal error, or is it expected when we've restored a 6.7 GB full backup and ~15 or so 10 MB incremental snapshots?
If it's expected, can we info/warn and continue, or is this actually fatal?
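If a slightly-higher revision really is benign, the strict check could in principle be relaxed along the lines of the sketch below. Everything here is an assumption for illustration (the tolerance value, the function name, and the warn-and-continue behavior are not etcdbrctl features):

```python
import logging

# Hypothetical tolerant variant of the revision check: warn and continue
# when the restored revision overshoots by a small margin, fail otherwise.
# REVISION_TOLERANCE is an assumption, not an etcdbrctl setting.
REVISION_TOLERANCE = 1

def verify_revision(expected, actual, tolerance=REVISION_TOLERANCE):
    """Return True if the restore may continue, raising only on large drift."""
    if actual == expected:
        return True
    if expected < actual <= expected + tolerance:
        logging.warning(
            "restored revision %d is %d higher than expected %d; continuing",
            actual, actual - expected, expected,
        )
        return True
    raise ValueError(
        f"revision drift too large: expected {expected}, got {actual}"
    )

verify_revision(110, 110)  # exact match: continue silently
verify_revision(110, 111)  # off by one: warn-log and continue
```

Whether such a relaxation is actually safe depends on the consistency guarantees the maintainers intend, which is exactly the question being asked here.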
How To Reproduce (as minimally and precisely as possible):
This is the first time I've seen this, but it's been on an etcd cluster with 3 members, ~750 namespaces, a typical full snapshot size of 6 GB, and incremental snapshots between 10 MB and 20 MB.
After taking etcd offline, deleting the member and WAL directories, and restoring, somewhere in the middle of those incremental backups we end up with an event revision that is 1 higher than expected.
Logs:
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
We are using etcdbrctl mostly as a backup agent, alongside a somewhat hand-rolled etcd quorum managed by systemd. We've been using it successfully for several years, but wanted to improve our mean time to recovery with a snapshotting and restore tool.
I really wanted to use it in Server mode, but that would have required completely refactoring or even rewriting our etcd bootstrap to harmonize with the Server. So it runs as a systemd unit that takes backups, and when we have to restore, we run the restore ourselves after our Prometheus alerts tell us the quorum is bogus.