emaildanwilson opened this issue 2 months ago
@emaildanwilson @jvassev thanks for opening the issue.
Are you folks seeing this on the newly released patch version v0.28.1 only, or did you observe this behavior in v0.28.0 and before as well?
Will look into this, since we haven't observed this behavior yet on v0.28.1.
@emaildanwilson @jvassev The Etcd and backup-restore versions you're using are no longer compatible; please see the release notes of v0.25.0. You're using an older version of the Etcd custom image (which uses a base image) with a newer version of etcdbr, which doesn't use a base image.
To make it work, you have two options:
1) Use an etcdbr version from the v0.24.x series, preferably the latest, i.e. v0.24.9, which is compatible with the Etcd image you are using.
2) To use etcdbr v0.28.1, use etcd-wrapper instead of the Etcd custom image.
If this works, you can close the issue. If it doesn't, please let us know :)
Thank you! We'll make the change and see if the issue still happens.
@anveshreddy18 Thanks for the advice! I decided to downgrade etcdbr to v0.24.9 as a first step.
For doing a proper upgrade to etcd-wrapper, is this a good starting point: https://github.com/gardener/etcd-wrapper/blob/main/hack/local-dev/manifests/templates/etcd.sts.template.yaml?
@jvassev @emaildanwilson, I'd say that's a good enough starting point, along with the other files in the template directory. etcd-wrapper is a drop-in replacement for etcd-custom-image. You can also look at the documentation, specifically deployment/configuring-etcd-wrapper.md.
I would also suggest you take a look at etcd-druid, an operator for etcd clusters. It provisions an etcd cluster and the corresponding resources through a simple manifest, and it handles the lifecycle of all the resources necessary to deploy an etcd cluster.
We also have a major release coming up in the next couple of weeks, v0.23.0 of etcd-druid, which makes the operator more robust and provides an (opt-in) webhook that prevents deletion of the Kubernetes resources provisioned for the functioning of the etcd cluster until the cluster is explicitly deleted by deleting the Etcd CR (the custom resource defined by etcd-druid that represents an etcd cluster).
Let us know if v0.24.9 currently works as expected with etcd-custom-image, and if v0.28.0+ works as expected with etcd-wrapper.
We strongly urge you to move away from etcd-custom-image to etcd-wrapper, since we will stop supporting etcd-custom-image very soon, and since etcd-wrapper is significantly better from a development, operations, and security perspective.
/assign
Hi @jvassev @emaildanwilson, thanks for reporting the issue. I took a look at why restoration failed:
FATA[0002] Failed to restore snapshot: failed to restore from the base snapshot: snapshot missing hash but --skip-hash-check=false
It looks like etcd failed to append the hash of the full snapshot while sending the full snapshot to the backup-restore client.
Basically, to take a full snapshot, backup-restore leverages the etcd Snapshot API call, which returns a snapshot of etcd. During restoration, backup-restore checks the hash value, which seems to be missing in this case, hence backup-restore throws an error.
The --skip-hash-check=true flag is meant to be used only when restoration is triggered to restore etcd from a copied db file of etcd, instead of a snapshot taken via the Snapshot API.
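For context, the check that fails here works roughly like this: the etcd Snapshot API appends a SHA-256 digest of the db payload as the last 32 bytes of the stream, and a restore verifies that digest before proceeding. Below is a simplified Python sketch of that verification, not the actual etcd or backup-restore implementation; the 512-byte alignment heuristic for detecting whether a trailing hash is present is an assumption modeled on etcd's behavior.

```python
import hashlib

SHA256_LEN = 32  # bytes appended by the etcd Snapshot API

def has_appended_hash(data: bytes) -> bool:
    # db files are written in 512-byte-aligned chunks, so a trailing
    # 32-byte SHA-256 digest leaves size % 512 == 32
    return len(data) % 512 == SHA256_LEN

def verify_snapshot(data: bytes) -> bool:
    """Return True iff the trailing digest matches the db payload."""
    if not has_appended_hash(data):
        # this is the "snapshot missing hash" case; only a plain
        # copied db file should legitimately look like this
        raise ValueError("snapshot missing hash but --skip-hash-check=false")
    payload, trailer = data[:-SHA256_LEN], data[-SHA256_LEN:]
    return hashlib.sha256(payload).digest() == trailer
```

A snapshot whose upload was cut short (e.g. by an abrupt node restart) loses the trailing digest, so this check can only fail at restoration time unless it is also run when the snapshot is taken.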
We have to detect a corrupt snapshot, or a missing hash on a snapshot taken via the Snapshot API, early, rather than waiting until restoration. I have the following methods in mind:
1) Start an embedded etcd and verify the restoration while taking the full snapshot. This will work, but starting an embedded etcd and waiting for restoration to complete can take some time; still, it would safeguard us from scenarios where restoration fails due to a corrupt snapshot.
2) Compare the hash of the db up to some revision x. But I'm not sure how to calculate the hash of the db up to revision x. Is there an API call available for that? I guess the HashKV API call won't work here, as the value returned by HashKV up to revision x can't equal the hash of a snapshot taken up to revision x (with the appended hash removed): HashKV calculates the hash of all MVCC key-values, whereas a snapshot is a snapshot of the etcd db, which also contains cluster information, so the hashes will not be the same.
Running with the suggested versions:
europe-docker.pkg.dev/gardener-project/public/gardener/etcd:v3.4.13-bootstrap-1
europe-docker.pkg.dev/gardener-project/releases/gardener/etcdbrctl:v0.24.9
Occasionally this happens:
FATA[0019] Failed to restore snapshot: snapshot revision verification failed for delta snapshot Incr-55795465-55796279-1721466724.gz : mismatched event revision while applying delta snapshot, expected 55796279 but applied 55797093
Running with --skip-hash-check sometimes leads to segfaults.
I think it may have to do with abrupt node restarts, since we mostly see this on spot instances.
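The revision verification that fails above can be sketched as follows. Delta snapshot names in the logs follow the pattern Incr-<startRev>-<endRev>-<unixTimestamp>.gz, and after applying a delta snapshot the restorer expects the last applied event revision to equal the end revision encoded in the name. This is a simplified, hypothetical Python sketch of that check, not the actual etcdbr code:

```python
import re

# Delta snapshot names (as seen in the logs) look like
# Incr-<startRev>-<endRev>-<unixTimestamp>.gz
_DELTA_NAME = re.compile(r"^Incr-(\d+)-(\d+)-(\d+)\.gz$")

def parse_delta_name(name: str):
    """Extract (start_revision, end_revision, timestamp) from a name."""
    m = _DELTA_NAME.match(name)
    if m is None:
        raise ValueError(f"not a delta snapshot name: {name}")
    start, end, ts = (int(g) for g in m.groups())
    return start, end, ts

def verify_applied_revision(name: str, applied_rev: int) -> None:
    """After applying a delta snapshot, the last applied event revision
    must equal the end revision encoded in the snapshot name."""
    _, end, _ = parse_delta_name(name)
    if applied_rev != end:
        raise RuntimeError(
            "mismatched event revision while applying delta snapshot, "
            f"expected {end} but applied {applied_rev}")
```

In the failure above, applying Incr-55795465-55796279-1721466724.gz left the store at revision 55797093 instead of the expected 55796279, which is what trips this check.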
Also including the full log from an attempted restore:
INFO[0000] Temporary directory /tmp/snapshots does not exist. Creating it...
INFO[0000] Finding latest set of snapshot to recover from...
INFO[0000] Restoring from base snapshot: Full-00000000-55795464-1721466605.gz actor=restorer
INFO[0000] start decompressing the snapshot with gzip compressionPolicy actor=de-compressor
INFO[0011] successfully fetched data of base snapshot in 11.019503244 seconds [CompressionPolicy:gzip] actor=restorer
{"level":"info","ts":1721472682.9959612,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":55788239}
{"level":"info","ts":1721472683.3767295,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
INFO[0012] Attempting to apply 29 delta snapshots for restoration. actor=restorer
INFO[0012] Creating temporary directory /tmp/restoration-snapshots for persisting delta snapshots locally. actor=restorer
INFO[0012] Starting an embedded etcd server... actor=restorer
{"level":"info","ts":"2024-07-20T10:51:23.387Z","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:0"]}
{"level":"info","ts":"2024-07-20T10:51:23.387Z","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["http://localhost:0"]}
{"level":"info","ts":"2024-07-20T10:51:23.387Z","caller":"embed/etcd.go:302","msg":"starting an etcd server","etcd-version":"3.4.13","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.20.3","go-os":"linux","go-arch":"amd64","max-cpu-set":20,"max-cpu-available":20,"member-initialized":true,"name":"default","data-dir":"/tmp/data","wal-dir":"","wal-dir-dedicated":"","member-dir":"/tmp/data/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:0"],"listen-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"],"listen-client-urls":["http://localhost:0"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":8589934592,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"30m0s","auto-compaction-interval":"30m0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2024-07-20T10:51:23.511Z","caller":"etcdserver/backend.go:80","msg":"opened backend db","path":"/tmp/data/member/snap/db","took":"123.637015ms"}
{"level":"info","ts":"2024-07-20T10:51:23.511Z","caller":"etcdserver/server.go:451","msg":"recovered v2 store from snapshot","snapshot-index":1,"snapshot-size":"6.5 kB"}
{"level":"info","ts":"2024-07-20T10:51:23.518Z","caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":55788239}
{"level":"info","ts":"2024-07-20T10:51:23.888Z","caller":"etcdserver/server.go:469","msg":"recovered v3 backend from snapshot","backend-size-bytes":461660160,"backend-size":"462 MB","backend-size-in-use-bytes":418439168,"backend-size-in-use":"418 MB"}
{"level":"info","ts":"2024-07-20T10:51:23.895Z","caller":"etcdserver/raft.go:536","msg":"restarting local member","cluster-id":"cdf818194e3a8c32","local-member-id":"8e9e05c52164694d","commit-index":1}
{"level":"info","ts":"2024-07-20T10:51:23.895Z","caller":"raft/raft.go:1530","msg":"8e9e05c52164694d switched to configuration voters=(10276657743932975437)"}
{"level":"info","ts":"2024-07-20T10:51:23.895Z","caller":"raft/raft.go:700","msg":"8e9e05c52164694d became follower at term 1"}
{"level":"info","ts":"2024-07-20T10:51:23.895Z","caller":"raft/raft.go:383","msg":"newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 1, commit: 1, applied: 1, lastindex: 1, lastterm: 1]"}
{"level":"info","ts":"2024-07-20T10:51:23.895Z","caller":"membership/cluster.go:256","msg":"recovered/added member from store","cluster-id":"cdf818194e3a8c32","local-member-id":"8e9e05c52164694d","recovered-remote-peer-id":"8e9e05c52164694d","recovered-remote-peer-urls":["http://localhost:2380"]}
{"level":"warn","ts":"2024-07-20T10:51:23.896Z","caller":"auth/store.go:1366","msg":"simple token is not cryptographically signed"}
{"level":"info","ts":"2024-07-20T10:51:23.896Z","caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":55788239}
{"level":"info","ts":"2024-07-20T10:51:24.252Z","caller":"etcdserver/quota.go:126","msg":"enabled backend quota","quota-name":"v3-applier","quota-size-bytes":8589934592,"quota-size":"8.6 GB"}
{"level":"info","ts":"2024-07-20T10:51:24.253Z","caller":"etcdserver/server.go:803","msg":"starting etcd server","local-member-id":"8e9e05c52164694d","local-server-version":"3.4.13","cluster-version":"to_be_decided"}
{"level":"info","ts":"2024-07-20T10:51:24.253Z","caller":"etcdserver/server.go:669","msg":"started as single-node; fast-forwarding election ticks","local-member-id":"8e9e05c52164694d","forward-ticks":9,"forward-duration":"900ms","election-ticks":10,"election-timeout":"1s"}
{"level":"info","ts":"2024-07-20T10:51:24.253Z","caller":"embed/etcd.go:244","msg":"now serving peer/client/metrics","local-member-id":"8e9e05c52164694d","initial-advertise-peer-urls":["http://localhost:0"],"listen-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"],"listen-client-urls":["http://localhost:0"],"listen-metrics-urls":[]}
{"level":"info","ts":"2024-07-20T10:51:24.253Z","caller":"embed/etcd.go:579","msg":"serving peer traffic","address":"127.0.0.1:33861"}
{"level":"info","ts":"2024-07-20T10:51:24.996Z","caller":"raft/raft.go:923","msg":"8e9e05c52164694d is starting a new election at term 1"}
{"level":"info","ts":"2024-07-20T10:51:25.003Z","caller":"raft/raft.go:713","msg":"8e9e05c52164694d became candidate at term 2"}
{"level":"info","ts":"2024-07-20T10:51:25.003Z","caller":"raft/raft.go:824","msg":"8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 2"}
{"level":"info","ts":"2024-07-20T10:51:25.003Z","caller":"raft/raft.go:765","msg":"8e9e05c52164694d became leader at term 2"}
{"level":"info","ts":"2024-07-20T10:51:25.003Z","caller":"raft/node.go:325","msg":"raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 2"}
{"level":"info","ts":"2024-07-20T10:51:25.003Z","caller":"etcdserver/server.go:2037","msg":"published local member to cluster through raft","local-member-id":"8e9e05c52164694d","local-member-attributes":"{Name:default ClientURLs:[http://localhost:0]}","request-path":"/0/members/8e9e05c52164694d/attributes","cluster-id":"cdf818194e3a8c32","publish-timeout":"7s"}
INFO[0014] Embedded server is ready to listen client at: 127.0.0.1:44425 actor=restorer
INFO[0014] Applying delta snapshots... actor=restorer
{"level":"info","ts":"2024-07-20T10:51:25.003Z","caller":"etcdserver/server.go:2528","msg":"setting up initial cluster version","cluster-version":"3.4"}
{"level":"info","ts":"2024-07-20T10:51:25.003Z","caller":"membership/cluster.go:558","msg":"set initial cluster version","cluster-id":"cdf818194e3a8c32","local-member-id":"8e9e05c52164694d","cluster-version":"3.4"}
{"level":"info","ts":"2024-07-20T10:51:25.003Z","caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.4"}
{"level":"info","ts":"2024-07-20T10:51:25.003Z","caller":"etcdserver/server.go:2560","msg":"cluster version is updated","cluster-version":"3.4"}
{"level":"info","ts":"2024-07-20T10:51:25.003Z","caller":"embed/serve.go:139","msg":"serving client traffic insecurely; this is strongly discouraged!","address":"127.0.0.1:44425"}
INFO[0014] Applying first delta snapshot Incr-55795465-55795626-1721466663.gz actor=restorer
INFO[0014] start decompressing the snapshot with gzip compressionPolicy actor=de-compressor
INFO[0014] successfully decompressed data of delta snapshot in 0.017085131 seconds [CompressionPolicy:gzip] actor=restorer
{"level":"warn","ts":"2024-07-20T10:51:25.610Z","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"382.210946ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"\\000\" range_end:\"\\000\" limit:1 sort_order:DESCEND sort_target:MOD ","response":"range_response_count:1 size:700"}
{"level":"info","ts":"2024-07-20T10:51:25.610Z","caller":"traceutil/trace.go:145","msg":"trace[970151589] range","detail":"{range_begin:\u0000; range_end:\u0000; response_count:1; response_revision:55795464; }","duration":"382.284741ms","start":"2024-07-20T10:51:25.228Z","end":"2024-07-20T10:51:25.610Z","steps":["trace[970151589] 'range keys from in-memory index tree' (duration: 25.377667ms)","trace[970151589] 'range keys from bolt db' (duration: 291.628859ms)","trace[970151589] 'filter and sort the key-value pairs' (duration: 49.852674ms)"]}
INFO[0014] Applying first delta snapshot Incr-55795465-55795626-1721466663.gz actor=restorer
INFO[0014] Fetcher #1 fetching delta snapshot Incr-55796280-55796561-1721466784.gz actor=restorer
INFO[0014] Fetcher #3 fetching delta snapshot Incr-55795465-55796279-1721466724.gz actor=restorer
INFO[0014] Fetcher #4 fetching delta snapshot Incr-55796560-55796771-1721466844.gz actor=restorer
INFO[0014] Fetcher #2 fetching delta snapshot Incr-55796562-55796773-1721466844.gz actor=restorer
INFO[0014] Fetcher #6 fetching delta snapshot Incr-55796279-55796559-1721466783.gz actor=restorer
INFO[0014] Starting to handle an alarm... actor=restorer
INFO[0014] Fetcher #5 fetching delta snapshot Incr-55795627-55796278-1721466723.gz actor=restorer
INFO[0014] Fetcher #1 fetching delta snapshot Incr-55796772-55797078-1721466904.gz actor=restorer
INFO[0015] Fetcher #4 fetching delta snapshot Incr-55796774-55797078-1721466904.gz actor=restorer
INFO[0015] Fetcher #6 fetching delta snapshot Incr-55797079-55797112-1721467204.gz actor=restorer
INFO[0015] Fetcher #1 fetching delta snapshot Incr-55797079-55797112-1721467205.gz actor=restorer
INFO[0015] Fetcher #2 fetching delta snapshot Incr-55797113-55797756-1721467265.gz actor=restorer
INFO[0015] Fetcher #5 fetching delta snapshot Incr-55797113-55797762-1721467265.gz actor=restorer
INFO[0015] Reading snapshot contents Incr-55795627-55796278-1721466723.gz from raw snapshot file /tmp/restoration-snapshots/Incr-55795627-55796278-1721466723.gz actor=restorer
INFO[0015] start decompressing the snapshot with gzip compressionPolicy actor=de-compressor
INFO[0015] successfully decompressed data of delta snapshot in 0.004005825 seconds [CompressionPolicy:gzip] actor=restorer
INFO[0015] Applying delta snapshot Incr-55795627-55796278-1721466723.gz [2/29] actor=restorer
INFO[0015] Fetcher #4 fetching delta snapshot Incr-55797757-55798069-1721467325.gz actor=restorer
INFO[0015] Fetcher #6 fetching delta snapshot Incr-55797763-55798073-1721467325.gz actor=restorer
INFO[0015] Fetcher #1 fetching delta snapshot Incr-55798070-55798473-1721467385.gz actor=restorer
INFO[0016] Removing temporary delta snapshot events file /tmp/restoration-snapshots/Incr-55795627-55796278-1721466723.gz for snapshot Incr-55795627-55796278-1721466723.gz actor=restorer
INFO[0016] Fetcher #4 fetching delta snapshot Incr-55798074-55798478-1721467385.gz actor=restorer
INFO[0016] Fetcher #6 fetching delta snapshot Incr-55798474-55798783-1721467445.gz actor=restorer
INFO[0016] Fetcher #5 fetching delta snapshot Incr-55798479-55798783-1721467445.gz actor=restorer
INFO[0016] Fetcher #4 fetching delta snapshot Incr-55798784-55798983-1721467745.gz actor=restorer
INFO[0016] Fetcher #1 fetching delta snapshot Incr-55798784-55798984-1721467746.gz actor=restorer
INFO[0016] Fetcher #5 fetching delta snapshot Incr-55798984-55799606-1721467806.gz actor=restorer
INFO[0016] Fetcher #6 fetching delta snapshot Incr-55798985-55799608-1721467806.gz actor=restorer
INFO[0016] Fetcher #4 fetching delta snapshot Incr-55799607-55799858-1721467866.gz actor=restorer
INFO[0016] Fetcher #5 fetching delta snapshot Incr-55799609-55799859-1721467866.gz actor=restorer
INFO[0016] Fetcher #4 fetching delta snapshot Incr-55799859-55800221-1721467926.gz actor=restorer
INFO[0016] Fetcher #1 fetching delta snapshot Incr-55799860-55800221-1721467926.gz actor=restorer
INFO[0016] Fetcher #3 fetching delta snapshot Incr-55800222-55800511-1721467986.gz actor=restorer
INFO[0016] Reading snapshot contents Incr-55795465-55796279-1721466724.gz from raw snapshot file /tmp/restoration-snapshots/Incr-55795465-55796279-1721466724.gz actor=restorer
INFO[0016] start decompressing the snapshot with gzip compressionPolicy actor=de-compressor
INFO[0016] successfully decompressed data of delta snapshot in 0.009425055 seconds [CompressionPolicy:gzip] actor=restorer
INFO[0016] Applying delta snapshot Incr-55795465-55796279-1721466724.gz [3/29] actor=restorer
INFO[0016] Fetcher #6 fetching delta snapshot Incr-55800222-55800511-1721467987.gz actor=restorer
INFO[0018] Cleanup complete actor=restorer
ERRO[0018] Restoration failed. actor=restorer
{"level":"info","ts":"2024-07-20T10:51:29.344Z","caller":"etcdserver/server.go:1485","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"8e9e05c52164694d","current-leader-member-id":"8e9e05c52164694d"}
INFO[0018] Closing handleAlarm... actor=restorer
{"level":"info","ts":"2024-07-20T10:51:29.359Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"default","data-dir":"/tmp/data","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
{"level":"info","ts":"2024-07-20T10:51:29.359Z","caller":"etcdserver/server.go:1485","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"8e9e05c52164694d","current-leader-member-id":"8e9e05c52164694d"}
{"level":"info","ts":"2024-07-20T10:51:29.359Z","caller":"embed/etcd.go:558","msg":"stopping serving peer traffic","address":"127.0.0.1:33861"}
{"level":"info","ts":"2024-07-20T10:51:29.359Z","caller":"embed/etcd.go:565","msg":"stopped serving peer traffic","address":"127.0.0.1:33861"}
{"level":"info","ts":"2024-07-20T10:51:29.359Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"default","data-dir":"/tmp/data","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
FATA[0018] Failed to restore snapshot: snapshot revision verification failed for delta snapshot Incr-55795465-55796279-1721466724.gz : mismatched event revision while applying delta snapshot, expected 55796279 but applied 55797093
Describe the bug: Incomplete full backups can get stored in S3, and the automatic restore then fails on startup.
Expected behavior: A backup that is missing its hash should never get copied to S3, or the restore should offer an option to skip incomplete backups.
How To Reproduce (as minimally and precisely as possible): We don't have specific reproduction steps, but we have hit this issue almost daily when running on spot instances. We suspect that an incomplete file is written and then packaged up and sent to S3, possibly after a restart, but that's just a guess based on the behavior we're seeing.
Logs:
Backup files in S3 and their sizes. Note that the latest full backup is smaller than the previous one.
Command to create the restore
Restore error message
If run with --skip-hash-check=true, it segfaults:
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
backup sidecar configuration:
If we manually delete the bad full backup then the restore completes successfully.
@jvassev can provide more details if needed.