Closed tobgu closed 2 years ago
Easily reproducible with script:
#!/bin/bash
./bin/etcd --snapshot-count=5 &
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
kill -9 %1
./bin/etcd --snapshot-count=5
Really great analysis and repro! @tobgu would you be willing to send PR with a fix?
I would have to spend time trying to understand the V2/V3 sync logic for that and why it is failing in this case. If nobody else knows of a quick solution/fix to this I can do that but I was hoping that someone well versed in this area would pick it up. ;-)
Heh, if only I could say that I understand it better. From what I investigated, the problem is due to the consistent index (CI) not being saved when applying an Alarms entry. I managed to confirm that this behavior was introduced between v3.5.2 and v3.5.3, meaning that this is a bug in the fix for the data inconsistency issue :(.
As a result, the snapshot index is greater than the CI, which during etcd bootstrap is interpreted as etcd having crashed while receiving a snapshot from the leader. Etcd panics because it assumes that there should be a v3 snapshot received from the leader.
Ok, looks like v3.5.3 assumes that performing applyV3 without an error means that the consistent index was committed. However, that's not true in the case of Alarms, as it doesn't touch the db, meaning it doesn't open any transactions, which would lead to executing backend hooks.
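The mechanism described above can be illustrated with a toy model. This is only a sketch, not etcd code: the `entry` and `toyServer` types, the `touchesDB` flag, and the snapshot trigger are simplified assumptions made up for illustration. It shows how entries that never open a backend transaction let the snapshot index run ahead of the persisted consistent index.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a raft log entry. touchesDB is false
// for entries that never open a backend transaction (e.g. listing alarms).
type entry struct {
	index     uint64
	touchesDB bool
}

// toyServer models only the two indexes discussed in the thread.
type toyServer struct {
	persistedCI   uint64 // consistent index as stored in the v3 db
	snapshotIndex uint64 // index of the latest *.snap file
	snapshotCount uint64 // analogous to --snapshot-count
}

// apply mimics the behaviour described above: the consistent index is
// persisted only as a side effect of a backend transaction, while the
// snapshot trigger looks at the applied index alone.
func (s *toyServer) apply(e entry) {
	if e.touchesDB {
		s.persistedCI = e.index
	}
	if e.index-s.snapshotIndex > s.snapshotCount {
		s.snapshotIndex = e.index
	}
}

// recoverable reports whether the snapshot is not ahead of the db, which is
// the condition a restart after a hard kill would need.
func (s *toyServer) recoverable() bool {
	return s.snapshotIndex <= s.persistedCI
}

// replay applies all entries to a fresh toy server.
func replay(entries []entry, snapshotCount uint64) *toyServer {
	s := &toyServer{snapshotCount: snapshotCount}
	for _, e := range entries {
		s.apply(e)
	}
	return s
}

func main() {
	// Six alarm-only entries with a snapshot count of 5, as in the repro script.
	var entries []entry
	for i := uint64(1); i <= 6; i++ {
		entries = append(entries, entry{index: i})
	}
	s := replay(entries, 5)
	fmt.Println("recoverable after alarm-only entries:", s.recoverable())

	// Any entry that opens a backend transaction repairs the state.
	s.apply(entry{index: 7, touchesDB: true})
	fmt.Println("recoverable after a real write:", s.recoverable())
}
```

In this model the bad state disappears as soon as one entry that touches the db is applied, which matches the "transitory" behaviour reported in the thread.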
Backend hooks bad, again.
Some more details: the issue is triggered when etcd crashes after a snapshot that was followed only by Alarms entries. It causes a transitory bad state of the db, which prevents etcd from recovering after a crash. It's transitory because the db will be fixed after any WAL entry that is not Alarms.
Impact: Low as the issue should be extremely rare.
It requires:
Still, it brings up the issue of how incomprehensible the etcd apply code is. I think this will be the third time we try to fix just this part of the code in v3.5.X.
I'm not sure I agree with the conclusion that this is an extremely rare case.
Given how the liveness and readiness probes are set up in the Bitnami Helm chart referred to above, all that is needed for this to happen is to let time pass without any writes to the DB. Once enough time has passed for a snapshot to have been written (a couple of days) and the machine restarts, ETCD will not come up again. The restart could happen because of power outages, OS updates, fat fingers, and so on.
My current workaround is to not use the default probes in the chart but rather hit the /health HTTP endpoint instead (which doesn't seem to suffer from the same problem).
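As a rough illustration of what an HTTP-based check against /health might look like, here is a minimal sketch. It assumes the endpoint answers with a JSON body containing a string field `health` set to `"true"` when the member is healthy; the helper names `parseHealth` and `probe` and the way the base URL is passed in are hypothetical and not taken from the chart.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// healthBody is the assumed shape of the /health response body.
type healthBody struct {
	Health string `json:"health"`
}

// parseHealth decides health from an HTTP status code and response body.
func parseHealth(status int, body []byte) (bool, error) {
	if status != http.StatusOK {
		return false, nil
	}
	var h healthBody
	if err := json.Unmarshal(body, &h); err != nil {
		return false, err
	}
	return h.Health == "true", nil
}

// probe performs a single GET against <baseURL>/health.
func probe(client *http.Client, baseURL string) (bool, error) {
	resp, err := client.Get(baseURL + "/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return parseHealth(resp.StatusCode, body)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: probe <base-url>, e.g. probe http://127.0.0.1:2379")
		return
	}
	client := &http.Client{Timeout: 2 * time.Second}
	ok, err := probe(client, os.Args[1])
	if err != nil || !ok {
		fmt.Fprintln(os.Stderr, "unhealthy:", err)
		os.Exit(1)
	}
	fmt.Println("healthy")
}
```

In a Kubernetes setup the same check would more commonly be expressed as an HTTP probe in the pod spec rather than as a separate binary; the sketch only makes the logic explicit.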
Can you link to how Bitnami does healthcheck?
cc @ahrtr @ptabor for their opinion
Sure!
This is how the default liveness and readiness probes are set up (the ones I've now replaced with HTTP probes against /health): https://github.com/bitnami/charts/blob/master/bitnami/etcd/templates/statefulset.yaml#L265-L289
And this is the shell script that is called by the above probes: https://github.com/bitnami/bitnami-docker-etcd/blob/master/3.5/debian-11/rootfs/opt/bitnami/scripts/etcd/healthcheck.sh
Thanks @tobgu for raising this issue (a real issue)!
It turns out to be a regression introduced in 3.5.4 in https://github.com/etcd-io/etcd/pull/13854 (https://github.com/etcd-io/etcd/pull/13908). The alarm list is the only exception that doesn't move consistent_index forward. The reproduction steps are as simple as:
etcd --snapshot-count=5 &
for i in {1..6}; do etcdctl alarm list; done
kill -9 <etcd_pid>
etcd
Lock batch_tx in (*AlarmStore) Get, so that it calls the txPostLockInsideApplyHook, and accordingly move consistent_index forward.
Just need to add code something like below into (*AlarmStore) Get. It's the simplest change, but it looks ugly, because it doesn't make sense for alarmList to acquire the batchTx lock at all.
tx := s.be.BatchTx()
tx.LockInsideApply()
defer tx.Unlock()
Change server.go#L1853-L1855 to something like below. I don't know why I did not do this previously. Will think about this more and get back if I recall something new.
newIndex := s.consistIndex.ConsistentIndex()
if newIndex < e.Index {
    s.consistIndex.SetConsistentIndex(e.Index, e.Term)
}
Change the alarmList to use linearizableReadLoop, so that it doesn't go through the raft & applying workflow at all. Accordingly it will not advance the applyIndex at all, and the snapshot Index will not be advanced.
Get rid of the OnPreCommitUnsafe added in 3.5.0 and the txPostLockInsideApplyHook/LockInsideApply/LockOutsideApply added in main & 3.5.3 & 3.5.4.
I will take care of the long-term solution.
For short-term, solution 2 above looks the best for now. Anyone feel free to deliver a PR.
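The guard from the server.go proposal above can be exercised in isolation with a small sketch. The `toyIndex` type and the `applyWithGuard` helper are hypothetical stand-ins invented for illustration; only the `ConsistentIndex`/`SetConsistentIndex` method names mirror the snippet quoted in the thread.

```go
package main

import "fmt"

// toyIndex is a hypothetical stand-in for the server's consistent index holder.
type toyIndex struct {
	index uint64
	term  uint64
}

func (c *toyIndex) ConsistentIndex() uint64 { return c.index }

func (c *toyIndex) SetConsistentIndex(index, term uint64) {
	c.index, c.term = index, term
}

// applyWithGuard runs the apply function for an entry and then advances the
// consistent index if the apply step left it behind the entry's index.
func applyWithGuard(ci *toyIndex, index, term uint64, apply func()) {
	apply()
	if ci.ConsistentIndex() < index {
		ci.SetConsistentIndex(index, term)
	}
}

func main() {
	ci := &toyIndex{}
	noop := func() {} // models an entry that never touches the backend

	for i := uint64(1); i <= 6; i++ {
		applyWithGuard(ci, i, 2, noop)
	}
	fmt.Println("consistent index after alarm-only entries:", ci.ConsistentIndex())
}
```

The comparison keeps the guard a no-op for entries whose apply step already moved the index forward, so it only affects entries that bypass the backend.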
Is anyone working on this issue?
I'm not working on this. If no one else picks it up I might be able to find some time for it in a couple of weeks. Right now my priorities do not allow it. I'm happy with the current workaround (a change in the health check used) as a short term solution in terms of stability but would of course want this to be fixed as a proper long term solution in order to not have to worry about triggering the behaviour experienced in some other way.
In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28
This solution fixed our problem in a production etcd configuration. Just commenting here so more people can follow. Thanks!
@ahrtr You are a life saver!!!!! This solved my issue like magic. Thank you so much.
What happened?
With a cluster of three nodes set up using https://github.com/bitnami/charts/blob/master/bitnami/etcd/README.md, it was noticed that the ETCD nodes failed to start after a forced reboot of the underlying worker nodes. A graceful shutdown will not result in this issue.
The logs indicated a mismatch in the raft log index between the v2 *.snap files and the v3 db file where the index of the snap files was higher than that of the v3 db file causing ETCD to look for a snap.db file that did not exist (see logs).
The index of the snap file was derived from the file name (e.g. 0000000000000017-0000000000124f8c.snap), while the consistent_index of the v3 db was extracted using bbolt: bbolt get db meta consistent_index | hexdump => 0xb4903.
So far the issue looked very much like what is described in #11949. The "fix" described in that issue to get the cluster up and running again also worked: remove/move the *.snap files.
Worth mentioning: This cluster had not had any writes to it for several weeks ahead of the reboot. The data in it is mostly read. Doing a proper write to the cluster will set the consistent_index of the v3 DB to an up-to-date value of the raft index.
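The comparison between the snap file name and the db's consistent_index described above can be sketched in a few lines. The sketch assumes the snap file name encodes the term and the index as two hexadecimal fields separated by a dash; the helper names `parseSnapName` and `snapshotAhead` are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSnapName extracts the two hexadecimal fields from a name such as
// "0000000000000017-0000000000124f8c.snap" (assumed to be term and index).
func parseSnapName(name string) (term, index uint64, err error) {
	base := strings.TrimSuffix(name, ".snap")
	parts := strings.SplitN(base, "-", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("unexpected snap file name %q", name)
	}
	if term, err = strconv.ParseUint(parts[0], 16, 64); err != nil {
		return 0, 0, err
	}
	if index, err = strconv.ParseUint(parts[1], 16, 64); err != nil {
		return 0, 0, err
	}
	return term, index, nil
}

// snapshotAhead reports whether the snapshot index is greater than the
// consistent index stored in the v3 db.
func snapshotAhead(snapIndex, consistentIndex uint64) bool {
	return snapIndex > consistentIndex
}

func main() {
	// Values taken from the report above.
	_, snapIndex, err := parseSnapName("0000000000000017-0000000000124f8c.snap")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	const consistentIndex = 0xb4903
	fmt.Printf("snapshot index %#x, consistent_index %#x, snapshot ahead: %v\n",
		snapIndex, uint64(consistentIndex), snapshotAhead(snapIndex, consistentIndex))
}
```

With the values from the report, the snapshot index is the larger of the two, which is the condition that led ETCD to look for a snap.db file.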
After some investigation into why this index difference between the snapshots and the v3 store occurred, it was found that the health check executed regularly by Kubernetes was the reason for the version drift.
The health and readiness check regularly executes etcdctl endpoint health to determine if the cluster is healthy or not. In ETCD 3.4 this was a simple GET on the health key, but since https://github.com/etcd-io/etcd/pull/12150 it also includes checking the alarm list to verify that it is empty. For some reason listing the alarms also triggers a write/apply (see attached logs). And for some reason this apply is only applied to the V2 store, not the V3 store. This causes the index in the V2 store to drift from the V3 store until a proper write is performed. I have not dug into the reason for why the write is performed and why it is missing from the V3 store.
The behaviour is only present in this form in 3.5 since the health check in 3.4 does not include listing the alarms.
The problem is easy to reproduce locally. See description.
What did you expect to happen?
I would always expect ETCD to be able to start properly regardless of how the shutdown was done.
How can we reproduce it (as minimally and precisely as possible)?
Locally:
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output