cc @mitake
This may be caused by the same bug.
Fixed by #11652.
@tangcong Thanks again for fixing this! Could you expand on the following statement? I am trying to understand if / how etcd fails to apply a command but reports no error. Does the etcd client receive an error when this happens?
After that, if the leader's auth revision is smaller than a follower's auth revision, the follower will fail to apply the command, and there won't be any error message in the etcd log.
```go
func (a *applierV3backend) Apply(r *pb.InternalRaftRequest) *applyResult {
	ar := &applyResult{}
	defer func(start time.Time) {
		warnOfExpensiveRequest(a.s.getLogger(), start, &pb.InternalRaftStringer{Request: r}, ar.resp, ar.err)
	}(time.Now())
	// ... dispatch to the concrete apply handler; any failure is recorded in ar.err
	return ar
}
```
The ar.err message is "auth revision is old" when the auth revision is inconsistent. I think it is worth improving this by adding an error log and an etcd_server_proposals_fail_applied_total metric. What do you think? I can submit another PR to do that.
The client did not receive an error because the command applied successfully on the node the client was connected to.
@jingyih
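For illustration only, here is a minimal sketch of the kind of change being proposed, reusing the identifiers from the Apply snippet above (the zap-based logging call and the exact message wording are assumptions, not the actual patch):

```go
	defer func(start time.Time) {
		warnOfExpensiveRequest(a.s.getLogger(), start, &pb.InternalRaftStringer{Request: r}, ar.resp, ar.err)
		// Sketch: surface apply failures such as "auth revision is old" in the
		// server log instead of only recording them silently in ar.err.
		if ar.err != nil {
			if lg := a.s.getLogger(); lg != nil {
				lg.Warn("failed to apply request", zap.Error(ar.err))
			}
		}
	}(time.Now())
```

A counter could be incremented at the same place if a metric is added.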
I see. So the client may or may not see the "auth revision is old" error, depending on which etcd member in the cluster serves it. It is the same raft log entry replicated to all 3 etcd members, but they apply it differently because their auth store revisions are inconsistent. It feels like the inconsistency issue is amplified, in the sense that it starts with re-applying one or a few auth-related raft log entries, but later this leads to growing inconsistency in the mvcc store.
Adding a log warning when "auth revision is old" happens sounds good to me.
Not sure about the etcd_server_proposals_fail_applied_total metric. Technically, the raft entry is successfully applied (but differently on each server). I checked the existing metrics; there is etcd_mvcc_put_total. (EDIT: I just realized this is not failing in mvcc.) Maybe we could have something like etcd_debugging_mvcc_failed_put_total? The debugging prefix just means the metric is new and not stable yet; later we can choose to make it stable or remove it, depending on how useful it actually turns out to be.
When auth store inconsistency leads to inconsistency in the mvcc store, we should be able to tell from mvcc metrics such as etcd_mvcc_put_total? @tangcong
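As a rough sketch of what such a metric could look like (the metric name, help text, and package placement are assumptions, not an existing etcd metric), a counter in the etcd_debugging namespace would be defined with the Prometheus client roughly like this:

```go
package mvcc

import "github.com/prometheus/client_golang/prometheus"

// failedPutTotal is a hypothetical counter sketching the
// etcd_debugging_mvcc_failed_put_total idea discussed above.
var failedPutTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "etcd_debugging",
	Subsystem: "mvcc",
	Name:      "failed_put_total",
	Help:      "Total number of put operations that failed to apply.",
})

func init() {
	prometheus.MustRegister(failedPutTotal)
}
```

The apply path would then call failedPutTotal.Inc() whenever a put is rejected.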
Yes, the etcd_mvcc_put_total metric is useful; it differs between etcd members when the auth store is inconsistent. However, it also differs when there are many write requests, so it is a little difficult for us to choose a reasonable alarm threshold when we configure alarm rules. @jingyih
Right, I understand that etcd_mvcc_put_total is different on each member, depending on the progress of each member's apply. When the auth store revision is corrupted on a member, will all subsequent key-value requests on this member be rejected because the auth revision is old? If so, can we expect that mvcc metric to diverge very fast?
@tangcong Maybe also consider trying the alpha feature --experimental-corrupt-check-time if you upgrade to v3.4.4 in the future. It does periodic corruption checks among the servers in a cluster.
@jingyih Yes, we can expect that mvcc metric to diverge fast. At that time, our etcd monitoring system issued an alert by comparing the number of keys on each node.
When will this bugfix be backported to release-3.3 / release-3.4? @jingyih
@jingyih I tried to enable the corruption check and found that it is very slow and often times out when there are a lot of keys (for example, 1 million keys). The timeout is unmodifiable. Can the timeout parameter be made configurable?
Unfortunately the timeout is not configurable for now. There should be 5+ seconds for each remote API call that fetches the hash from a peer; is that not enough? Could you try etcdctl endpoint hashkv and see how long it takes?
#10893 (incremental corruption check) sounds good.
Yes, all ideas on making the corruption check in etcd better are welcome.
etcd version is 3.4.3, three nodes; the initial corruption check takes 30 seconds.
14730644:Mar 4 00:46:39 localhost etcd[7346]: 2a111ffbb45ec018 starting initial corruption check with timeout 15s...
14730689:Mar 4 00:47:09 localhost etcd[7346]: 2a111ffbb45ec018 succeeded on initial corruption checking: no corruption
14730691:Mar 4 00:47:09 localhost etcd[7346]: enabled corruption checking with 3m0s interval
The periodic corruption check also produces error logs:
Mar 04 00:47:09 VM-0-105-ubuntu etcd[7346]: 2a111ffbb45ec018 hash-kv error "context deadline exceeded" on peer "https://x.x.x.149:2380" with revision 395483904
Mar 04 00:46:54 VM-0-105-ubuntu etcd[7346]: 2a111ffbb45ec018 hash-kv error "context deadline exceeded" on peer "https://x.x.x.53:2380" with revision 395483904
However, etcdctl endpoint hashkv is very fast:
root@VM-0-105-ubuntu:~/# time ETCDCTL_API=3 etcdctl endpoint hashkv --endpoints https://x.x.x.53:2379 --cluster
https://x.105:2379, 1938650550
https://x.53:2379, 885147713
https://x.149:2379, 1965074925
real 0m0.228s
user 0m0.032s
sys 0m0.004s
@jingyih
I see. The leader failed to connect to the other etcd members in the getPeerHashKVs function; the corruption check itself is not expensive when the cluster has 1 million keys. You have fixed it in PR #11636 (v3.4.4). Most of our clusters are now on 3.3.17, so we will try this feature after upgrading, depending on the actual situation. Thanks.
@tangcong Good to hear. Thanks for sharing the info.
What happened:
Recently, our team (the TencentCloud Kubernetes team) encountered a serious etcd data inconsistency bug. Kubernetes resources such as nodes, pods, services, and deployments were not found when using kubectl to get resources, and the cluster did not work when deploying or updating workloads.
How to troubleshoot it:
The cluster status information is as follows. You can see that node-1, node-2, and node-3 have the same raftIndex, but node-2's revision is different from the others. The number of keys per node is also inconsistent; for example, some keys are on the leader but do not exist on a follower node. After adding a simple debug log, we found that the reason the follower node failed to apply a command is that its auth revision is smaller than the leader's.
Since the follower can receive the leader's commands, we excluded cluster split brain and bugs in the raft algorithm implementation. So why is the follower node's auth revision less than the leader's?
We added debug logging and developed a simple chaos monkey tool to reproduce it. After running for a few days, we successfully reproduced the issue. In our debug log, we can see that the consistentIndex is repeated and some commands are applied again when etcd is restarted:
We found that when executing auth commands, the consistent index is not persisted. When some commands (for example, GrantRolePermission) are applied again, they also increase the auth revision again.
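To make the mechanism concrete, here is a minimal, self-contained toy model (plain Go, not etcd code; all names are made up) of why replaying an already-applied auth entry bumps the auth revision again when the consistent index is not persisted for auth commands:

```go
package main

import "fmt"

// Toy model of the apply loop: entries at or below the persisted
// consistentIndex are skipped; every applied auth mutation bumps authRevision.
type member struct {
	consistentIndex uint64 // what has been persisted to the backend
	authRevision    uint64
}

func (m *member) apply(index uint64, persistIndexForAuth bool) {
	if index <= m.consistentIndex {
		return // entry already applied to the backend, skip on replay
	}
	m.authRevision++ // an auth command such as GrantRolePermission
	if persistIndexForAuth {
		m.consistentIndex = index
	}
}

func main() {
	m := &member{}
	m.apply(5, false) // auth entry at raft index 5, index not persisted (the bug)

	// Restart: the backend still holds the old consistent index (0), because it
	// was never persisted for the auth entry, so the same raft entry is replayed.
	m.consistentIndex = 0
	m.apply(5, false)

	fmt.Println("auth revision after replay:", m.authRevision) // prints 2, not 1
}
```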
How to reproduce it (as minimally and precisely as possible):
Now you can see that the auth revision of this node has increased after restarting etcd, even though we didn't perform any auth operation, while on the other nodes, which were not restarted, the auth revision is unchanged.
After that, if the leader's auth revision is smaller than a follower's auth revision, the follower will fail to apply the command, and there won't be any error message in the etcd log. The nodes will then have inconsistent data and different revisions, and you may find that reading from one node succeeds while reading from another does not, like this:
How to fix it:
We will submit a PR to address this serious bug. It will persist the consistentIndex into the backend store when executing auth commands.
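In terms of the toy model above, the fix corresponds to persisting the index for auth commands as well (a fragment reusing the hypothetical member type from the earlier sketch):

```go
m := &member{}
m.apply(5, true) // auth entry applied and consistentIndex persisted as 5
// After a restart, the reloaded consistentIndex is 5, so the replayed entry
// at index 5 is skipped instead of being applied again.
m.apply(5, true)
fmt.Println("auth revision with the fix:", m.authRevision) // prints 1
```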
Impact:
It is possible to encounter data inconsistency/loss in all etcd3 versions when auth is enabled.
The above description is a little unclear, so let me add the following: whether a write request can be applied successfully depends on which node the client is connected to. It has nothing to do with which node is the leader.
For example, there are three nodes (A, B, C); A's auth revision is 1, B's is 2, and C's is 3.
If a client sends a write request through node A, the request entry carries auth revision 1, so nodes B and C fail to apply the entry.
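A minimal sketch of that revision comparison (plain Go for illustration; the function and error names are assumptions, not etcd's actual identifiers): each member rejects an entry whose auth revision in the request header is lower than its own auth store revision.

```go
package main

import (
	"errors"
	"fmt"
)

var errAuthOldRevision = errors.New("auth revision is old")

// checkAuthRevision models the check described above: reject an entry whose
// auth revision in the request header is lower than this member's auth store
// revision.
func checkAuthRevision(headerRev, storeRev uint64) error {
	if headerRev < storeRev {
		return errAuthOldRevision
	}
	return nil
}

func main() {
	entryRev := uint64(1) // the entry was proposed through node A (auth revision 1)
	for _, n := range []struct {
		name     string
		storeRev uint64
	}{{"A", 1}, {"B", 2}, {"C", 3}} {
		if err := checkAuthRevision(entryRev, n.storeRev); err != nil {
			fmt.Printf("node %s: apply fails: %v\n", n.name, err)
		} else {
			fmt.Printf("node %s: apply succeeds\n", n.name)
		}
	}
}
```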