cc @mitake
This may be caused by the same bug.
Fixed by #11652.
@tangcong Thanks again for fixing this! Could you expand on the following statement? I am trying to understand if / how etcd fails to apply a command but reports no error. Does the etcd client receive an error when this happens?
After that, if the leader's auth revision is smaller than a follower's auth revision, the follower will fail to apply the command, and there won't be any error message in the etcd log.
```go
func (a *applierV3backend) Apply(r *pb.InternalRaftRequest) *applyResult {
	ar := &applyResult{}
	defer func(start time.Time) {
		warnOfExpensiveRequest(a.s.getLogger(), start, &pb.InternalRaftStringer{Request: r}, ar.resp, ar.err)
	}(time.Now())
	// ... dispatch to the concrete apply handler; any failure is recorded in ar.err
	return ar
}
```
The ar.err message is "auth revision is old" when the auth revision is inconsistent. I think it is worth improving this by adding an error log and an etcd_server_proposals_fail_applied_total metric. What do you think? I can submit another PR to do that.
The client did not receive an error because the command applied successfully on the node the client was connected to.
@jingyih
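For illustration only, here is a minimal sketch of the kind of change being proposed, reusing the identifiers from the Apply snippet above (the zap-based logging call and the exact message wording are assumptions, not the actual patch):

```go
	defer func(start time.Time) {
		warnOfExpensiveRequest(a.s.getLogger(), start, &pb.InternalRaftStringer{Request: r}, ar.resp, ar.err)
		// Sketch: surface apply failures such as "auth revision is old" in the
		// server log instead of only recording them silently in ar.err.
		if ar.err != nil {
			if lg := a.s.getLogger(); lg != nil {
				lg.Warn("failed to apply request", zap.Error(ar.err))
			}
		}
	}(time.Now())
```

A counter could be incremented at the same place if a metric is added.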
I see. So the client may or may not see the "auth revision is old" error, depending on which etcd member in the cluster serves it. It is the same raft log entry replicated to all 3 etcd members, but they apply it differently because their auth store revisions are inconsistent. It feels like the inconsistency issue is amplified, in the sense that it starts with re-applying one or a few auth-related raft log entries, but later this leads to growing inconsistency in the mvcc store.
Adding a log warning when "auth revision is old" happens sounds good to me.
Not sure about the etcd_server_proposals_fail_applied_total metric. Technically, the raft entry is successfully applied (but differently on each server). I checked the existing metrics; there is etcd_mvcc_put_total. (EDIT: I just realized this is not failing in mvcc.) Maybe we could have something like etcd_debugging_mvcc_failed_put_total? The debugging prefix just means the metric is new and not stable yet; later we can choose to make it stable or remove it, depending on how useful it actually turns out to be.
When auth store inconsistency leads to inconsistency in the mvcc store, we should be able to tell from mvcc metrics such as etcd_mvcc_put_total? @tangcong
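As a rough sketch of what such a metric could look like (the metric name, help text, and package placement are assumptions, not an existing etcd metric), a counter in the etcd_debugging namespace would be defined with the Prometheus client roughly like this:

```go
package mvcc

import "github.com/prometheus/client_golang/prometheus"

// failedPutTotal is a hypothetical counter sketching the
// etcd_debugging_mvcc_failed_put_total idea discussed above.
var failedPutTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "etcd_debugging",
	Subsystem: "mvcc",
	Name:      "failed_put_total",
	Help:      "Total number of put operations that failed to apply.",
})

func init() {
	prometheus.MustRegister(failedPutTotal)
}
```

The apply path would then call failedPutTotal.Inc() whenever a put is rejected.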
Yes, the etcd_mvcc_put_total metric is useful; it differs between etcd members when the auth store is inconsistent. However, it also differs when there are many write requests, so it is a little difficult for us to choose a reasonable alarm threshold when we configure alarm rules. @jingyih
Right, I understand that etcd_mvcc_put_total is different on each member, depending on the progress of each member's apply. When the auth store revision is corrupted on a member, will all subsequent key-value requests on this member be rejected because the auth revision is old? If so, can we expect that mvcc metric to diverge very fast?
@tangcong Maybe also consider trying the alpha feature --experimental-corrupt-check-time if you upgrade to v3.4.4 in the future. It does periodic corruption checks among the servers in a cluster.
@jingyih Yes, we can expect that mvcc metric to diverge fast. At that time, our etcd monitoring system issued an alert by comparing the number of keys on each node.
When will this bugfix be backported to release-3.3 / release-3.4? @jingyih
@jingyih I tried to enable the corruption check and found that it is very slow and often times out when there are a lot of keys (for example, 1 million keys). The timeout is unmodifiable. Can the timeout parameter be made configurable?
Unfortunately the timeout is not configurable for now. There should be 5+ seconds for each remote API call that fetches the hash from a peer; is that not enough? Could you try etcdctl endpoint hashkv and see how long it takes?
#10893 (incremental corruption check) sounds good.
Yes, all ideas on making the corruption check in etcd better are welcome.
etcd version is 3.4.3, three nodes; the initial corruption check takes 30 seconds.
14730644:Mar 4 00:46:39 localhost etcd[7346]: 2a111ffbb45ec018 starting initial corruption check with timeout 15s...
14730689:Mar 4 00:47:09 localhost etcd[7346]: 2a111ffbb45ec018 succeeded on initial corruption checking: no corruption
14730691:Mar 4 00:47:09 localhost etcd[7346]: enabled corruption checking with 3m0s interval
The periodic corruption check also produces error logs:
Mar 04 00:47:09 VM-0-105-ubuntu etcd[7346]: 2a111ffbb45ec018 hash-kv error "context deadline exceeded" on peer "https://x.x.x.149:2380" with revision 395483904
Mar 04 00:46:54 VM-0-105-ubuntu etcd[7346]: 2a111ffbb45ec018 hash-kv error "context deadline exceeded" on peer "https://x.x.x.53:2380" with revision 395483904
However, etcdctl endpoint hashkv is very fast:
root@VM-0-105-ubuntu:~/# time ETCDCTL_API=3 etcdctl endpoint hashkv --endpoints https://x.x.x.53:2379 --cluster
https://x.105:2379, 1938650550
https://x.53:2379, 885147713
https://x.149:2379, 1965074925
real 0m0.228s
user 0m0.032s
sys 0m0.004s
@jingyih
I see. The leader failed to connect to the other etcd members in the getPeerHashKVs function; the corruption check itself is not expensive when the cluster has 1 million keys. You have fixed it in PR #11636 (v3.4.4). Most of our clusters are now on 3.3.17, so we will try this feature after upgrading, depending on the actual situation. Thanks.
@tangcong Good to hear. Thanks for sharing the info.
What happened:
Recently, our team (the TencentCloud Kubernetes team) encountered a serious etcd data inconsistency bug. Kubernetes resources such as nodes, pods, services, and deployments were not found when using kubectl to get resources, and the cluster did not work when deploying or updating workloads.
How to troubleshoot it:
The cluster status information is as follows. You can see that node-1, node-2, and node-3 have the same raftIndex, but node-2's revision is different from the others. The number of keys per node is also inconsistent; for example, some keys are on the leader but do not exist on a follower node. After adding a simple debug log, we found that the reason the follower node failed to apply a command is that its auth revision is smaller than the leader's.
Since the follower can receive the leader's commands, we excluded cluster split brain and bugs in the raft algorithm implementation. So why is the follower node's auth revision less than the leader's?
We added debug logging and developed a simple chaos monkey tool to reproduce it. After running for a few days, we successfully reproduced the issue. In our debug log, we can see that the consistentIndex is repeated and some commands are applied again when etcd is restarted:
We found that when executing auth commands, the consistent index is not persisted. When some commands (for example, GrantRolePermission) are applied again, they also increase the auth revision again.
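To make the mechanism concrete, here is a minimal, self-contained toy model (plain Go, not etcd code; all names are made up) of why replaying an already-applied auth entry bumps the auth revision again when the consistent index is not persisted for auth commands:

```go
package main

import "fmt"

// Toy model of the apply loop: entries at or below the persisted
// consistentIndex are skipped; every applied auth mutation bumps authRevision.
type member struct {
	consistentIndex uint64 // what has been persisted to the backend
	authRevision    uint64
}

func (m *member) apply(index uint64, persistIndexForAuth bool) {
	if index <= m.consistentIndex {
		return // entry already applied to the backend, skip on replay
	}
	m.authRevision++ // an auth command such as GrantRolePermission
	if persistIndexForAuth {
		m.consistentIndex = index
	}
}

func main() {
	m := &member{}
	m.apply(5, false) // auth entry at raft index 5, index not persisted (the bug)

	// Restart: the backend still holds the old consistent index (0), because it
	// was never persisted for the auth entry, so the same raft entry is replayed.
	m.consistentIndex = 0
	m.apply(5, false)

	fmt.Println("auth revision after replay:", m.authRevision) // prints 2, not 1
}
```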
How to reproduce it (as minimally and precisely as possible):
Now you can see that the auth revision of this node has increased after restarting etcd, even though we didn't perform any auth operation, while on the other nodes, which were not restarted, the auth revision is unchanged.
After that, if the leader's auth revision is smaller than a follower's auth revision, the follower will fail to apply the command, and there won't be any error message in the etcd log. The nodes will then have inconsistent data and different revisions, and you may find that reading from one node succeeds while reading from another does not, like this:
How to fix it:
We will submit a PR to address this serious bug. It will persist the consistentIndex into the backend store when executing auth commands.
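In terms of the toy model above, the fix corresponds to persisting the index for auth commands as well (a fragment reusing the hypothetical member type from the earlier sketch):

```go
m := &member{}
m.apply(5, true) // auth entry applied and consistentIndex persisted as 5
// After a restart, the reloaded consistentIndex is 5, so the replayed entry
// at index 5 is skipped instead of being applied again.
m.apply(5, true)
fmt.Println("auth revision with the fix:", m.authRevision) // prints 1
```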
Impact:
It is possible to encounter data inconsistency/loss in all etcd3 versions when auth is enabled.
The above description is a little unclear, so let me add the following: whether a write request can be applied successfully depends on which node the client is connected to. It has nothing to do with which node is the leader.
For example, there are three nodes (A, B, C); A's auth revision is 1, B's is 2, and C's is 3.
If a client sends a write request through node A, the request entry carries auth revision 1, so nodes B and C fail to apply the entry.
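A minimal sketch of that revision comparison (plain Go for illustration; the function and error names are assumptions, not etcd's actual identifiers): each member rejects an entry whose auth revision in the request header is lower than its own auth store revision.

```go
package main

import (
	"errors"
	"fmt"
)

var errAuthOldRevision = errors.New("auth revision is old")

// checkAuthRevision models the check described above: reject an entry whose
// auth revision in the request header is lower than this member's auth store
// revision.
func checkAuthRevision(headerRev, storeRev uint64) error {
	if headerRev < storeRev {
		return errAuthOldRevision
	}
	return nil
}

func main() {
	entryRev := uint64(1) // the entry was proposed through node A (auth revision 1)
	for _, n := range []struct {
		name     string
		storeRev uint64
	}{{"A", 1}, {"B", 2}, {"C", 3}} {
		if err := checkAuthRevision(entryRev, n.storeRev); err != nil {
			fmt.Printf("node %s: apply fails: %v\n", n.name, err)
		} else {
			fmt.Printf("node %s: apply succeeds\n", n.name)
		}
	}
}
```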