etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Inconsistent revision and data occurs #13766

Closed liuycsd closed 2 years ago

liuycsd commented 2 years ago

I'm running a 3-member etcd cluster in a testing environment, and a k8s cluster with each kube-apiserver connecting to one etcd server via localhost.

etcd cluster: created and running with v3.5.1
k8s cluster: v1.22

I found a data inconsistency several days ago: some keys exist on nodes x and y but not on node z, and some exist on z but not on the other two, e.g. different pod lists are returned by different servers. Some keys have different values on z than on the others, and can be updated to yet another value via the corresponding etcd endpoint, e.g. the kube-system/kube-controller-manager lease points to different pods on different servers, and both pods can successfully update the lease via their corresponding local kube-apiserver and etcd. Other keys, including some newly created ones, are consistent.

Checking with etcdctl endpoint status, raft_term, leader, raftIndex and raftAppliedIndex are the same, but revision and dbSize are not: the revision on node z is smaller than on the others (~700000 difference) and both revisions keep increasing.
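For anyone who wants to script this comparison, a small Go helper along the following lines (not from the issue; endpoint names are placeholders and TLS setup for client cert auth is omitted) reproduces the etcdctl endpoint status check with the official client:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoints; a real cluster with client cert auth also needs a TLS config.
	endpoints := []string{"https://node-x:2379", "https://node-y:2379", "https://node-z:2379"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			log.Printf("%s: %v", ep, err)
			continue
		}
		// Matching raftIndex/raftAppliedIndex with a diverging revision/dbSize
		// is the symptom described above.
		fmt.Printf("%s revision=%d raftIndex=%d raftAppliedIndex=%d dbSize=%d\n",
			ep, st.Header.Revision, st.RaftIndex, st.RaftAppliedIndex, st.DbSize)
	}
}
```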

Checking with etcd-dump-logs, it seems all 3 nodes are receiving the same raft logs.

I'm not sure how or when this happened, but the nodes may have been under load and may have run out of memory or disk space at times. Checking and comparing the db keys with bbolt keys SNAPSHOT key, and searching the node operating system logs for the revisions near where the difference starts, I found some slow-read log entries mentioning those revisions and dozens of leader elections and switches during those several hours. Besides, the disk (including the etcd data dir) on node z may have been full, and the operating system logs were cleared, so I don't know what happened exactly and I'm not sure whether this is related to the issue or not.

serathius commented 2 years ago

cc @ptabor

ptabor commented 2 years ago

Was this cluster originally created as 3.5.1, or was it subject to migration from 3.4.x or 3.3.x?

serathius commented 2 years ago

Hey @liuycsd, can you provide the etcd configuration you are running? Specifically, it would be useful to know if you use --experimental-corrupt-check-time, as it would allow early detection.

liuycsd commented 2 years ago

Was this cluster originally created as 3.5.1, or was it subject to migration from 3.4.x or 3.3.x?

IIRC, it was originally created as 3.5.1.

liuycsd commented 2 years ago

Hey @liuycsd, can you provide the etcd configuration you are running? Specifically, it would be useful to know if you use --experimental-corrupt-check-time, as it would allow early detection.

I didn't set --experimental-corrupt-check-time.

liuycsd commented 2 years ago

Hey @liuycsd, can you provide the etcd configuration you are running? Specifically, it would be useful to know if you use --experimental-corrupt-check-time, as it would allow early detection.

Checking the running server, the current config is:

https serving, with client and peer cert auth enabled
quota-backend-bytes
heartbeat-interval
election-timeout
auto-compaction-retention

All in environment variables and no command-line arguments.

chaochn47 commented 2 years ago

Hi @liuycsd, thanks for reporting this issue.

But the nodes may have been under load and may have run out of memory or disk space at times.

I found some slow-read log entries mentioning those revisions and dozens of leader elections and switches during those several hours. Besides, the disk (including the etcd data dir) on node z may have been full, and the operating system logs were cleared, so I don't know what happened exactly and I'm not sure whether this is related to the issue or not.

Looks like the data inconsistency triggers are

Can you help us better understand how your cluster is set up?


Kindly suggest:

wilsonwang371 commented 2 years ago

I suggest you try 3.5.2 and see if this problem still happens. There was a data corruption issue fixed recently.

liuycsd commented 2 years ago

Thanks.

That's just a simple environment for testing, with limited resources and no monitoring or backups set up. Someone deployed a heavy system on that k8s cluster and found it quite slow; after logging into the nodes they found they were out of resources, did some manual cleanup, and kubectl get pods showed that it seemed to work. But when I used the cluster several days later I found the inconsistency, and took a snapshot AFTER that. No previous backup exists.

I tried to reproduce it in another 3.5.1 cluster, but it didn't occur. I set a low CPU and disk quota to simulate an overloaded system, so some leader elections were triggered randomly, and at the same time kept running several etcdctl compare/put txns, but the revision inconsistency didn't occur.

I may try v3.5.2 and set up some basic monitoring and backups later to see whether or not it occurs again.

Please use --experimental-corrupt-check-time, as it would allow early detection, as @ptabor and @serathius said.

Are there any suggestions about what duration I should use for corrupt-check-time, and how can I know when it detects any corruption? I didn't find detailed documentation about this.

liuycsd commented 2 years ago

Sorry for the accidental close.

liuycsd commented 2 years ago

Thanks @chaochn47 @wilsonwang371 and everyone else.

After checking the logs again, I found a panic did occur on node y, and another panic occurred on node x two minutes later. From the restart log after the panic, the revision loaded from the db file was near (~10000 later than) where the revision difference starts, per the comparison of the db with bbolt keys. I don't know whether it also panicked on node z, because those logs were cleared. It panicked at kvsToEvents in mvcc.syncWatchers.

serathius commented 2 years ago

Can you share the stacktraces?

liuycsd commented 2 years ago

Can you share the stacktraces?

I don't have access to the cluster right now. I may paste them when it's available later. IIRC, the stacktraces look like #12845.

wilsonwang371 commented 2 years ago

can you check if #13505 is the root cause?

liuycsd commented 2 years ago

can you check if #13505 is the root cause?

How could I verify that?

The stacktraces are similar. On node y:

{"level":"panic","ts":"...","caller":"mvcc/watchable_store.go:414","msg":"failed to unmarshal mvccpb.KeyValue","error":"proto: KeyValue: illegal tag 0 (wire type 7)","stacktrace":"go.etcd.io/etcd/server/v3/mvcc.kvsToEvents\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222"}
panic: failed to unmarshal mvccpb.KeyValue
goroutine 173 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00f971800, 0xc01238bb00, 0x1, 0x1)
    /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000614870, 0x12268a2, 0x23, 0xc01238bb00, 0x1, 0x1)
    /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/mvcc.kvsToEvents(0xc000614870, 0xc0000fc720, 0xc00eb85000, 0x13c, 0x200, 0xc00eb82000, 0x13c, 0x200, 0x12, 0x7fffffffffffffff, ...)
    /tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414 +0x296
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers(0xc0000fc6e0, 0x0)
    /tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359 +0x3e8
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop(0xc0000fc6e0)
    /tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222 +0x1e8
created by go.etcd.io/etcd/server/v3/mvcc.newWatchableStore
    /tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:95 +0x3a7
etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

On node x:

{"level":"panic","ts":"...","caller":"mvcc/watchable_store.go:414","msg":"failed to unmarshal mvccpb.KeyValue","error":"proto: KeyValue: illegal tag 0 (wire type 0)","stacktrace":"go.etcd.io/etcd/server/v3/mvcc.kvsToEvents\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222"}
panic: failed to unmarshal mvccpb.KeyValue
goroutine 209 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00015d5c0, 0xc00841ffc0, 0x1, 0x1)
    /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0005e2d20, 0x12268a2, 0x23, 0xc00841ffc0, 0x1, 0x1)
    /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/mvcc.kvsToEvents(0xc0005e2d20, 0xc00042c360, 0xc0070cf000, 0x14c, 0x200, 0xc0070cc000, 0x14c, 0x200, 0x12, 0x7fffffffffffffff, ...)
    /tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414 +0x296
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers(0xc00042c320, 0x0)
    /tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359 +0x3e8
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop(0xc00042c320)
    /tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222 +0x1e8
created by go.etcd.io/etcd/server/v3/mvcc.newWatchableStore
    /tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:95 +0x3a7
etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

wilsonwang371 commented 2 years ago

You can use 3.5.2 and see if you can reproduce this issue. From the stack trace, it seems like #13505 fixed this issue and it is available in 3.5.2. @liuycsd

wilsonwang371 commented 2 years ago

can you check if #13505 is the root cause?

How could I verify that? The stacktraces are similar.

Do you have any update on this?

serathius commented 2 years ago

Please use --experimental-corrupt-check-time, as it would allow early detection, as @ptabor and @serathius said.

Are there any suggestions about what duration I should use for corrupt-check-time, and how can I know when it detects any corruption? I didn't find detailed documentation about this.

There is some performance overhead expected from running it; I would recommend running it every couple of hours.
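For reference, a minimal way to enable both checks (the 4h interval is only an illustrative reading of "every couple of hours"), either as flags:

--experimental-initial-corrupt-check
--experimental-corrupt-check-time 4h

or as the equivalent environment variables (the form used later in this thread):

ETCD_EXPERIMENTAL_INITIAL_CORRUPT_CHECK=true
ETCD_EXPERIMENTAL_CORRUPT_CHECK_TIME=4h

When the periodic check detects a hash mismatch, it is expected to raise a CORRUPT alarm, so etcdctl alarm list is the place to watch for detections.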

serathius commented 2 years ago

I'm still investigating the data corruption issues that were reported for v3.5 and were not reproduced. Based on this issue I have been looking at whether https://github.com/etcd-io/etcd/pull/13505 could be the cause of them. I would like to confirm that we can reproduce data corruption on v3.5.0 by triggering the case from https://github.com/etcd-io/etcd/pull/13505, to then verify that v3.5.2 has fixed the data corruption.

However, based on my discussion with @ptabor, we don't see how a mis-referenced bbolt page could cause the corruption. It could still be caused by a generic segfault, however that makes it even harder to reproduce.

As discussed at the last community meeting, this problem was found by Bytedance. @wilsonwang371 could you shed some light on how this issue was detected and whether it was accompanied by data corruption?

wilsonwang371 commented 2 years ago

As discussed at the last community meeting, this problem was found by Bytedance. @wilsonwang371 could you shed some light on how this issue was detected and whether it was accompanied by data corruption?

We found data inconsistency in our internal etcd cluster. This was seen in 3.4, but it has been happening more often since we started using 3.5, which has better performance.

When this happens, a solution is to take down the errant node for some time and bring it back again later. This forces a data resync inside the cluster and the problem goes away.

However, when this happens, no alert or special message is given, which makes it hard for the SRE team to detect. Luckily, they were able to dump the log, and I found there was a panic within the time range where we found the issue. Other than that, no other error messages were discovered. The content of the panic is very similar to the one described in this thread: somewhere in kvsToEvents(), a panic was triggered.

After some investigation, we believed the panic came from an error either in reading boltdb or in reading memory. From my point of view, it should be data corruption inside boltdb.

Later, in #13505, we found that boltdb was accessed without lock protection. Without the proper lock held, boltdb data can be corrupted, and that can definitely trigger data corruption in etcd. That's why I believe this patch should fix this data corruption issue.

PaulFurtado commented 2 years ago

We encountered this corruption in one kube etcd cluster running etcd 3.5.1 yesterday morning and then upgraded all of our clusters to 3.5.2 today because we were going to be doing a rolling-restart of the kube api servers, which causes dangerously high memory spikes on etcd.

We faced etcd OOMs in two of our clusters when performing this rolling update. Both of those etcd clusters were running 3.5.2 and both experienced this corruption. The kube symptom of this is that querying various objects returned different results, probably depending on which etcd server was hit. This resulted in errors when listing objects like:

Error from server: etcdserver: mvcc: required revision is a future revision

and cases where querying the same object multiple times in a row alternated between a recent state and one from 10+ minutes before.

On the etcd side, the most obvious symptom was the wildly different database sizes between members.

So, I can confirm that corruption problems definitely still exist in 3.5.2, and that the experimental corruption check flag may need some more work to detect it.

The servers in the first two cases were i3en.2xlarge ec2 instances (8 CPU, 64GiB memory, high iops nvme SSDs). The servers in the final case were i3en.6xlarge (24CPU, 192GiB memory). For an idea of load, every member in that i3en.6xlarge cluster pretty consistently does 8-12gbps of etcd traffic, and the 2xlarge clusters around 3-5gbps. It seems like our kube workload reproduces it after just a couple of etcd OOMs, so a reproducer could possibly just be issuing kill -9 to the etcd members repeatedly while performing a stress test until the db sizes vary widely, although the system behavior under memory pressure could also be relevant (heavily thrashing io causing the server to come to a crawl and leader elections occurring just prior to OOM).

Please let me know if any additional information would be valuable for you here.

serathius commented 2 years ago

Thanks @PaulFurtado for sharing, this is hugely informative. There are 3 important pieces of information:

I will look into reproducing this myself. A couple of questions about the corruption check: it is possible that the corruption check ran but didn't report an issue due to high WAL latency of followers. To get more details about why it's not working, can you:

It would also be good to know if you have seen the same problem with v3.4.X. If you have an environment for reproducing the issue, it would be good to confirm that the problem only affects v3.5. It would also be useful to rule out that the problem is with any optional v3.5 feature; can you provide the etcd configuration you use?

PaulFurtado commented 2 years ago

@serathius sorry, I had mentioned the --experimental-corrupt-check-time flag, but I actually meant the --experimental-initial-corrupt-check flag. When we believed the cluster to be corrupt, we enabled the flag, and this is what was logged upon restart:

{"level":"info","ts":"2022-03-23T21:49:02.865Z","caller":"etcdserver/corrupt.go:46","msg":"starting initial corruption check","local-member-id":"d9ccd33e39ffddb","timeout":"7s"}
{"level":"info","ts":"2022-03-23T21:49:06.242Z","caller":"etcdserver/corrupt.go:116","msg":"initial corruption checking passed; no corruption","local-member-id":"d9ccd33e39ffddb"}

We have never experienced this issue with 3.4.x (or if we did, it must have been extremely subtle and never caused us pain). Under 3.4.x we experienced OOM kills constantly and faced so many outages due to OOMs that I would expect we definitely would have noticed it on 3.4.x. We have been running 3.5.x for a bit over a month now, and the first time we hit an OOM condition we immediately hit this corruption, so it definitely seems like a 3.5 issue.

These are the exact flags from one of the i3en.6xlarge servers:

--logger zap
--log-outputs stderr
--log-level info
--listen-peer-urls https://0.0.0.0:2380
--listen-client-urls https://0.0.0.0:2379
--advertise-client-urls https://172.16.60.166:2379
--data-dir /usr/share/hubspot/etcd/data
--peer-trusted-ca-file /etc/pki/tls/certs/etcd-peer-ca.crt
--peer-cert-file /etc/pki/tls/certs/etcd-peer.crt
--peer-key-file /etc/pki/tls/private/etcd-peer.key
--peer-client-cert-auth
--trusted-ca-file /etc/pki/tls/certs/etcd-client-ca.crt
--client-cert-auth
--cert-file /etc/pki/tls/certs/etcd-client.crt
--key-file /etc/pki/tls/private/etcd-client.key
--auto-compaction-retention 1h
--heartbeat-interval 100
--election-timeout 15000
--quota-backend-bytes 17179869184
--snapshot-count 100000
--max-request-bytes 15728640
--enable-v2
--experimental-peer-skip-client-san-verification

I also doubt that it's relevant, but we run with these env vars as well:

ETCD_NAME=na1-brave-street
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_INITIAL_CLUSTER=default=http://localhost:2380
ETCD_INITIAL_CLUSTER_TOKEN=kube-iad02
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://172.16.60.166:2380
GODEBUG=x509ignoreCN=0

I will also mention that we are running kubernetes 1.21 and we are aware that it is intended to talk to an etcd 3.4.x cluster. However, we upgraded to etcd 3.5 because it is impossible to run etcd with stable latency and memory usage on 3.4.x with kube clusters of our size. Is it possible that something about the 3.4.x client could be corrupting the 3.5.x server?

chaochn47 commented 2 years ago

I am not able to reproduce it with a 3-node distributed etcd cluster backing a kubernetes 1.22 cluster.

The reproduction setup generates load from a standby node using benchmark put, range and txn-mixed. The 3 nodes keep getting oom_reaped and recover from the local disk, and the revision is monotonically increasing across the cluster. The revision divergence is minimal and slow nodes catch up eventually.

@PaulFurtado

Thanks!

On my end, I will try to rerun the reproduction process with the etcd cluster upgrading/downgrading between 3.5 and 3.4.

serathius commented 2 years ago

I think I reproduced the issue; it was surprisingly easy once I knew where to look. Based on @PaulFurtado's comment I looked into simulating a highly stressed etcd and sending a SIGKILL signal to members one by one. When looking into our functional tests I found it very strange that we don't already have such a test. We have tests with SIGTERM_* and SIGQUIT_AND_REMOVE_FOLLOWER, however we don't test whether the database is correctly restored after an unrecoverable error.

I have added new tests (SIGKILL_FOLLOWER, SIGKILL_LEADER) and increased the stress-qps. This was enough to cause data inconsistency. As functional tests run with --experimental-initial-corrupt-check, the killed member fails to join with the message checkInitialHashKV failed. This doesn't answer the question of how, in @PaulFurtado's case, this check was not triggered, but it should be enough to show there is a problem.

To make results repeatable I have modified the functional tests to inject the failure repeatedly for some time. I managed to get a 100% chance of reproduction for both test scenarios with 8000 qps within 1 minute of running. The issue only seems to happen in higher-qps scenarios, with a lower chance of reproduction at 4000 qps and no reproductions at 2000 qps.

With that, I looked into using the same method to test the v3.4.18 release. I didn't manage to get any corruption even when running for 10 minutes at 8000 qps. I didn't test with higher qps as this is the limit of my workstation, however this should be enough to confirm that the issue is only on v3.5.X.

I'm sharing the code I used for reproduction here: https://github.com/etcd-io/etcd/pull/13838. I will be looking at root-causing the data inconsistency first, and later redesigning our functional tests, as they do not seem to be fulfilling their function.

PaulFurtado commented 2 years ago

@serathius great job, thanks!

As for why the corruption check may not have worked for us: when we believed that etcd was corrupt, we restarted the processes, defragmented, then restarted again and tried turning the corruption check on. I'm wondering if the defragment effectively took the corrupt data and committed it as good data?

lavalamp commented 2 years ago

It is extremely bad that etcd can get into this state and then continue. I have written about one way to fix this in #13839.

wilsonwang371 commented 2 years ago

I think I reproduced the issue; it was surprisingly easy once I knew where to look. [...]

To locate the patch that introduced this issue, we can use binary search. This can take some time.

wilsonwang371 commented 2 years ago

BTW, there could be several different factors that can trigger this data inconsistency issue; they are not necessarily the same root cause. With 3.5.2, I didn't get complaints from our internal SRE team about data inconsistency anymore. So the one we fixed in 3.5.2 might be only one of the data inconsistency issues.

serathius commented 2 years ago

git bisect has led me to https://github.com/etcd-io/etcd/commit/50051675f9740a4561e13bc5f00a89982b5202ad being the root cause. It's not 100% certain, as the reproduction has some flakiness. PR with the commit: https://github.com/etcd-io/etcd/pull/12855

serathius commented 2 years ago

Sent https://github.com/etcd-io/etcd/pull/13840 with one idea to fix the corruption.

ahrtr commented 2 years ago

When a PUT starts, both the transaction and the store are locked, see kvstore_txn.go#L134-L136; and when it finishes, both of them are unlocked, see kvstore_txn.go#L175-L179.

But when etcd calls back the interface method OnPreCommitUnsafe in the periodic commit, which should follow the same workflow as a PUT, it only locks the transaction (the store isn't locked at all), see batch_tx.go#L294. This could be the root cause.
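A condensed sketch of the asymmetry being described, with simplified stand-in types rather than the actual etcd source:

```go
package mvccsketch

import "sync"

// batchTx stands in for the backend batch transaction.
type batchTx struct {
	sync.Mutex // transaction-level lock
	// buffered writes, underlying bbolt tx, ...
}

// store stands in for the mvcc store that owns the batch transaction.
type store struct {
	mu sync.RWMutex // store-level lock
	tx *batchTx
}

// PUT path (roughly kvstore_txn.go): both the store lock and the
// transaction lock are held for the duration of the write.
func (s *store) put(write func(*batchTx)) {
	s.mu.RLock()
	s.tx.Lock()
	write(s.tx)
	s.tx.Unlock()
	s.mu.RUnlock()
}

// Periodic commit path (roughly batch_tx.go): only the transaction lock is
// taken, the store lock is not — the mismatch pointed out above.
func (s *store) periodicCommit(onPreCommitUnsafe func()) {
	s.tx.Lock()
	onPreCommitUnsafe() // e.g. persisting consistent_index and term
	// ... commit the backend transaction ...
	s.tx.Unlock()
}
```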

liuycsd commented 2 years ago

can you check if #13505 is the root cause?

How could I verify that? The stacktraces are similar.

do you have any update on this ?

Thanks a lot for reproducing and checking for the root cause.

Sorry for the late response. I didn't see the inconsistency on that cluster after upgrading: I restored all 3 member nodes with a snapshot backup from one member, upgraded them from v3.5.1 to v3.5.2, and enabled an hourly --experimental-corrupt-check-time about one week ago. Yesterday I checked it again and didn't see the revision inconsistency, and etcdctl alarm list reported no alarms, although that small cluster might not have a high write qps to etcd or kube-apiserver. I may try to reproduce it again when I have some free time later.

ahrtr commented 2 years ago

Another potential issue is that OnPreCommitUnsafe saves both consistent_index and term sequentially, see cindex.go#L70-L81, but another goroutine may set the consistent_index concurrently at the same time. Accordingly, the persisted consistent_index may not be correct in scenarios with high I/O and frequent leader changes.

If etcd crashes for whatever reason immediately after the incorrect consistent_index is persisted, then when it starts again, it loads the incorrect values. Eventually the etcdserver ends up in a bad state.

Let's work through an imaginary example:

  1. The etcdserver is in the middle of an apply workflow, and it has just set a new consistent_index value; assume it's 51 now, and the old value was 50.
  2. The periodic commit is triggered and starts to call OnPreCommitUnsafe, so the consistent_index (51 for now) is persisted into the db; the periodic commit then finishes.
  3. etcd crashes for whatever reason, so the apply workflow doesn't finish.
  4. When the etcd server starts again, it loads the incorrect consistent_index value (51), which should be 50.
  5. When etcd starts to apply a new entry, it may not apply it to the v3 store because e.Index <= index. Eventually the data is inconsistent between members.
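A minimal sketch of the guard referred to in step 5, with illustrative types and names rather than etcd's actual apply code: once a too-large consistent_index has been persisted and reloaded, the replayed entry is silently skipped.

```go
package applysketch

// entry is a stand-in for a raft log entry.
type entry struct {
	Index uint64
	Data  []byte
}

// applyEntry sketches the consistent-index guard: entries at or below the
// persisted consistent_index are assumed to already be reflected in the v3
// store and are not applied to the backend again.
func applyEntry(e entry, consistentIndex uint64, applyV3 func(entry)) {
	if e.Index <= consistentIndex {
		// In the example above, the crash left consistent_index at 51 even
		// though the apply of entry 51 never finished, so entry 51 is
		// skipped here on replay and this member diverges from its peers.
		return
	}
	applyV3(e)
}
```
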
ptabor commented 2 years ago

Thank you @ahrtr: I think that your story is valid. The code was written with the assumption that changes to the backend are committed in alignment with the apply() methods and that a commit does not happen in the middle of an apply. I.e. after setConsistencyIndex is called, the following Unlock() will be executed as the final part of finishing the transaction in the same apply(). The place where it might break is if we have multiple transactions within the same apply(), so we might update the consistency index between them:

Thinking about whether we should explicitly have 2 types of applies (with code-level verification):

  1. Reentrant: We can apply the same transaction multiple times on the backend without any consequences. For these, being updated without bumping the consistency index is not a problem. Such applies might be split into multiple transactions. Only the last one (or, more likely, an additional artificial transaction) can update the consistency index. If we interrupt in the middle, it's OK.
  2. Fully atomic: Such an apply needs to have exactly one transaction / Unlock.

liuycsd commented 2 years ago

I started the cluster with the broken data again, with ETCD_EXPERIMENTAL_INITIAL_CORRUPT_CHECK=true and ETCD_EXPERIMENTAL_CORRUPT_CHECK_TIME=10m, to see whether it could detect the corruption. But it didn't: it just reports mvcc: required revision is a future revision in the logs, and etcdctl alarm list gives an empty result.

grep corrupt.go on node x

{"level":"info","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:46","msg":"starting initial corruption check","local-member-id":"id-x-hex","timeout":"11s"}
{"level":"warn","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:289","msg":"failed hash kv request","local-member-id":"id-x-hex","requested-revision":5x8xxx89,"remote-peer-endpoint":"https://node-y:2380","error":"etcdserver: mvcc: required revision is a future revision"}
{"level":"warn","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:289","msg":"failed hash kv request","local-member-id":"id-x-hex","requested-revision":5x8xxx89,"remote-peer-endpoint":"https://node-z:2380","error":"etcdserver: mvcc: required revision is a future revision"}
{"level":"info","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:116","msg":"initial corruption checking passed; no corruption","local-member-id":"id-x-hex"}
{"level":"info","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:130","msg":"enabled corruption checking","local-member-id":"id-x-hex","interval":"10m0s"}
{"level":"warn","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:377","msg":"failed to get hashKV","requested-revision":5x4yyyy2,"error":"mvcc: required revision has been compacted"}
{"level":"warn","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:377","msg":"failed to get hashKV","requested-revision":5x4yyyy2,"error":"mvcc: required revision has been compacted"}
{"level":"warn","ts":"xxxx-xx-xxTxx:1x:xx.xxxZ","caller":"etcdserver/corrupt.go:289","msg":"failed hash kv request","local-member-id":"id-x-hex","requested-revision":5x8xxx90,"remote-peer-endpoint":"https://node-y:2380","error":"etcdserver: mvcc: required revision is a future revision"}
{"level":"warn","ts":"xxxx-xx-xxTxx:1x:xx.xxxZ","caller":"etcdserver/corrupt.go:289","msg":"failed hash kv request","local-member-id":"id-x-hex","requested-revision":5x8xxx90,"remote-peer-endpoint":"https://node-z:2380","error":"etcdserver: mvcc: required revision is a future revision"}
{"level":"info","ts":"xxxx-xx-xxTxx:1x:xx.xxxZ","caller":"etcdserver/corrupt.go:244","msg":"finished peer corruption check","number-of-peers-checked":0}

grep corrupt.go on node y

{"level":"info","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:46","msg":"starting initial corruption check","local-member-id":"id-y-hex","timeout":"11s"}
{"level":"warn","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:289","msg":"failed hash kv request","local-member-id":"id-y-hex","requested-revision":5x4yyyy2,"remote-peer-endpoint":"https://node-z:2380","error":"Get \"https://node-z:2380/members/hashkv\": net/http: request canceled"}
{"level":"warn","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:289","msg":"failed hash kv request","local-member-id":"id-y-hex","requested-revision":5x4yyyy2,"remote-peer-endpoint":"https://node-x:2380","error":"Get \"https://node-x:2380/members/hashkv\": context deadline exceeded"}
{"level":"info","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:116","msg":"initial corruption checking passed; no corruption","local-member-id":"id-y-hex"}
{"level":"info","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:130","msg":"enabled corruption checking","local-member-id":"id-y-hex","interval":"10m0s"}
{"level":"warn","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:377","msg":"failed to get hashKV","requested-revision":5x8xxx89,"error":"mvcc: required revision is a future revision"}
{"level":"warn","ts":"xxxx-xx-xxTxx:1x:xx.xxxZ","caller":"etcdserver/corrupt.go:377","msg":"failed to get hashKV","requested-revision":5x8xxx90,"error":"mvcc: required revision is a future revision"}

grep corrupt.go on node z

{"level":"info","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:46","msg":"starting initial corruption check","local-member-id":"id-z-hex","timeout":"11s"}
{"level":"warn","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:289","msg":"failed hash kv request","local-member-id":"id-z-hex","requested-revision":5x4yyyy2,"remote-peer-endpoint":"https://node-y:2380","error":"Get \"https://node-y:2380/members/hashkv\": net/http: request canceled"}
{"level":"warn","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:289","msg":"failed hash kv request","local-member-id":"id-z-hex","requested-revision":5x4yyyy2,"remote-peer-endpoint":"https://node-x:2380","error":"Get \"https://node-x:2380/members/hashkv\": context deadline exceeded"}
{"level":"info","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:116","msg":"initial corruption checking passed; no corruption","local-member-id":"id-z-hex"}
{"level":"info","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:130","msg":"enabled corruption checking","local-member-id":"id-z-hex","interval":"10m0s"}
{"level":"warn","ts":"xxxx-xx-xxTxx:0x:xx.xxxZ","caller":"etcdserver/corrupt.go:377","msg":"failed to get hashKV","requested-revision":5x8xxx89,"error":"mvcc: required revision is a future revision"}
{"level":"warn","ts":"xxxx-xx-xxTxx:1x:xx.xxxZ","caller":"etcdserver/corrupt.go:377","msg":"failed to get hashKV","requested-revision":5x8xxx90,"error":"mvcc: required revision is a future revision"}

endpoint status --cluster

[
{"Endpoint":"https://node-z:2379","Status":{"header":{"cluster_id":id-a,"member_id":id-z-dec,"revision":5x4yyyy4,"raft_term":xx},"version":"3.5.2","dbSize":3yyyyyyyy, "leader":id-x-dec,"raftIndex":6xxxxxxx,"raftTerm":xx,"raftAppliedIndex":6xxxxxxx,"dbSizeInUse":8yyyzzzz}},
{"Endpoint":"https://node-y:2379","Status":{"header":{"cluster_id":id-a,"member_id":id-y-dec,"revision":5x4yyyy4,"raft_term":xx},"version":"3.5.2","dbSize":3yyyyyyyy, "leader":id-x-dec,"raftIndex":6xxxxxxx,"raftTerm":xx,"raftAppliedIndex":6xxxxxxx,"dbSizeInUse":8yyyyyyy}},
{"Endpoint":"https://node-x:2379","Status":{"header":{"cluster_id":id-a,"member_id":id-x-dec,"revision":5x8xxx90,"raft_term":xx},"version":"3.5.2","dbSize":17xxxxxxxx,"leader":id-x-dec,"raftIndex":6xxxxxxx,"raftTerm":xx,"raftAppliedIndex":6xxxxxxx,"dbSizeInUse":6xxxxxxx}}
]

serathius commented 2 years ago

This shows a flaw in the current implementation of the corruption check: we cannot verify hashes if members' revisions differ too much. So if the corruption happened long before the check was enabled, we are not able to detect it.

liuycsd commented 2 years ago

This shows a flaw in the current implementation of the corruption check: we cannot verify hashes if members' revisions differ too much. So if the corruption happened long before the check was enabled, we are not able to detect it.

Thanks for your explanation.

ptabor commented 2 years ago

liuycsd@ thank you - that's a valid hole.

At the same time, we cannot 'blindly' assume mvcc: required revision is a future revision means that there is an inconsistency. It might happen that the remote server has not yet applied all transactions from the RAFT/WAL log and is lagging. In the normal read path etcd deals with this by linearizable reads, i.e. before sending a response to any request, the local instance checks whether all RAFT transactions queued at the time of the read were received and applied, by injecting a raft read barrier. It seems like hashKVHandler (server/etcdserver/corrupt.go:351) could be wrapped in similar logic:
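A rough illustration of that suggestion, using small stand-in interfaces rather than ptabor's actual proposal or the real etcdserver API (the names below are assumptions):

```go
package hashsketch

import "context"

// readBarrier stands in for the mechanism behind etcd's linearizable reads:
// it blocks until every entry committed before the call has been applied
// locally.
type readBarrier interface {
	LinearizableReadNotify(ctx context.Context) error
}

// hashKV stands in for the mvcc hash-by-revision operation.
type hashKV interface {
	HashByRev(rev int64) (hash uint32, err error)
}

// serveHashKV applies the suggested wrapping: wait for the apply loop to
// catch up before hashing, so that a member that is merely lagging does not
// answer "required revision is a future revision" and silently drop out of
// the peer corruption check.
func serveHashKV(ctx context.Context, rb readBarrier, kv hashKV, rev int64) (uint32, error) {
	if err := rb.LinearizableReadNotify(ctx); err != nil {
		return 0, err
	}
	return kv.HashByRev(rev)
}
```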

shalinmangar commented 2 years ago

Reading the explanation at https://github.com/etcd-io/etcd/issues/13766#issuecomment-1079843692 and https://github.com/etcd-io/etcd/issues/13766#issuecomment-1080632932, it sounds like this bug is triggered only when:

  1. A commit happens in the middle of an apply operation
  2. The apply operation had multiple transactions
  3. The apply operation could not complete because the etcd process crashed or it was forcefully killed

In such cases, parts of the batch transaction may become visible upon restart, thus violating the atomicity constraint. Can someone confirm whether my understanding is correct/complete?

5. When etcd starts to apply a new entry, it may not apply it to the v3 store because e.Index <= index. Eventually the data is inconsistent between members.

I am trying to wrap my head around this comment. In the given example, I can see how atomicity was violated after restart (because consistent index is 51 instead of 50) but can someone explain why new entries (presumably coming from new leader?) may be older than the consistent_index (51)?

Coming back to the issue of commits happening in the middle of an apply, I can see the following triggering a commit:

  1. If BackendBatchInterval or BackendBatchLimit is configured (they are unlimited by default)
  2. Maintenance requests such as: snapshot, auth enabled/disabled, compaction
  3. Auto-compaction

Are there more cases?

Also, are there steps that we can take to reduce the likelihood of running into this issue beyond serathius's message on the mailing list? We run 3.5.1 if that helps.

shalinmangar commented 2 years ago

This shows a flaw in current implementation of corruption check, we cannot verify hashes if members revisions differ to much. So if the corruption happen long before check was enabled we are not able to detect the corruption.

Does anyone have ideas on how to build a better corruption check? Where should I start if I am interested in spending some time on this?

lavalamp commented 2 years ago

better corruption check

Yes, see https://github.com/etcd-io/etcd/issues/13839

ahrtr commented 2 years ago

better corruption check
Yes, see https://github.com/etcd-io/etcd/issues/13839

Agreed, the current corruption check actually has a flaw; please read the comment.

serathius commented 2 years ago

liuycsd@ thank you - that's a valid hole.

@ptabor should we plan some improvements to inconsistency checking for the v3.5.3 release? I would want to make the check more reliable; however, I'm not sure if there is enough room for improvement such that we can backport it to v3.5.

ahrtr commented 2 years ago

FYI. in case anyone is interested in a simple summary