etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

panic: unexpected error during txn #14110

Closed · aphyr closed this issue 2 years ago

aphyr commented 2 years ago

What happened?

In a Jepsen test run of five etcd 3.5.3 nodes, with process pauses (sending processes SIGSTOP and SIGCONT), compaction (performed via the admin API), and defragmentation (via etcdctl), one etcd process crashed with the following error message:

{"level":"panic","ts":"2022-06-13T14:11:27.870Z","caller":"etcdserver/apply.go:638","msg":"unexpected error during txn","error":"context canceled","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638\ngo.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617\ngithub.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"}
{"level":"info","ts":"2022-06-13T14:11:27.875Z","caller":"traceutil/trace.go:171","msg":"trace[430686560] transaction","detail":"{read_only:true; }","duration":"5.770188852s","start":"2022-06-13T14:11:22.105Z","end":"2022-06-13T14:11:27.875Z","steps":["trace[430686560] 'agreement among raft nodes before linearized reading'  (duration: 5.763577986s)"],"step_count":1}
panic: unexpected error during txn

goroutine 10896 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00011ec00, 0xc0035542c0, 0x1, 0x1)
    /go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc00048c280, 0x1220287, 0x1b, 0xc0035542c0, 0x1, 0x1)
    /go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn(0xc00000d1e8, 0x13a1cb8, 0xc002fe7cb0, 0x13ab818, 0xc003cfe550, 0xc003d07030, 0xc003b2ab00, 0x1, 0x1, 0xc003cec730, ...)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638 +0x930
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn(0xc00000d1e8, 0x13a1cb8, 0xc002fe7cb0, 0xc003d07030, 0x2, 0x48, 0x493c26, 0x2)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477 +0x38a
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3()
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181 +0x70
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize(0xc000442a00, 0x13a1cb8, 0xc002fe7cb0, 0xc0030314a8, 0xc0030314d8, 0xc003e1c500, 0x0)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659 +0xf8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn(0xc000442a00, 0x13a1cb8, 0xc002fe7cb0, 0xc003d07030, 0x0, 0x0, 0x0)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182 +0x40e
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn(0xc000324fc0, 0x13a1cb8, 0xc002fe7b60, 0xc003d07030, 0xc003d07030, 0x0, 0x0)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95 +0xfc
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn(0xc000325000, 0x13a1cb8, 0xc002fe7b60, 0xc003d07030, 0xc000325000, 0x13a8a68, 0xc0002b7e60)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70 +0xc2
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0x14, 0xc003b89d60, 0x2, 0xc0000f2818)
    /go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617 +0x89
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a718a8, 0xc003d746e0, 0x0, 0x0, 0x0)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0x1ab2d20, 0xc000437200, 0x8bda0c, 0x13881c0)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a7bae0, 0x493c26, 0x62a7458a, 0x6450608, 0x3bbfb531619)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71 +0x15c
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0x0, 0x203000, 0x13b1c70, 0xc003e10710)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a7bb00, 0x0, 0x0, 0x0, 0x0)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78 +0xd9
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc00021c800, 0x0, 0xc0000f2b30, 0x40e318)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a718a8, 0xc0000f2ba0, 0x573fa6, 0x1158b40, 0xc002fe7b60)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xd7
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler(0x10fa540, 0xc000325000, 0x13a1cb8, 0xc002fe7b60, 0xc003e18000, 0xc0003098c0, 0x13a1cb8, 0xc002fe7b60, 0xc003a9f640, 0x9)
    /go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004e7340, 0x13ae2d8, 0xc00379ec00, 0xc003e17200, 0xc000309a10, 0x1a6e268, 0x0, 0x0, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc0004e7340, 0x13ae2d8, 0xc00379ec00, 0xc003e17200, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc002a90770, 0xc0004e7340, 0x13ae2d8, 0xc00379ec00, 0xc003e17200)
    /go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:932 +0x1fd

This is the first time this has cropped up, so I'm not sure whether pauses, compaction, and defrag are all necessary to trigger the bug. The only nemesis operations on this particular node before the crash appear to be pause & resume, but I imagine it's possible that other nodes performing compaction or defragmentation caused them to exchange (or fail to exchange!) messages with this node in a way that made it crash.

Here are the full logs and data files from the cluster: 20220613T140950.000Z.zip.

What did you expect to happen?

I expected etcd not to crash.

How can we reproduce it (as minimally and precisely as possible)?

Check out jepsen-etcd 84d7d54698c387ed467dd5dfb8ca4bebc2ee46d5, and with a five-node cluster, run:

lein run test-all --username admin --nodes-file /home/admin/nodes --concurrency 4n --rate 500 --time-limit 300 --workload append --nemesis pause,admin --test-count 20

I'm not sure how long this will take to reproduce yet--still collecting evidence.

Anything else we need to know?

No response

Etcd version (please run commands below)

```console
$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5
```

Etcd configuration (command line flags or environment variables)

etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380/ --listen-client-urls http://192.168.122.101:2379/ --advertise-client-urls http://192.168.122.101:2379/ --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380/ --initial-cluster n1=http://192.168.122.101:2380,n2=http://192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380 --snapshot-count 100 --experimental-initial-corrupt-check

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

These node IDs aren't going to match, because I collected them from a different test run, but it should give you a general idea of the cluster topology:

```console
admin@ip-172-31-9-53:/opt/etcd$ ./etcdctl --endpoints http://172.31.9.53:2379 member list -w table
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |                    NAME                    |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
| 6f0318b69646b10a | started | ec2-54-234-45-183.compute-1.amazonaws.com  | http://172.31.13.117:2380 | http://172.31.13.117:2379 |      false |
| 851bb95a115a0651 | started | ec2-54-90-123-74.compute-1.amazonaws.com   | http://172.31.9.53:2380   | http://172.31.9.53:2379   |      false |
| 8ce655fee64e8888 | started | ec2-3-88-142-40.compute-1.amazonaws.com    | http://172.31.7.132:2380  | http://172.31.7.132:2379  |      false |
| 9a228ee3e4899e1e | started | ec2-54-167-119-185.compute-1.amazonaws.com | http://172.31.9.60:2380   | http://172.31.9.60:2379   |      false |
| dfed32f696db96d2 | started | ec2-54-204-125-79.compute-1.amazonaws.com  | http://172.31.12.135:2380 | http://172.31.12.135:2379 |      false |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
admin@ip-172-31-9-53:/opt/etcd$
{"level":"warn","ts":"2022-06-13T14:48:00.119Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004fa1c0/n1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://n4:2379 (context deadline exceeded)
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://n1:2379 | 851bb95a115a0651 |   3.5.3 |  164 kB |     false |      false |         2 |       2302 |               2299 |        |
| http://n2:2379 | 6f0318b69646b10a |   3.5.3 |  164 kB |     false |      false |         3 |       2306 |               2306 |        |
| http://n5:2379 | 8ce655fee64e8888 |   3.5.3 |  168 kB |     false |      false |         3 |       4966 |               4966 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
{"level":"warn","ts":"2022-06-13T14:47:38.147Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002fee00/n1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://n4:2379 (cont
```

Note that since these tests involve process pauses, it's tough to reach in and grab a coherent cluster view--pretty much every attempt I made involved timeouts.

Relevant log output

No response

aphyr commented 2 years ago

I'm not sure if this is precisely the same bug or a different one, but I've managed to reproduce a crash with a slightly different error message purely from process pauses--no compaction or defrag operations involved. Here are the full logs.

{"level":"panic","ts":"2022-06-13T17:45:54.601Z","caller":"etcdserver/apply.go:638","msg":"unexpected error during txn","error":"context canceled","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638\ngo.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617\ngithub.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"}
{"level":"warn","ts":"2022-06-13T17:45:54.603Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"6.263437345s","expected-duration":"100ms","prefix":"read-only txn ","request":"compare:<> success:<request_range:<key:\"2783\" >  request_range:<key:\"2785\" > > failure:<>","response":""}
{"level":"info","ts":"2022-06-13T17:45:54.603Z","caller":"traceutil/trace.go:171","msg":"trace[2082798263] transaction","detail":"{read_only:true; }","duration":"6.26467016s","start":"2022-06-13T17:45:48.338Z","end":"2022-06-13T17:45:54.603Z","steps":["trace[2082798263] 'compare'  (duration: 6.261489853s)"],"step_count":1}
panic: unexpected error during txn

goroutine 32669 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0030dda40, 0xc0031dbd40, 0x1, 0x1)
    /go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc00040c3c0, 0x1220287, 0x1b, 0xc0031dbd40, 0x1, 0x1)
    /go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn(0xc0003dc9d8, 0x13a1cb8, 0xc003c4e540, 0x13ab818, 0xc0031be650, 0xc002dc5f80, 0xc0028095f0, 0x1, 0x1, 0xc0032e0f50, ...)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638 +0x930
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn(0xc0003dc9d8, 0x13a1cb8, 0xc003c4e540, 0xc002dc5f80, 0x2, 0x48, 0x493c26, 0x2)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477 +0x38a
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3()
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181 +0x70
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize(0xc000464300, 0x13a1cb8, 0xc003c4e540, 0xc0003094a8, 0xc0003094d8, 0xc0031f2d70, 0x0)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659 +0xf8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn(0xc000464300, 0x13a1cb8, 0xc003c4e540, 0xc002dc5f80, 0x0, 0x0, 0x0)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182 +0x40e
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn(0xc0003ef300, 0x13a1cb8, 0xc003c4e3f0, 0xc002dc5f80, 0xc002dc5f80, 0x0, 0x0)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95 +0xfc
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn(0xc0003ef400, 0x13a1cb8, 0xc003c4e3f0, 0xc002dc5f80, 0xc0003ef400, 0x13a8a68, 0xc0002913e0)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70 +0xc2
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x14, 0xc0031f2b90, 0x2, 0xc000340818)
    /go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617 +0x89
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc000627308, 0xc00046da20, 0x0, 0x0, 0x0)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x1ab2d20, 0xc00061e600, 0x8bda0c, 0x13881c0)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc003228080, 0x493c26, 0x62a777cc, 0x14323d4a, 0xf6fa2f301f7)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71 +0x15c
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x0, 0x203000, 0x13b1c70, 0xc0022fbc20)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc0032280a0, 0x0, 0x0, 0x0, 0x0)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78 +0xd9
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x1a82cc0, 0x0, 0xc000340b30, 0x40e318)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc000627308, 0xc000340ba0, 0x573fa6, 0x1158b40, 0xc003c4e3f0)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xd7
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler(0x10fa540, 0xc0003ef400, 0x13a1cb8, 0xc003c4e3f0, 0xc0021f2000, 0xc0002f15f0, 0x13a1cb8, 0xc003c4e3f0, 0xc003a187c8, 0x14)
    /go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320, 0xc0002f1740, 0x1a6e268, 0x0, 0x0, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc003ca7820, 0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320)
    /go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:932 +0x1fd 

lavacat commented 2 years ago

I think this is a good catch and a relevant bug. TL;DR: an etcd transaction shouldn't panic based on client context errors (timeouts, cancellations).

Here is how this happened:

  1. Etcd has had a panic in transactions since the very early days. https://github.com/etcd-io/etcd/pull/3534/files#diff-26b2effd11719aa79af125ed4e17234ef08970839d4f83fb4d802e04614c3cd4R161
  2. Then we added an optimization to bypass raft for serializable transactions. https://github.com/etcd-io/etcd/pull/4956/files#diff-26b2effd11719aa79af125ed4e17234ef08970839d4f83fb4d802e04614c3cd4R91
  3. Then it was limited to only readOnly transactions. https://github.com/etcd-io/etcd/pull/6543/files#diff-583048be54c04f9555a60191ec8767757fc7f0c477d5bc7f2191af8a3c1bf920R173
  4. Next we started passing the context down and using it for recording tracing info. https://github.com/etcd-io/etcd/pull/11491/files#diff-583048be54c04f9555a60191ec8767757fc7f0c477d5bc7f2191af8a3c1bf920R172
  5. Lastly we added context checking inside the range implementation. https://github.com/etcd-io/etcd/pull/12459/files#diff-c4842260ff90edd130afd1d3ac19cb782c4a59c98483070a7a7f817e4d2f13fdR151

I think the minimal fix should be to use an empty context for tracing in txn (a quick change). Maybe we should have a special type for the tracing context to prevent confusion.

But ideally we should match the behavior of readonly operations wrapped in serializable transactions with that of normal readonly operations. Correct me if I'm wrong here.

ahrtr commented 2 years ago

This is a valid issue.

Based on the current implementation, once a transaction starts, it must finish without any error; otherwise it will end up with partially committed data. The workaround in the current implementation is to crash the process to prevent a partial commit from happening.
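For illustration, here's a minimal runnable sketch (toy types and store, not etcd's actual applier) of the failure mode described above: once earlier ops in a txn have mutated state, a mid-flight error can't be rolled back, so the only safe reaction left is to crash.

```go
package main

import (
	"errors"
	"fmt"
)

// Op stands in for a single operation inside a txn; purely illustrative.
type Op struct{ Key, Val string }

var store = map[string]string{}

// apply mutates the store; if it fails after earlier ops succeeded,
// the store has already been partially updated.
func apply(op Op) error {
	if op.Key == "" {
		return errors.New("bad op")
	}
	store[op.Key] = op.Val
	return nil
}

// applyTxn mirrors the shape of the real applier: there is no rollback
// for ops that were already applied, so the server panics rather than
// expose a partially committed transaction.
func applyTxn(ops []Op) {
	for _, op := range ops {
		if err := apply(op); err != nil {
			panic(fmt.Sprintf("unexpected error during txn: %v", err))
		}
	}
}

func main() {
	// The first op is applied; the second fails, so this panics by design.
	applyTxn([]Op{{Key: "a", Val: "1"}, {Key: "", Val: "x"}})
}
```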

Probably we should use a BoltDB transaction to perform the etcd transaction, but that would need a big refactoring.

lavacat commented 2 years ago

@ahrtr I think this issue only affects readonly serializable transactions, which don't go through raftRequestOnce. For normal transactions this won't happen because the context isn't passed down to the applier.

ahrtr commented 2 years ago

Yes, this case happens for readonly (not necessarily serializable) transactions. But applyTxn is a generic implementation, and it doesn't know it's coming from a readonly transaction.

Note that we can't use an empty context (context.TODO()) either; otherwise the readonly transaction will not time out.
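A tiny runnable illustration of that constraint, in plain Go rather than etcd code: a read guarded by ctx.Done() is aborted under a client deadline, but Done() never fires for an empty context.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitOrTimeout stands in for a read that honors context cancellation.
func waitOrTimeout(ctx context.Context) string {
	select {
	case <-ctx.Done():
		return "aborted: " + ctx.Err().Error()
	case <-time.After(50 * time.Millisecond): // stands in for a slow range
		return "range finished"
	}
}

func main() {
	// With the client's context, a 10ms deadline aborts the slow read.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	fmt.Println(waitOrTimeout(ctx)) // aborted: context deadline exceeded

	// With an empty context Done() never fires, so the read can run
	// arbitrarily long -- the objection to plain context.TODO().
	fmt.Println(waitOrTimeout(context.TODO())) // range finished
}
```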

Probably we have two options, one short-term and one long-term:

  1. short-term fix: add a panic-recover to catch the panic for the readonly transaction case. We need to evaluate the impact.
  2. long-term fix: refactor the existing transaction design/implementation to use BoltDB transactions, and roll back the transaction on error. As I mentioned in my previous comment, this may need a big effort and accordingly has a big impact.

ahrtr commented 2 years ago

Please anyone feel free to deliver a PR for the short-term fix for now.

ptabor commented 2 years ago

> Yes, this case happens for readonly (not necessarily serializable) transactions. But applyTxn is a generic implementation, and it doesn't know it's coming from a readonly transaction.
>
> Note that we can't use an empty context (context.TODO()) either; otherwise the readonly transaction will not time out.
>
> Probably we have two options, one short-term and one long-term:
>
> 1. short-term fix: add a panic-recover to catch the panic for the readonly transaction case. We need to evaluate the impact.

That sounds right to me - as long as it's truly read-only code paths. I think we could do this as a gRPC interceptor.
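A rough sketch of what such an interceptor could look like; this is an assumed shape, not the actual etcd patch, and the check that the request really is read-only is elided:

```go
package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// recoverUnaryInterceptor converts a panic escaping the handler into a
// gRPC error instead of letting it kill the process. Only safe to install
// on handlers that are truly read-only, per the caveat above.
func recoverUnaryInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (resp interface{}, err error) {
		defer func() {
			if r := recover(); r != nil {
				err = status.Errorf(codes.Internal, "request aborted: %v", r)
			}
		}()
		return handler(ctx, req)
	}
}

func main() {
	// Wiring it into a server; the real change would live in etcd's
	// existing interceptor chain.
	_ = grpc.NewServer(grpc.ChainUnaryInterceptor(recoverUnaryInterceptor()))
}
```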

> 2. long-term fix: refactor the existing transaction design/implementation to use BoltDB transactions, and roll back the transaction on error. As I mentioned in my previous comment, this may need a big effort and accordingly has a big impact.

We do batching of the bolt transactions to minimize the number of B-tree nodes that are rewritten at the top of the tree. If we did true 1:1 transactions, it would have a negative performance impact.

As Context is an interface (https://pkg.go.dev/context#Context), I would propose implementing a wrapper around it that overrides the cancellation-related methods (Deadline, Done, Err) while still delegating Value to the parent.
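A minimal sketch of such a wrapper (my naming, not a committed design), assuming the goal is to keep the parent's values such as tracing info while hiding its deadline and cancellation:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type traceKey struct{}

// detachedContext keeps the parent's values (e.g. tracing info) but
// reports no deadline and no cancellation, so a client cancelling its
// RPC cannot abort work running under the detached context.
type detachedContext struct{ parent context.Context }

func (detachedContext) Deadline() (time.Time, bool) { return time.Time{}, false }
func (detachedContext) Done() <-chan struct{}       { return nil }
func (detachedContext) Err() error                  { return nil }
func (c detachedContext) Value(key interface{}) interface{} {
	return c.parent.Value(key)
}

func main() {
	parent := context.WithValue(context.Background(), traceKey{}, "trace-123")
	ctx, cancel := context.WithCancel(parent)
	cancel() // simulate the client cancelling its request

	d := detachedContext{parent: ctx}
	fmt.Println(ctx.Err())           // context canceled
	fmt.Println(d.Err())             // <nil>: cancellation doesn't leak through
	fmt.Println(d.Value(traceKey{})) // trace-123: values are still visible
}
```

(Go 1.21 later added context.WithoutCancel, which behaves much like this wrapper, though it wasn't available at the time of this discussion.)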

lavacat commented 2 years ago

Here is a different approach: create a separate apply function for readonly txns. It's basically a partial copy of the current applyTxn for range, but without recursion, because isTxnReadonly will return false for nested readonly txns (maybe it's a bug).
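A self-contained illustration of that nested-txn subtlety, using toy types rather than etcd's protobufs: a check that only accepts ranges reports a txn containing a nested read-only txn as not read-only, while a recursive check treats it correctly.

```go
package main

import "fmt"

// Toy request model, standing in for etcd's RequestOp oneof.
type Request interface{ isRequest() }

type RangeRequest struct{}
type PutRequest struct{}
type TxnRequest struct{ Success, Failure []Request }

func (RangeRequest) isRequest() {}
func (PutRequest) isRequest()   {}
func (TxnRequest) isRequest()   {}

// naive: any nested txn makes this false, even a read-only one.
func isTxnReadonlyNaive(t TxnRequest) bool {
	for _, branch := range [][]Request{t.Success, t.Failure} {
		for _, r := range branch {
			if _, ok := r.(RangeRequest); !ok {
				return false
			}
		}
	}
	return true
}

// recursive: a nested txn is read-only iff its own branches are.
func isTxnReadonly(t TxnRequest) bool {
	for _, branch := range [][]Request{t.Success, t.Failure} {
		for _, r := range branch {
			switch v := r.(type) {
			case RangeRequest:
				// reads are fine
			case TxnRequest:
				if !isTxnReadonly(v) {
					return false
				}
			default:
				return false
			}
		}
	}
	return true
}

func main() {
	nested := TxnRequest{Success: []Request{TxnRequest{Success: []Request{RangeRequest{}}}}}
	fmt.Println(isTxnReadonlyNaive(nested)) // false: nested txn rejected outright
	fmt.Println(isTxnReadonly(nested))      // true: only ranges inside
}
```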

I think this approach might be a bit more flexible for future changes to readonly txns, and there's no need to run recover.

If this looks good, I can add some tests.

https://github.com/etcd-io/etcd/pull/14113/

mcginne commented 2 years ago

We are also hitting this fairly regularly (several times per day) on both v3.5.2 & v3.5.4.

ahrtr commented 2 years ago

> We are also hitting this fairly regularly (several times per day) on both v3.5.2 & v3.5.4.

Would you mind providing the complete log and the endpoint info (see the command below)?

etcdctl endpoint status -w json --cluster

Could you also explain how you reproduce this issue? Did you see it in a testing environment or a production environment?

What's the volume of requests per second?

mcginne commented 2 years ago

@ahrtr endpoint info:

[{"Endpoint":"https://etcd-501-armada-stage5-south-sbhnpvg72b.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":5440534516375719315,"revision":5729930287,"raft_term":12923},"version":"3.5.4","dbSize":1037332480,"leader":8376786025734177010,"raftIndex":4073315546,"raftTerm":12923,"raftAppliedIndex":4073315546,"dbSizeInUse":963850240}},{"Endpoint":"https://etcd-501-armada-stage5-south-5zflxmmcz2.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":8376786025734177010,"revision":5729930289,"raft_term":12923},"version":"3.5.4","dbSize":1037058048,"leader":8376786025734177010,"raftIndex":4073315551,"raftTerm":12923,"raftAppliedIndex":4073315551,"dbSizeInUse":963796992}},{"Endpoint":"https://etcd-501-armada-stage5-south-7n25f4hms2.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":11064341971799022067,"revision":5729930301,"raft_term":12923},"version":"3.5.4","dbSize":1040326656,"leader":8376786025734177010,"raftIndex":4073315567,"raftTerm":12923,"raftAppliedIndex":4073315567,"dbSizeInUse":963985408}},{"Endpoint":"https://etcd-501-armada-stage5-south-prfhrxqx5k.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":11815188087669886612,"revision":5729930309,"raft_term":12923},"version":"3.5.4","dbSize":1037373440,"leader":8376786025734177010,"raftIndex":4073315585,"raftTerm":12923,"raftAppliedIndex":4073315585,"dbSizeInUse":964038656}},{"Endpoint":"https://etcd-501-armada-stage5-south-m5zlm8l8l6.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":12614509488604193654,"revision":5729930313,"raft_term":12923},"version":"3.5.4","dbSize":1040461824,"leader":8376786025734177010,"raftIndex":4073315590,"raftTerm":12923,"raftAppliedIndex":4073315590,"dbSizeInUse":964067328}}]

Log -> etcd_panic.log.zip

The total request rate is around 1000 req/sec; it's a very varied workload, with many microservices making different requests against etcd. I'm afraid I don't know exactly what triggers it. The workload runs constantly and should be fairly similar in nature at all times; I don't see any big spikes in requests or response times when we see the panic. We have a 5-member cluster and it seems to affect all 5 members.

This is our performance/scale test environment, so we are holding off moving to prod with v3.5 until this is resolved.

aphyr commented 2 years ago

If it helps narrow things down, the workload I'm using that crashes etcd is kv reads, writes, and transactions over an exponentially-distributed pool of keys. Roughly 500-1000 ops/sec as well, though I haven't looked to see if it crashes with fewer.

ahrtr commented 2 years ago

Thanks @aphyr, we already know the root cause in your case. I just want to double-check whether @mcginne ran into the same issue, caused by the same reason. I will take a look at the log provided by @mcginne today or tomorrow.

ahrtr commented 2 years ago

Thanks @mcginne. Confirmed that it's the same issue; we will fix it soon in both main (3.6) and 3.5. Hopefully the fix will be included in 3.5.5.

lavacat commented 2 years ago

FYI, I'll post a PR for main with the approach outlined here. If somebody else is working on the fix, please state so here, so there is no duplication of effort.

I'll try to get this in tomorrow, June 22.

mcginne commented 2 years ago

Thank you @ahrtr and @lavacat for confirming and the work on the fix - much appreciated.

lavacat commented 2 years ago

@serathius @ptabor reminder to review the PRs: main https://github.com/etcd-io/etcd/pull/14149, port to 3.5 https://github.com/etcd-io/etcd/pull/14178