Closed — aphyr closed this issue 2 years ago.
I'm not sure if this is precisely the same bug or a different one, but I've managed to reproduce a crash with a slightly different error message purely from process pauses — no compaction or defrag operations involved. Here are the full logs.
{"level":"panic","ts":"2022-06-13T17:45:54.601Z","caller":"etcdserver/apply.go:638","msg":"unexpected error during txn","error":"context canceled","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638\ngo.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617\ngithub.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1\n\t/go/src/go.etcd
.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"}
{"level":"warn","ts":"2022-06-13T17:45:54.603Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"6.263437345s","expected-duration":"100ms","prefix":"read-only txn ","request":"compare:<> success:<request_range:<key:\"2783\" > request_range:<key:\"2785\" > > failure:<>","response":""}
{"level":"info","ts":"2022-06-13T17:45:54.603Z","caller":"traceutil/trace.go:171","msg":"trace[2082798263] transaction","detail":"{read_only:true; }","duration":"6.26467016s","start":"2022-06-13T17:45:48.338Z","end":"2022-06-13T17:45:54.603Z","steps":["trace[2082798263] 'compare' (duration: 6.261489853s)"],"step_count":1}
panic: unexpected error during txn
goroutine 32669 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0030dda40, 0xc0031dbd40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc00040c3c0, 0x1220287, 0x1b, 0xc0031dbd40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn(0xc0003dc9d8, 0x13a1cb8, 0xc003c4e540, 0x13ab818, 0xc0031be650, 0xc002dc5f80, 0xc0028095f0, 0x1, 0x1, 0xc0032e0f50, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638 +0x930
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn(0xc0003dc9d8, 0x13a1cb8, 0xc003c4e540, 0xc002dc5f80, 0x2, 0x48, 0x493c26, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477 +0x38a
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3()
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181 +0x70
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize(0xc000464300, 0x13a1cb8, 0xc003c4e540, 0xc0003094a8, 0xc0003094d8, 0xc0031f2d70, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659 +0xf8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn(0xc000464300, 0x13a1cb8, 0xc003c4e540, 0xc002dc5f80, 0x0, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182 +0x40e
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn(0xc0003ef300, 0x13a1cb8, 0xc003c4e3f0, 0xc002dc5f80, 0xc002dc5f80, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95 +0xfc
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn(0xc0003ef400, 0x13a1cb8, 0xc003c4e3f0, 0xc002dc5f80, 0xc0003ef400, 0x13a8a68, 0xc0002913e0)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70 +0xc2
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x14, 0xc0031f2b90, 0x2, 0xc000340818)
/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617 +0x89
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc000627308, 0xc00046da20, 0x0, 0x0, 0x0)
/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x1ab2d20, 0xc00061e600, 0x8bda0c, 0x13881c0)
/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc003228080, 0x493c26, 0x62a777cc, 0x14323d4a, 0xf6fa2f301f7)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71 +0x15c
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x0, 0x203000, 0x13b1c70, 0xc0022fbc20)
/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc0032280a0, 0x0, 0x0, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78 +0xd9
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x1a82cc0, 0x0, 0xc000340b30, 0x40e318)
/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc000627308, 0xc000340ba0, 0x573fa6, 0x1158b40, 0xc003c4e3f0)
/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xd7
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler(0x10fa540, 0xc0003ef400, 0x13a1cb8, 0xc003c4e3f0, 0xc0021f2000, 0xc0002f15f0, 0x13a1cb8, 0xc003c4e3f0, 0xc003a187c8, 0x14)
/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320, 0xc0002f1740, 0x1a6e268, 0x0, 0x0, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc003ca7820, 0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320)
/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:932 +0x1fd
I think this is a good catch and a relevant bug. TL;DR: an etcd transaction shouldn't panic based on client context errors (timeouts, cancellations).
Here is how this happened:
The panic has been in transactions since very early days: https://github.com/etcd-io/etcd/pull/3534/files#diff-26b2effd11719aa79af125ed4e17234ef08970839d4f83fb4d802e04614c3cd4R161. I think the minimal fix is to use an empty context for tracing in txn (a quick change). Maybe we should have a special class for the tracing context to prevent confusion.
But ideally we should match the behavior of read-only operations wrapped in serializable transactions with that of normal read-only operations. Correct me if I'm wrong here.
This is a valid issue.
Based on the current implementation, once a transaction starts, it must finish without any error; otherwise it would end up with partially committed data. The workaround (the current implementation) is to crash the process to prevent a partial commit from happening.
Probably we should use the BoltDB transaction to perform the transaction, but that would need a big refactoring.
@ahrtr I think this issue is only for readonly serializable transactions that aren't going through raftRequestOnce. For normal transactions this won't happen, because the context isn't passed down to the applier.
Yes, this case happens for readonly (not necessarily serializable) transactions. But applyTxn is a generic implementation, and it doesn't know it's coming from a readonly transaction.
Note that we can't use an empty context (context.TODO()) either; otherwise the readonly transaction will never time out.
Probably we have two options, one short-term and one long-term. Anyone should feel free to deliver a PR for the short-term fix for now.
> Yes, this case happens for readonly (not necessarily serializable) transactions. But applyTxn is a generic implementation, and it doesn't know it's coming from a readonly transaction.
>
> Note that we can't use an empty context (context.TODO()) either; otherwise the readonly transaction will never time out. Probably we have two options, one short-term and one long-term:
> - short-term fix: add a panic-recover to catch the panic for the readonly transaction case. We need to evaluate the impact.
That sounds right to me, as long as it's truly the read-only code path. I think we could do this as a gRPC interceptor.
> - long-term fix: refactor the existing transaction design/implementation to use a BoltDB transaction, and roll back the transaction on error. Just as I mentioned in a previous comment, this may need big effort and accordingly has big impact.
We do batching of the bolt transactions to minimize the number of b-tree nodes that are rewritten at the top of the tree. If we did true 1:1 transactions it would have a negative performance impact.
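The short-term panic-recover idea might look roughly like this (a hypothetical sketch, not etcd's actual interceptor; the wrapper shape and names are assumptions):

```go
package main

import "fmt"

// recoverableReadOnlyTxn sketches the short-term idea: wrap the
// read-only txn handler and convert a panic back into an ordinary
// error. This is only safe on read-only paths, where a failed txn
// cannot leave partially committed state behind. Hypothetical sketch,
// not etcd's actual code.
func recoverableReadOnlyTxn(handler func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("read-only txn aborted: %v", r)
		}
	}()
	return handler()
}
```

In the real server this logic would sit somewhere a gRPC unary interceptor can reach, and — per the concern above — it must be gated so it never swallows a panic on a write path.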
As Context is an interface (https://pkg.go.dev/context#Context), I would propose implementing a wrapper around it that overrides the:
Here is a different approach: create a separate apply function for readonly txns. It's basically a partial copy of the current applyTxn for range, but without recursion, because isTxnReadonly will return false for a nested readonly txn (maybe it's a bug).
I think this approach might be a bit more flexible for future changes to readonly txn. And no need to run recover.
If this looks good, I can add some tests.
We are also hitting this fairly regularly (several times per day) on both v3.5.2 & v3.5.4.
> We are also hitting this fairly regularly (several times per day) on both v3.5.2 & v3.5.4.
Would you mind providing the complete log and the endpoint info (see command below)?
etcdctl endpoint status -w json --cluster
Could you also explain how you reproduce this issue? Did you see it in a testing environment or a production environment?
What's the volume of requests per second?
@ahrtr endpoint info:
[{"Endpoint":"https://etcd-501-armada-stage5-south-sbhnpvg72b.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":5440534516375719315,"revision":5729930287,"raft_term":12923},"version":"3.5.4","dbSize":1037332480,"leader":8376786025734177010,"raftIndex":4073315546,"raftTerm":12923,"raftAppliedIndex":4073315546,"dbSizeInUse":963850240}},{"Endpoint":"https://etcd-501-armada-stage5-south-5zflxmmcz2.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":8376786025734177010,"revision":5729930289,"raft_term":12923},"version":"3.5.4","dbSize":1037058048,"leader":8376786025734177010,"raftIndex":4073315551,"raftTerm":12923,"raftAppliedIndex":4073315551,"dbSizeInUse":963796992}},{"Endpoint":"https://etcd-501-armada-stage5-south-7n25f4hms2.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":11064341971799022067,"revision":5729930301,"raft_term":12923},"version":"3.5.4","dbSize":1040326656,"leader":8376786025734177010,"raftIndex":4073315567,"raftTerm":12923,"raftAppliedIndex":4073315567,"dbSizeInUse":963985408}},{"Endpoint":"https://etcd-501-armada-stage5-south-prfhrxqx5k.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":11815188087669886612,"revision":5729930309,"raft_term":12923},"version":"3.5.4","dbSize":1037373440,"leader":8376786025734177010,"raftIndex":4073315585,"raftTerm":12923,"raftAppliedIndex":4073315585,"dbSizeInUse":964038656}},{"Endpoint":"https://etcd-501-armada-stage5-south-m5zlm8l8l6.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":12614509488604193654,"revision":5729930313,"raft_term":12923},"version":"3.5.4","dbSize":1040461824,"leader":8376786025734177010,"raftIndex":4073315590,"raftTerm":12923,"raftAppliedIndex":4073315590,"dbSizeInUse":964067328}}]
Log -> etcd_panic.log.zip
The request rate is around 1000 req/sec in total; it's a very varied workload, with many microservices making different requests against etcd. I'm afraid I don't know exactly what triggers it to occur. The workload runs constantly and should be fairly similar in nature at all times — I don't see any big spikes in requests or response times when we see the panic. We have a 5-member cluster, and it seems to affect all 5 members.
This is our performance/scale test environment, so we are holding off moving to prod with v3.5 until this is resolved.
If it helps narrow things down, the workload I'm using that crashes etcd is kv reads, writes, and transactions over an exponentially-distributed pool of keys. Roughly 500-1000 ops/sec as well, though I haven't looked to see if it crashes with fewer.
Thanks @aphyr, we already know the root cause in your case. I just want to confirm whether @mcginne ran into the same issue for the same reason. I will take a look at the log provided by @mcginne today or tomorrow.
Thanks @mcginne . Confirmed that it's the same issue, and we will fix this issue soon in both main (3.6) and 3.5. Hopefully the fix will be included in 3.5.5.
FYI, I'll post a PR for main with the approach outlined here. If somebody else is working on the fix, please state so here, so there is no duplication of effort.
I'll try to get this in tomorrow June 22
Thank you @ahrtr and @lavacat for confirming and the work on the fix - much appreciated.
@serathius @ptabor reminder to review the PRs main https://github.com/etcd-io/etcd/pull/14149 port to 3.5 https://github.com/etcd-io/etcd/pull/14178
What happened?
In a Jepsen test run of five etcd 3.5.3 nodes, with process pauses (sending processes SIGSTOP and SIGCONT), compaction (performed via the admin API), and defragmentation (via etcdctl), one etcd process crashed with the following error message:
This is the first time this has cropped up, so I'm not sure whether pauses, compaction, and defrag are all necessary to trigger the bug. The only nemesis operations on this particular node before the crash appear to be pause & resume, but I imagine it's possible that other nodes performing compaction or defragmentation might have caused them to exchange (or fail to exchange!) messages with this node, which caused it to crash.
Here's the full logs and data files from the cluster: 20220613T140950.000Z.zip.
What did you expect to happen?
I expected etcd not to crash.
How can we reproduce it (as minimally and precisely as possible)?
Check out jepsen-etcd 84d7d54698c387ed467dd5dfb8ca4bebc2ee46d5, and with a five-node cluster, run:
I'm not sure how long this will take to reproduce yet--still collecting evidence.
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
These node IDs aren't going to match, because I collected them from a different test run, but it should give you a general idea of the cluster topology:
Note that since these tests involve process pauses, it's tough to reach in and grab a coherent cluster view--pretty much every attempt I made involved timeouts.
Relevant log output
No response