Closed cockroach-teamcity closed 1 year ago
At first glance, this is an interesting failure. The test fails here:
It's literally just waiting for an epoch-based lease to expire. Shouldn't take 45s!
Looked at the logs. The test prints this when it introduces the partition that should wreck liveness
I230719 18:53:31.959585 2572 kv/kvserver_test/client_raft_test.go:1438 ⋮ [-] 319 test: waiting for lease expiration
and it definitely gets wrecked! There are tons of
E230719 18:54:06.071519 40658 kv/kvserver/replica_range_lease.go:471 ⋮ [T1,n1,s1,r9/1:‹/Table/{5-6}›] 6230 failed to heartbeat own liveness record: context canceled
and yet, 40s in,
I230719 18:54:06.071785 2572 testutils/soon.go:42 ⋮ [-] 6231 SucceedsSoon: lease still valid
This is a mostly vanilla TestCluster (mod an introduced network partition, etc) and so it ought to be on a regular clock (i.e. no manual clock that doesn't move over time).
Good news is, got a repro[^1]. Will add some logging, hopefully once we see the actual lease printed it's a little less mysterious.
[^1]: ./dev test --stress --filter TestRequestsOnFollowerWithNonLiveLeaseholder ./pkg/kv/kvserver/
on gceworker after 185s.
client_raft_test.go:1439: condition failed to evaluate within 45s: from client_raft_test.go:1442: lease still valid: {Lease:repl=(n3,s3):3 seq=2
start=1690197987.025104522,0 epo=1
pro=1690197987.025104522,0
Now:1690198028.622447513,0 RequestTime:1690198028.622447513,0 State:VALID ErrInfo: Liveness:liveness(nid:3 epo:1 exp:1690198030.317778663,0) MinValidObservedTimestamp:0,0}
The interesting part is n3
. I believe it should be n1:
Ah, duh, I was reading this wrong. We are checking n1, no doubt about that. But the lease is on n3 in the case in which the test fails. We're probably not managing to pin the lease appropriately.
kv/kvserver.TestRequestsOnFollowerWithNonLiveLeaseholder failed with artifacts on master @ 30acaf93015008a672667c0d3fdc00ac95f446a6:
Help
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
/cc @cockroachdb/replication
This test on roachdash | Improve this report!
Jira issue: CRDB-29943