cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.15k stars 3.81k forks source link

kv/kvserver: TestRequestsOnFollowerWithNonLiveLeaseholder failed #107200

Closed cockroach-teamcity closed 1 year ago

cockroach-teamcity commented 1 year ago

kv/kvserver.TestRequestsOnFollowerWithNonLiveLeaseholder failed with artifacts on master @ 30acaf93015008a672667c0d3fdc00ac95f446a6:

=== RUN   TestRequestsOnFollowerWithNonLiveLeaseholder
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/7e19515d2bcbb456ac5289a12c95b9a2/logTestRequestsOnFollowerWithNonLiveLeaseholder1391283936
    test_log_scope.go:81: use -show-logs to present logs inline
    client_raft_test.go:1439: condition failed to evaluate within 45s: from client_raft_test.go:1442: lease still valid
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/7e19515d2bcbb456ac5289a12c95b9a2/logTestRequestsOnFollowerWithNonLiveLeaseholder1391283936
--- FAIL: TestRequestsOnFollowerWithNonLiveLeaseholder (46.79s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

/cc @cockroachdb/replication

This test on roachdash | Improve this report!

Jira issue: CRDB-29943

tbg commented 1 year ago

At first glance, this is an interesting failure. The test fails here:

https://github.com/cockroachdb/cockroach/blob/a851fe95f9715fd95fe25907e770514a7157d802/pkg/kv/kvserver/client_raft_test.go#L1438-L1446

It's literally just waiting for an epoch-based lease to expire. Shouldn't take 45s!

Looked at the logs. The test prints this when it introduces the partition that should wreck liveness

I230719 18:53:31.959585 2572 kv/kvserver_test/client_raft_test.go:1438 ⋮ [-] 319 test: waiting for lease expiration

and it definitely gets wrecked! There are tons of

E230719 18:54:06.071519 40658 kv/kvserver/replica_range_lease.go:471 ⋮ [T1,n1,s1,r9/1:‹/Table/{5-6}›] 6230 failed to heartbeat own liveness record: context canceled

and yet, 40s in,

I230719 18:54:06.071785 2572 testutils/soon.go:42 ⋮ [-] 6231 SucceedsSoon: lease still valid

This is a mostly vanilla TestCluster (mod an introduced network partition, etc) and so it ought to be on a regular clock (i.e. no manual clock that doesn't move over time).

tbg commented 1 year ago

Good news is, got a repro[^1]. Will add some logging, hopefully once we see the actual lease printed it's a little less mysterious.

[^1]: ./dev test --stress --filter TestRequestsOnFollowerWithNonLiveLeaseholder ./pkg/kv/kvserver/ on gceworker after 185s.

tbg commented 1 year ago
    client_raft_test.go:1439: condition failed to evaluate within 45s: from client_raft_test.go:1442: lease still valid: {Lease:repl=(n3,s3):3 seq=2
start=1690197987.025104522,0 epo=1
pro=1690197987.025104522,0
Now:1690198028.622447513,0 RequestTime:1690198028.622447513,0 State:VALID ErrInfo: Liveness:liveness(nid:3 epo:1 exp:1690198030.317778663,0) MinValidObservedTimestamp:0,0}

The interesting part is n3. I believe it should be n1:

https://github.com/cockroachdb/cockroach/blob/cff213d53ed1569acf35cf5b12a67fb810360713/pkg/kv/kvserver/client_raft_test.go#L1398-L1399

tbg commented 1 year ago

Ah, duh, I was reading this wrong. We are checking n1, no doubt about that. But the lease is on n3 in the case in which the test fails. We're probably not managing to pin the lease appropriately.