cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.03k stars 3.8k forks source link

kv/kvserver: TestClosedTimestampCantServeWithConflictingIntent failed #101824

Closed cockroach-teamcity closed 1 year ago

cockroach-teamcity commented 1 year ago

kv/kvserver.TestClosedTimestampCantServeWithConflictingIntent failed with artifacts on master @ a4ab6b6243f509e62993b4d81855f23470d9901a:

=== RUN   TestClosedTimestampCantServeWithConflictingIntent
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent1436482389
    test_log_scope.go:79: use -show-logs to present logs inline
    closed_timestamp_test.go:351: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/closed_timestamp_test.go:351
            Error:          Received unexpected error:
                            [NotLeaseHolderError] lease held by different store; r63: replica (n1,s1):1 not lease holder; current lease is repl=(n2,s2):3 seq=2 start=1681901373.794041544,0 exp=1681901379.793263093,0 pro=1681901373.793263093,0
            Test:           TestClosedTimestampCantServeWithConflictingIntent
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent1436482389
--- FAIL: TestClosedTimestampCantServeWithConflictingIntent (40.01s)

Parameters: TAGS=bazel,gss,race

Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

Same failure on other branches

- #100053 kv/kvserver: TestClosedTimestampCantServeWithConflictingIntent failed [C-test-failure O-robot branch-release-22.1]

/cc @cockroachdb/replication

This test on roachdash | Improve this report!

Jira issue: CRDB-27142

cockroach-teamcity commented 1 year ago

kv/kvserver.TestClosedTimestampCantServeWithConflictingIntent failed with artifacts on master @ 3b1bc872bb4bf8e5eb2ad8790ba8ec781167dd1f:

=== RUN   TestClosedTimestampCantServeWithConflictingIntent
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent3629078817
    test_log_scope.go:79: use -show-logs to present logs inline
    closed_timestamp_test.go:337: request unexpectedly returned, should block; err: [NotLeaseHolderError] lease held by different store; r64: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):2 seq=2 start=1682681331.525769256,0 exp=1682681337.524812851,0 pro=1682681331.524812851,0
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent3629078817
--- FAIL: TestClosedTimestampCantServeWithConflictingIntent (40.36s)

Parameters: TAGS=bazel,gss,race

Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

Same failure on other branches

- #100053 kv/kvserver: TestClosedTimestampCantServeWithConflictingIntent failed [C-test-failure O-robot branch-release-22.1]

This test on roachdash | Improve this report!

cockroach-teamcity commented 1 year ago

kv/kvserver.TestClosedTimestampCantServeWithConflictingIntent failed with artifacts on master @ 4619767bb0b75ac85ddcb76e33c73211e369afed:

=== RUN   TestClosedTimestampCantServeWithConflictingIntent
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent2101771649
    test_log_scope.go:79: use -show-logs to present logs inline
    closed_timestamp_test.go:351: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/closed_timestamp_test.go:351
            Error:          Received unexpected error:
                            [NotLeaseHolderError] lease held by different store; r64: replica (n1,s1):1 not lease holder; current lease is repl=(n2,s2):3 seq=2 start=1682767595.564162603,0 exp=1682767601.563077486,0 pro=1682767595.563077486,0
            Test:           TestClosedTimestampCantServeWithConflictingIntent
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent2101771649
--- FAIL: TestClosedTimestampCantServeWithConflictingIntent (40.83s)

Parameters: TAGS=bazel,gss,race

Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

Same failure on other branches

- #100053 kv/kvserver: TestClosedTimestampCantServeWithConflictingIntent failed [C-test-failure O-robot branch-release-22.1]

This test on roachdash | Improve this report!

cockroach-teamcity commented 1 year ago

kv/kvserver.TestClosedTimestampCantServeWithConflictingIntent failed with artifacts on master @ 69875da00208dd792cb92aee54811e8a660773a9:

=== RUN   TestClosedTimestampCantServeWithConflictingIntent
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent464149897
    test_log_scope.go:79: use -show-logs to present logs inline
    closed_timestamp_test.go:351: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/closed_timestamp_test.go:351
            Error:          Received unexpected error:
                            [NotLeaseHolderError] lease held by different store; r64: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=2 start=1684314268.194372668,0 exp=1684314274.193422745,0 pro=1684314268.193422745,0
            Test:           TestClosedTimestampCantServeWithConflictingIntent
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent464149897
--- FAIL: TestClosedTimestampCantServeWithConflictingIntent (34.98s)

Parameters: TAGS=bazel,gss,race

Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

Same failure on other branches

- #100053 kv/kvserver: TestClosedTimestampCantServeWithConflictingIntent failed [C-test-failure O-robot branch-release-22.1]

This test on roachdash | Improve this report!

erikgrinaker commented 1 year ago

All of these failures use expiration-based leases under race, and probably aren't able to maintain the leases. Submitting a PR to extend the lease duration.

tbg commented 1 year ago

In https://teamcity.cockroachdb.com/viewLog.html?buildId=10875307&tab=buildResultsDiv&buildTypeId=Cockroach_Ci_Tests_Testrace on https://github.com/cockroachdb/cockroach/pull/106145

=== RUN   TestClosedTimestampCantServeWithConflictingIntent
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent1610110799
    test_log_scope.go:81: use -show-logs to present logs inline
    closed_timestamp_test.go:342: request unexpectedly returned, should block; err: [NotLeaseHolderError] lease held by different store; r64: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=2 start=1689125159.587330435,0 exp=1689125219.586585395,0 pro=1689125159.586585395,0
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent1610110799
--- FAIL: TestClosedTimestampCantServeWithConflictingIntent (35.37s)

Note that exp-start is 60s, and the test still flakes.

erikgrinaker commented 1 year ago

@andrewbaptist Are you actively working on this? If not, should repl pick it up?

andrewbaptist commented 1 year ago

Repl can pick this up if you have time.

tbg commented 1 year ago

The error is here on the last line:

https://github.com/cockroachdb/cockroach/blob/193e75aa2cc6b12c583ed4da80236c5408c942b5/pkg/kv/kvserver/closed_timestamp_test.go#L330-L342

That is, we just^1 follower-read successfully at MinTimestamp.Prev() on all replicas and are now expecting to block when we read at MinTimestamp (since there's an intent, presumably). However, instead of that we get a NLHE.

I can imagine two reasons why that is: either the txn got aborted (however the test sets a long txn liveness threshold to avoid this^2) or there was some lease change that now made the follower read impossible in the first place (so it never got to bump into the intent).

Some notes on the logs^3 of this test: we spend >25s waiting for full replication. Then, pretty much exactly when the test also starts quiescing, we see a split

I230712 01:25:59.582913 1451694 kv/kvserver/replica_command.go:439 ⋮ [T1,n1,split,s1,r63/1:‹/Table/{62-106/1/0}›] 411 initiating a split of this range at key /Table/106 [r65] (‹span config›)

but note that our range is r64, so this is not anywhere our test reads (and even if it were, I don't know how it would be relevant).

I was unable to repro easily on my gceworker with

./dev test --race --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/ 2>&1 | tee stress.log
48 runs so far, 0 failures, over 5m40s
tbg commented 1 year ago

Ah nevermind this explains it

I230712 01:25:59.586438 1451847 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/106/1/0-Max}›] 1104 transferring lease to s3

The test expects leases to be stable. But its using ReplicationAuto so leases can be transferred - another case of https://github.com/cockroachdb/cockroach/issues/107524 though we can be less obviously incorrect in this test.

Note that https://github.com/cockroachdb/cockroach/issues/107179 also uses the helper that sets up this cluster, so it too might see errant lease transfers, perhaps they are sharing a root cause.