Closed cockroach-teamcity closed 8 months ago
I'm not sure what went wrong here. The main symptom I see is that there is a lease ping-ponging between s1 and s2 thousands of times:
I240126 02:38:14.421128 11976 13@kv/kvserver/allocator/plan/replicate.go:1055 ⋮ [T1,Vsystem,n1,replicate,s1,r20/1:‹/Table/1{7-8}›] 3499 transferring lease to s2
I240126 02:38:14.421146 11976 13@kv/kvserver/replicate_queue.go:1076 ⋮ [T1,Vsystem,n1,replicate,s1,r20/1:‹/Table/1{7-8}›] 3500 transferring lease to s2
I240126 02:38:14.424352 11980 13@kv/kvserver/allocator/plan/replicate.go:1055 ⋮ [T1,Vsystem,n2,replicate,s2,r20/2:‹/Table/1{7-8}›] 3501 transferring lease to s1
I240126 02:38:14.424368 11980 13@kv/kvserver/replicate_queue.go:1076 ⋮ [T1,Vsystem,n2,replicate,s2,r20/2:‹/Table/1{7-8}›] 3502 transferring lease to s1
Moving to KV-Replication to see if they have any theories for why that could happen.
cc @cockroachdb/replication
The cluster sets RF=1 with 2 nodes, then observes pretty extreme thrashing.
This occurs after rebalancing (with RF=1) towards either n1 or n2, and we end up with 2 replicas for the range. The leaseholder then hits:
Which will attempt to transfer the lease away first, before down-replicating back to RF=1. Its surprising the lease transfers are occurring so quickly, they should be waiting at least a second -- however maybeTransferLeaseAwayTarget
doesn't pass in CanTransferLeaseAway(..)
, so it doesn't observe the min lease transfer interval:
I wonder how cycle is supposed to end usually:
I'll try out some simulations.
20 minute simulated run with 100 ranges RF=1 and 2 nodes:
100.00 ┼──────╮ ╭───╮
93.33 ┤ ╰╮ ╭╯ ╰────╮
86.67 ┤ │ │ │
80.00 ┤ ╰╮ ╭╯ ╰╮ ╭─────╮
73.33 ┤ │ │ │ ╭╯ ╰─╮ ╭───╮
66.67 ┤ ╰╮╭╯ ╰╮ ╭╯ ╰╮ ╭╯ ╰─╮ ╭──╮
60.00 ┤ ││ │ │ │ │ ╰╮╭╯ ╰──╮ ╭──╮
53.33 ┤ ╭╯ ╰╮╯ ╰╭╯ │╯ ╰╭╯ ╰──╮───────╭───────
46.67 ┤ │╮ ╭╰╮ ╭╯╮ ╰╮ ╭╯╮ ╭──╰───────╯───────
40.00 ┤ ││ │ │ │ │ ╭╯╰╮ ╭──╯ ╰──╯
33.33 ┤ ╭╯╰╮ ╭╯ ╰╮ ╭╯ ╰╮ ╭─╯ ╰──╯
26.67 ┤ │ │ │ ╰╮ ╭─╯ ╰───╯
20.00 ┤ ╭╯ ╰╮ ╭╯ ╰─────╯
13.33 ┤ │ │ │
6.67 ┤ ╭╯ ╰╮ ╭────╯
0.00 ┼──────╯ ╰───╯
lease
It does stabilize (not completely) but takes a while. In this test, it could be worthwhile setting a higher RF, or not using 2 nodes.
This is readily reproducible using a local roachprod cluster as well:
I will file an issue and open a PR to change this test to use 3 nodes instead.
acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ d272e9ef5589deff570efc023db6c70edfde311c:
=== RUN TestDockerCLI_test_demo_networking
test_log_scope.go:170: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking1763553455
test_log_scope.go:81: use -show-logs to present logs inline
generated_cli_test.go:138: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking1763553455
--- FAIL: TestDockerCLI_test_demo_networking (88.93s)
=== RUN TestDockerCLI_test_demo_networking/test_demo_networking.tcl
util_docker.go:257: non-zero exit code: 1
--- FAIL: TestDockerCLI_test_demo_networking/test_demo_networking.tcl (77.23s)
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ 6b88dec162f403f6d35fee3796f55bc2bffe4c38:
=== RUN TestDockerCLI_test_demo_networking
test_log_scope.go:170: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking3042734063
test_log_scope.go:81: use -show-logs to present logs inline
generated_cli_test.go:138: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking3042734063
--- FAIL: TestDockerCLI_test_demo_networking (86.99s)
=== RUN TestDockerCLI_test_demo_networking/test_demo_networking.tcl
util_docker.go:257: non-zero exit code: 1
--- FAIL: TestDockerCLI_test_demo_networking/test_demo_networking.tcl (78.69s)
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ 6b88dec162f403f6d35fee3796f55bc2bffe4c38:
=== RUN TestDockerCLI_test_demo_networking
test_log_scope.go:170: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking2458737924
test_log_scope.go:81: use -show-logs to present logs inline
generated_cli_test.go:138: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking2458737924
--- FAIL: TestDockerCLI_test_demo_networking (88.32s)
=== RUN TestDockerCLI_test_demo_networking/test_demo_networking.tcl
util_docker.go:257: non-zero exit code: 1
--- FAIL: TestDockerCLI_test_demo_networking/test_demo_networking.tcl (78.32s)
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ 6b88dec162f403f6d35fee3796f55bc2bffe4c38:
=== RUN TestDockerCLI_test_demo_networking
test_log_scope.go:170: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking2423139381
test_log_scope.go:81: use -show-logs to present logs inline
generated_cli_test.go:138: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking2423139381
--- FAIL: TestDockerCLI_test_demo_networking (70.84s)
=== RUN TestDockerCLI_test_demo_networking/test_demo_networking.tcl
util_docker.go:257: non-zero exit code: 1
--- FAIL: TestDockerCLI_test_demo_networking/test_demo_networking.tcl (68.04s)
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ 9a3b3c95b20b8dc10ce0f5e8184d4dfe3aa69689:
=== RUN TestDockerCLI_test_demo_networking
test_log_scope.go:170: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking3837640790
test_log_scope.go:81: use -show-logs to present logs inline
generated_cli_test.go:138: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking3837640790
--- FAIL: TestDockerCLI_test_demo_networking (85.86s)
=== RUN TestDockerCLI_test_demo_networking/test_demo_networking.tcl
util_docker.go:257: non-zero exit code: 1
--- FAIL: TestDockerCLI_test_demo_networking/test_demo_networking.tcl (76.90s)
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
This is occurring more frequently but it appears to be the same issue as before.
The PR needs to be updated to fix the static expect port mapping for this test.
We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.
acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ 4593ae70148871f7f346a93123d60a7bd2c4955e:
Help
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
/cc @cockroachdb/sql-foundations
This test on roachdash | Improve this report!
Jira issue: CRDB-35686