cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.13k stars 3.81k forks source link

acceptance: TestDockerCLI_test_demo_networking failed [lease is transferred from s1->s2->s1 thousands of times] #118346

Closed cockroach-teamcity closed 8 months ago

cockroach-teamcity commented 9 months ago

acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ 4593ae70148871f7f346a93123d60a7bd2c4955e:

=== RUN   TestDockerCLI_test_demo_networking
    test_log_scope.go:170: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking2433442169
    test_log_scope.go:81: use -show-logs to present logs inline
    generated_cli_test.go:138: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking2433442169
--- FAIL: TestDockerCLI_test_demo_networking (92.77s)
=== RUN   TestDockerCLI_test_demo_networking/test_demo_networking.tcl
    util_docker.go:257: non-zero exit code: 1
    --- FAIL: TestDockerCLI_test_demo_networking/test_demo_networking.tcl (80.14s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-35686

rafiss commented 9 months ago

I'm not sure what went wrong here. The main symptom I see is that there is a lease ping-ponging between s1 and s2 thousands of times:

I240126 02:38:14.421128 11976 13@kv/kvserver/allocator/plan/replicate.go:1055 ⋮ [T1,Vsystem,n1,replicate,s1,r20/1:‹/Table/1{7-8}›] 3499  transferring lease to s2
I240126 02:38:14.421146 11976 13@kv/kvserver/replicate_queue.go:1076 ⋮ [T1,Vsystem,n1,replicate,s1,r20/1:‹/Table/1{7-8}›] 3500  transferring lease to s2
I240126 02:38:14.424352 11980 13@kv/kvserver/allocator/plan/replicate.go:1055 ⋮ [T1,Vsystem,n2,replicate,s2,r20/2:‹/Table/1{7-8}›] 3501  transferring lease to s1
I240126 02:38:14.424368 11980 13@kv/kvserver/replicate_queue.go:1076 ⋮ [T1,Vsystem,n2,replicate,s2,r20/2:‹/Table/1{7-8}›] 3502  transferring lease to s1

Moving to KV-Replication to see if they have any theories for why that could happen.

blathers-crl[bot] commented 9 months ago

cc @cockroachdb/replication

kvoli commented 9 months ago

The cluster sets RF=1 with 2 nodes, then observes pretty extreme thrashing.

This occurs after rebalancing (with RF=1) towards either n1 or n2, and we end up with 2 replicas for the range. The leaseholder then hits:

https://github.com/cockroachdb/cockroach/blob/317a2007b6eecd897f7a150c5de8b2070d716837/pkg/kv/kvserver/allocator/plan/replicate.go#L646-L650

Which will attempt to transfer the lease away first, before down-replicating back to RF=1. Its surprising the lease transfers are occurring so quickly, they should be waiting at least a second -- however maybeTransferLeaseAwayTarget doesn't pass in CanTransferLeaseAway(..), so it doesn't observe the min lease transfer interval:

https://github.com/cockroachdb/cockroach/blob/317a2007b6eecd897f7a150c5de8b2070d716837/pkg/kv/kvserver/replicate_queue.go#L1138-L1141

I wonder how cycle is supposed to end usually:

  1. a has 1 voter, rebalances to b [a]
  2. can't both swap b and a in one change replicas, so just add b [a*,b]
  3. (a|b) now notices the range is over-replicated and hits the maybe transfer lease path , transferring to (b|a) ([a,b] | [a,b])
  4. repeat 3.

I'll try out some simulations.

kvoli commented 9 months ago

20 minute simulated run with 100 ranges RF=1 and 2 nodes:

100.00 ┼──────╮      ╭───╮
 93.33 ┤      ╰╮    ╭╯   ╰────╮
 86.67 ┤       │    │         │
 80.00 ┤       ╰╮  ╭╯         ╰╮    ╭─────╮
 73.33 ┤        │  │           │   ╭╯     ╰─╮   ╭───╮
 66.67 ┤        ╰╮╭╯           ╰╮ ╭╯        ╰╮ ╭╯   ╰─╮  ╭──╮
 60.00 ┤         ││             │ │          │ │      ╰╮╭╯  ╰──╮ ╭──╮
 53.33 ┤         ╭╯             ╰╮╯          ╰╭╯       │╯      ╰╭╯  ╰──╮───────╭───────
 46.67 ┤         │╮             ╭╰╮          ╭╯╮       ╰╮      ╭╯╮  ╭──╰───────╯───────
 40.00 ┤         ││             │ │          │ │      ╭╯╰╮  ╭──╯ ╰──╯
 33.33 ┤        ╭╯╰╮           ╭╯ ╰╮        ╭╯ ╰╮   ╭─╯  ╰──╯
 26.67 ┤        │  │           │   ╰╮     ╭─╯   ╰───╯
 20.00 ┤       ╭╯  ╰╮         ╭╯    ╰─────╯
 13.33 ┤       │    │         │
  6.67 ┤      ╭╯    ╰╮   ╭────╯
  0.00 ┼──────╯      ╰───╯
                                             lease

It does stabilize (not completely) but takes a while. In this test, it could be worthwhile setting a higher RF, or not using 2 nodes.

kvoli commented 9 months ago

This is readily reproducible using a local roachprod cluster as well:

image image image

I will file an issue and open a PR to change this test to use 3 nodes instead.

kvoli commented 9 months ago

See https://github.com/cockroachdb/cockroach/issues/108420.

cockroach-teamcity commented 9 months ago

acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ d272e9ef5589deff570efc023db6c70edfde311c:

=== RUN   TestDockerCLI_test_demo_networking
    test_log_scope.go:170: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking1763553455
    test_log_scope.go:81: use -show-logs to present logs inline
    generated_cli_test.go:138: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking1763553455
--- FAIL: TestDockerCLI_test_demo_networking (88.93s)
=== RUN   TestDockerCLI_test_demo_networking/test_demo_networking.tcl
    util_docker.go:257: non-zero exit code: 1
    --- FAIL: TestDockerCLI_test_demo_networking/test_demo_networking.tcl (77.23s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

This test on roachdash | Improve this report!

cockroach-teamcity commented 9 months ago

acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ 6b88dec162f403f6d35fee3796f55bc2bffe4c38:

=== RUN   TestDockerCLI_test_demo_networking
    test_log_scope.go:170: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking3042734063
    test_log_scope.go:81: use -show-logs to present logs inline
    generated_cli_test.go:138: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking3042734063
--- FAIL: TestDockerCLI_test_demo_networking (86.99s)
=== RUN   TestDockerCLI_test_demo_networking/test_demo_networking.tcl
    util_docker.go:257: non-zero exit code: 1
    --- FAIL: TestDockerCLI_test_demo_networking/test_demo_networking.tcl (78.69s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

This test on roachdash | Improve this report!

cockroach-teamcity commented 9 months ago

acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ 6b88dec162f403f6d35fee3796f55bc2bffe4c38:

=== RUN   TestDockerCLI_test_demo_networking
    test_log_scope.go:170: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking2458737924
    test_log_scope.go:81: use -show-logs to present logs inline
    generated_cli_test.go:138: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking2458737924
--- FAIL: TestDockerCLI_test_demo_networking (88.32s)
=== RUN   TestDockerCLI_test_demo_networking/test_demo_networking.tcl
    util_docker.go:257: non-zero exit code: 1
    --- FAIL: TestDockerCLI_test_demo_networking/test_demo_networking.tcl (78.32s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

This test on roachdash | Improve this report!

cockroach-teamcity commented 9 months ago

acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ 6b88dec162f403f6d35fee3796f55bc2bffe4c38:

=== RUN   TestDockerCLI_test_demo_networking
    test_log_scope.go:170: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking2423139381
    test_log_scope.go:81: use -show-logs to present logs inline
    generated_cli_test.go:138: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking2423139381
--- FAIL: TestDockerCLI_test_demo_networking (70.84s)
=== RUN   TestDockerCLI_test_demo_networking/test_demo_networking.tcl
    util_docker.go:257: non-zero exit code: 1
    --- FAIL: TestDockerCLI_test_demo_networking/test_demo_networking.tcl (68.04s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

This test on roachdash | Improve this report!

cockroach-teamcity commented 9 months ago

acceptance.TestDockerCLI_test_demo_networking failed with artifacts on master @ 9a3b3c95b20b8dc10ce0f5e8184d4dfe3aa69689:

=== RUN   TestDockerCLI_test_demo_networking
    test_log_scope.go:170: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking3837640790
    test_log_scope.go:81: use -show-logs to present logs inline
    generated_cli_test.go:138: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/_tmp/43e343882df6bd7a3a9d93599612ca74/logTestDockerCLI_test_demo_networking3837640790
--- FAIL: TestDockerCLI_test_demo_networking (85.86s)
=== RUN   TestDockerCLI_test_demo_networking/test_demo_networking.tcl
    util_docker.go:257: non-zero exit code: 1
    --- FAIL: TestDockerCLI_test_demo_networking/test_demo_networking.tcl (76.90s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

This test on roachdash | Improve this report!

kvoli commented 9 months ago

This is occurring more frequently but it appears to be the same issue as before.

kvoli commented 9 months ago

The PR needs to be updated to fix the static expect port mapping for this test.

github-actions[bot] commented 8 months ago

We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.