cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.2k stars 3.82k forks source link

ccl/changefeedccl: TestChangefeedWithSimpleDistributionStrategy failed #121408

Open cockroach-teamcity opened 8 months ago

cockroach-teamcity commented 8 months ago

ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 2a5e231716c436781f12452d800651f51c6383b7:

=== RUN   TestChangefeedWithSimpleDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy41154234
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected an external process virtual cluster under test; see comment at top of test_server_shim.go for details.
    changefeed_dist_test.go:345: found partitions: [{1 /Tenant/10/Table/104/1{-/2}, /Tenant/10/Table/104/1/{9-10}, /Tenant/10/Table/104/1/1{2-4}, /Tenant/10/Table/104/1/{41-50} true 14} {2 /Tenant/10/Table/104/1/{2-4}, /Tenant/10/Table/104/1/1{0-2}, /Tenant/10/Table/104/1/1{5-6}, /Tenant/10/Table/104/1/{32-41} true 14} {3 /Tenant/10/Table/104/1/{4-9}, /Tenant/10/Table/104/1/1{4-5}, /Tenant/10/Table/104/1/1{6-8} true 8} {6 /Tenant/10/Table/104/{1/50-2} true 14} {5 /Tenant/10/Table/104/1/{18-32} true 14}]
    changefeed_dist_test.go:458: range counts: [14 14 8 0 14 14 0 0]
    changefeed_dist_test.go:522: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:522
            Error:          "14" is not less than or equal to "12"
            Test:           TestChangefeedWithSimpleDistributionStrategy
            Messages:       counts [14 14 8 0 14 14 0 0] contains value greater than upper bound 12
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy41154234
--- FAIL: TestChangefeedWithSimpleDistributionStrategy (168.55s)

Parameters:

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/cdc

This test on roachdash | Improve this report!

Jira issue: CRDB-37230

rharding6373 commented 8 months ago

This test expects the 64 ranges to be evenly distributed across 6 nodes, but they were only distributed to 5 (note that node 4 is missing from the "found partitions" list). The balanced distribution algorithm only distributes to nodes initially assigned by distsql PartitionSpans, so it's most likely that distsql only used 5/6 nodes. All 64 ranges are accounted for, so distsql may not have used node 4 for some reason. Did something happen to that node during the test? I was unable to replicate this test using the random seed.

I think that the test could be made more resilient, and we should look whether we can enable the expensive logging for the test so we have useful debug information should this happen again.

Before removing the release blocker label, let's try to stress the test with the new logging around the rebalancer enabled and see if we can get a repro. I don't think this is a bug in the changefeed rebalancer, but it would be great to confirm that using the logs.

wenyihu6 commented 8 months ago

I think Jay mentioned that the planner could return fewer nodes than available in the thread here https://cockroachlabs.slack.com/archives/C065X5307U3/p1709060892451439?thread_ts=1709052881.656199&cid=C065X5307U3. Do you might know why?

rharding6373 commented 8 months ago

There are a few reasons why this might happen:

  1. The oracle didn't use a node (e.g., not a leaseholder, or isn't a replica depending on the type of oracle)
  2. The node is determined to be unhealthy at the time of planning
  3. There's a locality filter that doesn't include a node (CDC doesn't care about this, since we don't want to plan changefeeds on nodes outside of the locality filter)

I could see 1 or 2 happening in this test. Maybe something happened to node 4 when the test ran, so it was determined to be unhealthy during planning and the ranges it would have had were planned on a different node instead. Or maybe during the test the leaseholders changed from their original assignments and so the oracle didn't return node 4 for any spans.

Alternatively, maybe there is some indeterminism in the test setup that splits the ranges across the 6 nodes, so the oracle didn't return node 4 because it never had leases for any ranges. I did a cursory look at this code and it seemed deterministic, but maybe I missed something.

None of these point to a bug in the database code, just things that could be hardened in the test. So I'm going to remove the release blocker label.

cockroach-teamcity commented 7 months ago

ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 3c643b003890560b16c4fee1d1c18bea1871803b:

Fatal error:

panic: test timed out after 14m57s
running tests:
    TestChangefeedWithSimpleDistributionStrategy (14m54s)

Stack:

goroutine 476663 [running]:
testing.(*M).startAlarm.func1()
    GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
    GOROOT/src/time/sleep.go:177 +0x2d
Log preceding fatal error

``` === RUN TestChangefeedWithSimpleDistributionStrategy test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy2620796370 test_log_scope.go:81: use -show-logs to present logs inline test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details. ```

Parameters:

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

cockroach-teamcity commented 7 months ago

ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 934684d8134fd8bb34eae1a37f3aa83a4ac066b7:

Fatal error:

panic: test timed out after 14m57s
running tests:
    TestChangefeedWithSimpleDistributionStrategy (14m53s)

Stack:

goroutine 393766 [running]:
testing.(*M).startAlarm.func1()
    GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
    GOROOT/src/time/sleep.go:177 +0x2d
Log preceding fatal error

``` === RUN TestChangefeedWithSimpleDistributionStrategy test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy804844112 test_log_scope.go:81: use -show-logs to present logs inline test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details. ```

Parameters:

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

cockroach-teamcity commented 7 months ago

ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 714dc4da48a3d2a07b5d097542808982f848f704:

Fatal error:

panic: test timed out after 14m57s
running tests:
    TestChangefeedWithSimpleDistributionStrategy (14m54s)

Stack:

goroutine 378904 [running]:
testing.(*M).startAlarm.func1()
    GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
    GOROOT/src/time/sleep.go:177 +0x2d
Log preceding fatal error

``` === RUN TestChangefeedWithSimpleDistributionStrategy test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy4162182713 test_log_scope.go:81: use -show-logs to present logs inline test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details. ```

Parameters:

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

cockroach-teamcity commented 7 months ago

ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 737df7fe75c5698d9ba384ad322f30963582cc42:

=== RUN   TestChangefeedWithSimpleDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy2950077343
    test_log_scope.go:81: use -show-logs to present logs inline
    changefeed_dist_test.go:345: found partitions: [{5 /Table/104/1{-/32} true 32} {6 /Table/104/{1/32-2} true 32}]
    changefeed_dist_test.go:458: range counts: [0 0 0 0 32 32 0 0]
    changefeed_dist_test.go:522: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:522
            Error:          "32" is not less than or equal to "12"
            Test:           TestChangefeedWithSimpleDistributionStrategy
            Messages:       counts [0 0 0 0 32 32 0 0] contains value greater than upper bound 12
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy2950077343
--- FAIL: TestChangefeedWithSimpleDistributionStrategy (104.66s)

Parameters:

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

rytaft commented 7 months ago

Saw this here: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/14916124?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

cockroach-teamcity commented 7 months ago

ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 347cdc76d4c5abb2e872f325e944337a46b5883f:

=== RUN   TestChangefeedWithSimpleDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy1006945489
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    changefeed_dist_test.go:345: found partitions: [{1 /Tenant/10/Table/104/1{-/9}, /Tenant/10/Table/104/1/1{4-8}, /Tenant/10/Table/104/1/3{1-2} true 14} {3 /Tenant/10/Table/104/1/4{0-5}, /Tenant/10/Table/104/1/4{8-9}, /Tenant/10/Table/104/1/5{0-1}, /Tenant/10/Table/104/1/5{2-3}, /Tenant/10/Table/104/1/5{5-7}, /Tenant/10/Table/104/1/{59-60}, /Tenant/10/Table/104/{1/61-2} true 14} {4 /Tenant/10/Table/104/1/{9-14}, /Tenant/10/Table/104/1/3{3-4}, /Tenant/10/Table/104/1/3{5-7}, /Tenant/10/Table/104/1/{38-40} true 10} {5 /Tenant/10/Table/104/1/{18-31} true 13} {6 /Tenant/10/Table/104/1/3{2-3}, /Tenant/10/Table/104/1/3{4-5}, /Tenant/10/Table/104/1/3{7-8}, /Tenant/10/Table/104/1/4{5-8}, /Tenant/10/Table/104/1/{49-50}, /Tenant/10/Table/104/1/5{1-2}, /Tenant/10/Table/104/1/5{3-5}, /Tenant/10/Table/104/1/5{7-9}, /Tenant/10/Table/104/1/6{0-1} true 13}]
    changefeed_dist_test.go:458: range counts: [14 0 14 10 13 13 0 0]
    changefeed_dist_test.go:522: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:522
            Error:          "14" is not less than or equal to "12"
            Test:           TestChangefeedWithSimpleDistributionStrategy
            Messages:       counts [14 0 14 10 13 13 0 0] contains value greater than upper bound 12
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy1006945489
--- FAIL: TestChangefeedWithSimpleDistributionStrategy (167.45s)

Parameters:

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

wenyihu6 commented 7 months ago

Skipping it in https://github.com/cockroachdb/cockroach/pull/122814.

andyyang890 commented 3 months ago

Reassigning to myself since it's likely the same cause as #120470.