Open cockroach-teamcity opened 8 months ago
This test expects the 64 ranges to be evenly distributed across 6 nodes, but they were only distributed to 5 (note that node 4 is missing from the "found partitions" list). The balanced distribution algorithm only distributes to nodes initially assigned by distsql PartitionSpans, so it's most likely that distsql only used 5/6 nodes. All 64 ranges are accounted for, so distsql may not have used node 4 for some reason. Did something happen to that node during the test? I was unable to replicate this test using the random seed.
I think that the test could be made more resilient, and we should look whether we can enable the expensive logging for the test so we have useful debug information should this happen again.
Before removing the release blocker label, let's try to stress the test with the new logging around the rebalancer enabled and see if we can get a repro. I don't think this is a bug in the changefeed rebalancer, but it would be great to confirm that using the logs.
I think Jay mentioned that the planner could return fewer nodes than available in the thread here https://cockroachlabs.slack.com/archives/C065X5307U3/p1709060892451439?thread_ts=1709052881.656199&cid=C065X5307U3. Do you might know why?
There are a few reasons why this might happen:
I could see 1 or 2 happening in this test. Maybe something happened to node 4 when the test ran, so it was determined to be unhealthy during planning and the ranges it would have had were planned on a different node instead. Or maybe during the test the leaseholders changed from their original assignments and so the oracle didn't return node 4 for any spans.
Alternatively, maybe there is some indeterminism in the test setup that splits the ranges across the 6 nodes, so the oracle didn't return node 4 because it never had leases for any ranges. I did a cursory look at this code and it seemed deterministic, but maybe I missed something.
None of these point to a bug in the database code, just things that could be hardened in the test. So I'm going to remove the release blocker label.
ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 3c643b003890560b16c4fee1d1c18bea1871803b:
Fatal error:
panic: test timed out after 14m57s
running tests:
TestChangefeedWithSimpleDistributionStrategy (14m54s)
Stack:
goroutine 476663 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestChangefeedWithSimpleDistributionStrategy test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy2620796370 test_log_scope.go:81: use -show-logs to present logs inline test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details. ```
Parameters:
attempt=1
run=2
shard=1
ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 934684d8134fd8bb34eae1a37f3aa83a4ac066b7:
Fatal error:
panic: test timed out after 14m57s
running tests:
TestChangefeedWithSimpleDistributionStrategy (14m53s)
Stack:
goroutine 393766 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestChangefeedWithSimpleDistributionStrategy test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy804844112 test_log_scope.go:81: use -show-logs to present logs inline test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details. ```
Parameters:
attempt=1
run=5
shard=1
ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 714dc4da48a3d2a07b5d097542808982f848f704:
Fatal error:
panic: test timed out after 14m57s
running tests:
TestChangefeedWithSimpleDistributionStrategy (14m54s)
Stack:
goroutine 378904 [running]:
testing.(*M).startAlarm.func1()
GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
GOROOT/src/time/sleep.go:177 +0x2d
``` === RUN TestChangefeedWithSimpleDistributionStrategy test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy4162182713 test_log_scope.go:81: use -show-logs to present logs inline test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details. ```
Parameters:
attempt=1
run=1
shard=1
ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 737df7fe75c5698d9ba384ad322f30963582cc42:
=== RUN TestChangefeedWithSimpleDistributionStrategy
test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy2950077343
test_log_scope.go:81: use -show-logs to present logs inline
changefeed_dist_test.go:345: found partitions: [{5 /Table/104/1{-/32} true 32} {6 /Table/104/{1/32-2} true 32}]
changefeed_dist_test.go:458: range counts: [0 0 0 0 32 32 0 0]
changefeed_dist_test.go:522:
Error Trace: github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:522
Error: "32" is not less than or equal to "12"
Test: TestChangefeedWithSimpleDistributionStrategy
Messages: counts [0 0 0 0 32 32 0 0] contains value greater than upper bound 12
panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy2950077343
--- FAIL: TestChangefeedWithSimpleDistributionStrategy (104.66s)
Parameters:
attempt=1
run=12
shard=1
ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 347cdc76d4c5abb2e872f325e944337a46b5883f:
=== RUN TestChangefeedWithSimpleDistributionStrategy
test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy1006945489
test_log_scope.go:81: use -show-logs to present logs inline
test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
changefeed_dist_test.go:345: found partitions: [{1 /Tenant/10/Table/104/1{-/9}, /Tenant/10/Table/104/1/1{4-8}, /Tenant/10/Table/104/1/3{1-2} true 14} {3 /Tenant/10/Table/104/1/4{0-5}, /Tenant/10/Table/104/1/4{8-9}, /Tenant/10/Table/104/1/5{0-1}, /Tenant/10/Table/104/1/5{2-3}, /Tenant/10/Table/104/1/5{5-7}, /Tenant/10/Table/104/1/{59-60}, /Tenant/10/Table/104/{1/61-2} true 14} {4 /Tenant/10/Table/104/1/{9-14}, /Tenant/10/Table/104/1/3{3-4}, /Tenant/10/Table/104/1/3{5-7}, /Tenant/10/Table/104/1/{38-40} true 10} {5 /Tenant/10/Table/104/1/{18-31} true 13} {6 /Tenant/10/Table/104/1/3{2-3}, /Tenant/10/Table/104/1/3{4-5}, /Tenant/10/Table/104/1/3{7-8}, /Tenant/10/Table/104/1/4{5-8}, /Tenant/10/Table/104/1/{49-50}, /Tenant/10/Table/104/1/5{1-2}, /Tenant/10/Table/104/1/5{3-5}, /Tenant/10/Table/104/1/5{7-9}, /Tenant/10/Table/104/1/6{0-1} true 13}]
changefeed_dist_test.go:458: range counts: [14 0 14 10 13 13 0 0]
changefeed_dist_test.go:522:
Error Trace: github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:522
Error: "14" is not less than or equal to "12"
Test: TestChangefeedWithSimpleDistributionStrategy
Messages: counts [14 0 14 10 13 13 0 0] contains value greater than upper bound 12
panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy1006945489
--- FAIL: TestChangefeedWithSimpleDistributionStrategy (167.45s)
Parameters:
attempt=1
run=1
shard=1
Skipping it in https://github.com/cockroachdb/cockroach/pull/122814.
Reassigning to myself since it's likely the same cause as #120470.
ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 2a5e231716c436781f12452d800651f51c6383b7:
Parameters:
attempt=1
run=20
shard=1
Help
See also: How To Investigate a Go Test Failure (internal)
/cc @cockroachdb/cdcThis test on roachdash | Improve this report!
Jira issue: CRDB-37230