Open cockroach-teamcity opened 8 months ago
ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 067e48d29b9093038f6fcf2074cd761ffdcd4fe2:
=== RUN TestChangefeedWithNoDistributionStrategy
test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy954441901
test_log_scope.go:81: use -show-logs to present logs inline
test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
changefeed_dist_test.go:229: found partitions: [{1 /Tenant/10/Table/104/1{-/2}} {2 /Tenant/10/Table/104/1/{2-4}} {3 /Tenant/10/Table/104/1/{4-8}} {4 /Tenant/10/Table/104/1/{8-16}} {5 /Tenant/10/Table/104/1/{16-32}} {6 /Tenant/10/Table/104/1/{32-50}, /Tenant/10/Table/104/{1/51-2}} {8 /Tenant/10/Table/104/1/5{0-1}}]
changefeed_dist_test.go:338: range counts: [2 2 4 8 16 31 0 1]
changefeed_dist_test.go:370:
Error Trace: github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:370
Error: Should be true
Test: TestChangefeedWithNoDistributionStrategy
Messages: unexpected counts [2 2 4 8 16 31 0 1], partitions: [{1 /Tenant/10/Table/104/1{-/2}} {2 /Tenant/10/Table/104/1/{2-4}} {3 /Tenant/10/Table/104/1/{4-8}} {4 /Tenant/10/Table/104/1/{8-16}} {5 /Tenant/10/Table/104/1/{16-32}} {6 /Tenant/10/Table/104/1/{32-50}, /Tenant/10/Table/104/{1/51-2}} {8 /Tenant/10/Table/104/1/5{0-1}}]
panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy954441901
--- FAIL: TestChangefeedWithNoDistributionStrategy (106.03s)
Parameters:
=true
attempt=1
run=10
shard=16
ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 0b6de9c809f8a4df2ba943a8c9dd023adb03b01d:
=== RUN TestChangefeedWithNoDistributionStrategy
test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy2760564705
test_log_scope.go:81: use -show-logs to present logs inline
changefeed_dist_test.go:229: found partitions: [{1 /Table/104/1{-/2}} {2 /Table/104/1/{2-4}} {3 /Table/104/1/{4-8}, /Table/104/{1/63-2}} {4 /Table/104/1/{8-16}} {5 /Table/104/1/{16-32}} {6 /Table/104/1/{32-63}}]
changefeed_dist_test.go:338: range counts: [2 2 5 8 16 31 0 0]
changefeed_dist_test.go:370:
Error Trace: github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:370
Error: Should be true
Test: TestChangefeedWithNoDistributionStrategy
Messages: unexpected counts [2 2 5 8 16 31 0 0], partitions: [{1 /Table/104/1{-/2}} {2 /Table/104/1/{2-4}} {3 /Table/104/1/{4-8}, /Table/104/{1/63-2}} {4 /Table/104/1/{8-16}} {5 /Table/104/1/{16-32}} {6 /Table/104/1/{32-63}}]
panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy2760564705
--- FAIL: TestChangefeedWithNoDistributionStrategy (88.94s)
Parameters:
=true
attempt=1
run=7
shard=16
I wonder if computing log(0) has something to do with this :) https://github.com/cockroachdb/cockroach/blob/0039f06034104aeafa19d4f2e1e020e433663e1b/pkg/ccl/changefeedccl/changefeed_dist_test.go#L297-L299
Running ALTER TABLE x EXPERIMENTAL_RELOCATE VALUES (ARRAY[-9223372036854775807], 0) succeeds without an error. I'm guessing we get these garbage values, which are ignored 99.9% of the time, and the range stays on node 1, which is its original node. These flakes are probably caused by nodeID := int(math.Floor(math.Log2(float64(i)))) + 1 returning an actual node ID sometimes.
Reopening. Jay and I discussed this issue further. The PR fix is still worth doing, but it doesn't fully explain the flake. Tagging current L2 @andyyang890.
ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 3060756a084cc437114d64265e79ace5720a7317:
=== RUN TestChangefeedWithNoDistributionStrategy
test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy348384707
test_log_scope.go:81: use -show-logs to present logs inline
changefeed_dist_test.go:345: found partitions: [{1 /Table/104/1{-/2} true 2} {2 /Table/104/1/{2-4} true 2} {3 /Table/104/1/{4-8}, /Table/104/1/5{4-5} true 5} {4 /Table/104/1/{8-16} true 8} {5 /Table/104/1/{16-32} true 16} {6 /Table/104/1/{32-54}, /Table/104/{1/55-2} true 31}]
changefeed_dist_test.go:458: range counts: [2 2 5 8 16 31 0 0]
changefeed_dist_test.go:490:
Error Trace: github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:490
Error: Should be true
Test: TestChangefeedWithNoDistributionStrategy
Messages: unexpected counts [2 2 5 8 16 31 0 0], partitions: [{1 /Table/104/1{-/2} true 2} {2 /Table/104/1/{2-4} true 2} {3 /Table/104/1/{4-8}, /Table/104/1/5{4-5} true 5} {4 /Table/104/1/{8-16} true 8} {5 /Table/104/1/{16-32} true 16} {6 /Table/104/1/{32-54}, /Table/104/{1/55-2} true 31}]
panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy348384707
--- FAIL: TestChangefeedWithNoDistributionStrategy (89.27s)
Parameters:
attempt=1
run=1
shard=16
ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ c994982a8be5af89f594e115e897dd6d62cf99d8:
=== RUN TestChangefeedWithNoDistributionStrategy
test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy1012625163
test_log_scope.go:81: use -show-logs to present logs inline
test_server_shim.go:157: automatically injected an external process virtual cluster under test; see comment at top of test_server_shim.go for details.
changefeed_dist_test.go:345: found partitions: [{1 /Tenant/10/Table/104/1{-/2}, /Tenant/10/Table/104/1/2{1-2} true 3} {2 /Tenant/10/Table/104/1/{2-4}, /Tenant/10/Table/104/1/2{0-1} true 3} {3 /Tenant/10/Table/104/1/{4-8}, /Tenant/10/Table/104/1/1{6-7}, /Tenant/10/Table/104/1/3{0-1} true 6} {4 /Tenant/10/Table/104/1/{8-16}, /Tenant/10/Table/104/1/2{6-7}, /Tenant/10/Table/104/1/3{1-2} true 10} {8 /Tenant/10/Table/104/1/{17-20}, /Tenant/10/Table/104/1/2{3-5}, /Tenant/10/Table/104/1/2{7-9} true 7} {6 /Tenant/10/Table/104/1/2{2-3}, /Tenant/10/Table/104/1/{29-30}, /Tenant/10/Table/104/{1/32-2} true 34} {7 /Tenant/10/Table/104/1/2{5-6} true 1}]
changefeed_dist_test.go:458: range counts: [3 3 6 10 0 34 1 7]
changefeed_dist_test.go:491:
Error Trace: github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:491
Error: Should be true
Test: TestChangefeedWithNoDistributionStrategy
Messages: unexpected counts [3 3 6 10 0 34 1 7], partitions: [{1 /Tenant/10/Table/104/1{-/2}, /Tenant/10/Table/104/1/2{1-2} true 3} {2 /Tenant/10/Table/104/1/{2-4}, /Tenant/10/Table/104/1/2{0-1} true 3} {3 /Tenant/10/Table/104/1/{4-8}, /Tenant/10/Table/104/1/1{6-7}, /Tenant/10/Table/104/1/3{0-1} true 6} {4 /Tenant/10/Table/104/1/{8-16}, /Tenant/10/Table/104/1/2{6-7}, /Tenant/10/Table/104/1/3{1-2} true 10} {8 /Tenant/10/Table/104/1/{17-20}, /Tenant/10/Table/104/1/2{3-5}, /Tenant/10/Table/104/1/2{7-9} true 7} {6 /Tenant/10/Table/104/1/2{2-3}, /Tenant/10/Table/104/1/{29-30}, /Tenant/10/Table/104/{1/32-2} true 34} {7 /Tenant/10/Table/104/1/2{5-6} true 1}]
panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy1012625163
--- FAIL: TestChangefeedWithNoDistributionStrategy (154.46s)
Parameters:
attempt=1
run=3
shard=16
My hypothesis is that the cluster settings set in the test here https://github.com/cockroachdb/cockroach/blob/1c6c01c598774c12800ee2ff6023489ba978f19e/pkg/ccl/changefeedccl/changefeed_dist_test.go#L486-L487 sometimes do not finish propagating before we create the changefeed.
Reasons for my suspicions:
[3 3 6 10 0 34 1 7]. This wasn't previously observed, and the timing lines up with when the bulk oracle setting was added.
Next steps:
ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 7488e090daa588c4d7c0f828c8006bb9b13a90f6:
=== RUN TestChangefeedWithNoDistributionStrategy
test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy2362866231
test_log_scope.go:81: use -show-logs to present logs inline
test_server_shim.go:157: automatically injected an external process virtual cluster under test; see comment at top of test_server_shim.go for details.
changefeed_dist_test.go:345: found partitions: [{2 /Tenant/10/Table/104/1{-/4} true 4} {3 /Tenant/10/Table/104/1/{4-8} true 4} {4 /Tenant/10/Table/104/1/{8-16} true 8} {5 /Tenant/10/Table/104/1/{16-32} true 16} {6 /Tenant/10/Table/104/{1/32-2} true 32}]
changefeed_dist_test.go:458: range counts: [0 4 4 8 16 32 0 0]
changefeed_dist_test.go:491:
Error Trace: github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:491
Error: Should be true
Test: TestChangefeedWithNoDistributionStrategy
Messages: unexpected counts [0 4 4 8 16 32 0 0], partitions: [{2 /Tenant/10/Table/104/1{-/4} true 4} {3 /Tenant/10/Table/104/1/{4-8} true 4} {4 /Tenant/10/Table/104/1/{8-16} true 8} {5 /Tenant/10/Table/104/1/{16-32} true 16} {6 /Tenant/10/Table/104/{1/32-2} true 32}]
panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy2362866231
--- FAIL: TestChangefeedWithNoDistributionStrategy (115.24s)
Parameters:
attempt=1
run=20
shard=16
The last failure doesn't have your commit in it.
Yep, although I'm seeing evidence that the cluster setting propagation hypothesis cannot explain the test flakes. Both in last night's run and in a local test run where I was finally able to reproduce it, the SET CLUSTER SETTING takes place on the same node as the changefeed planning. Also, I ran the local test with some additional verbose logging, and there are no logs showing that the rebalancing or bulk oracle code paths are executed. We'll need to keep investigating.
ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 3292e23e914f2b5a63e892fef1705ad476be4bd7:
=== RUN TestChangefeedWithNoDistributionStrategy
test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy294299260
test_log_scope.go:81: use -show-logs to present logs inline
test_server_shim.go:157: automatically injected an external process virtual cluster under test; see comment at top of test_server_shim.go for details.
changefeed_dist_test.go:486: sql: database is closed
panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy294299260
--- FAIL: TestChangefeedWithNoDistributionStrategy (151.89s)
Parameters:
attempt=1
run=4
shard=16
Last night's failure seems like some kind of infra flake but the previous failures might have something interesting in the logs. Maybe it's something to do with ranges being merged again and/or the ranges not being relocated like we expect them to be?
This test is very flaky. Can it be skipped?
For example, here is a failure just from last night. There are several recent failures in TeamCity CI.
ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ a3ac7ebf958f25201c2696a17f996c0b9f86830f:
=== RUN TestChangefeedWithNoDistributionStrategy
test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy2883556737
test_log_scope.go:81: use -show-logs to present logs inline
changefeed_dist_test.go:345: found partitions: [{1 /Table/104/1{-/2} true 2} {2 /Table/104/1/{2-4} true 2} {3 /Table/104/1/{4-8} true 4} {4 /Table/104/1/{8-16} true 8} {5 /Table/104/1/{16-32} true 16} {6 /Table/104/1/{32-56}, /Table/104/{1/57-2} true 31} {7 /Table/104/1/5{6-7} true 1}]
changefeed_dist_test.go:458: range counts: [2 2 4 8 16 31 1 0]
changefeed_dist_test.go:491:
Error Trace: github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:491
Error: Should be true
Test: TestChangefeedWithNoDistributionStrategy
Messages: unexpected counts [2 2 4 8 16 31 1 0], partitions: [{1 /Table/104/1{-/2} true 2} {2 /Table/104/1/{2-4} true 2} {3 /Table/104/1/{4-8} true 4} {4 /Table/104/1/{8-16} true 8} {5 /Table/104/1/{16-32} true 16} {6 /Table/104/1/{32-56}, /Table/104/{1/57-2} true 31} {7 /Table/104/1/5{6-7} true 1}]
panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy2883556737
--- FAIL: TestChangefeedWithNoDistributionStrategy (67.73s)
Parameters:
attempt=1
run=29
shard=16
See also: How To Investigate a Go Test Failure (internal)
- #121338 ccl/changefeedccl: TestChangefeedWithNoDistributionStrategy failed [A-cdc C-test-failure O-robot T-cdc branch-release-24.1 release-blocker]
Ack, I'll put up a PR skipping it.
Note that once this is investigated and fixed, we'll need to unskip the test and backport it to 24.1.
Reducing the priority of this test failure since it tests the distribution when the bulk oracle is off, but the bulk oracle is now on by default and generally recommended.
ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 455b16592df7d8efd121b3ba1256fb477e227564:
Parameters:
attempt=1
run=4
shard=16
/cc @cockroachdb/cdc
Jira issue: CRDB-36705