cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

ccl/changefeedccl: TestChangefeedWithNoDistributionStrategy failed #120470

Open cockroach-teamcity opened 6 months ago

cockroach-teamcity commented 6 months ago

ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 455b16592df7d8efd121b3ba1256fb477e227564:

=== RUN   TestChangefeedWithNoDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy3469676430
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected an external process virtual cluster under test; see comment at top of test_server_shim.go for details.
    changefeed_dist_test.go:229: found partitions: [{1 /Tenant/10/Table/104/1{-/2}, /Tenant/10/Table/104/1/{39-40}} {2 /Tenant/10/Table/104/1/{2-4}} {3 /Tenant/10/Table/104/1/{4-8}} {4 /Tenant/10/Table/104/1/{8-16}} {5 /Tenant/10/Table/104/1/{16-32}} {6 /Tenant/10/Table/104/1/3{2-9}, /Tenant/10/Table/104/{1/40-2}}]
    changefeed_dist_test.go:338: range counts: [3 2 4 8 16 31 0 0]
    changefeed_dist_test.go:370: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:370
            Error:          Should be true
            Test:           TestChangefeedWithNoDistributionStrategy
            Messages:       unexpected counts [3 2 4 8 16 31 0 0], partitions: [{1 /Tenant/10/Table/104/1{-/2}, /Tenant/10/Table/104/1/{39-40}} {2 /Tenant/10/Table/104/1/{2-4}} {3 /Tenant/10/Table/104/1/{4-8}} {4 /Tenant/10/Table/104/1/{8-16}} {5 /Tenant/10/Table/104/1/{16-32}} {6 /Tenant/10/Table/104/1/3{2-9}, /Tenant/10/Table/104/{1/40-2}}]
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy3469676430
--- FAIL: TestChangefeedWithNoDistributionStrategy (149.87s)

Parameters:

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/cdc

This test on roachdash | Improve this report!

Jira issue: CRDB-36705

cockroach-teamcity commented 6 months ago

ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 067e48d29b9093038f6fcf2074cd761ffdcd4fe2:

=== RUN   TestChangefeedWithNoDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy954441901
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    changefeed_dist_test.go:229: found partitions: [{1 /Tenant/10/Table/104/1{-/2}} {2 /Tenant/10/Table/104/1/{2-4}} {3 /Tenant/10/Table/104/1/{4-8}} {4 /Tenant/10/Table/104/1/{8-16}} {5 /Tenant/10/Table/104/1/{16-32}} {6 /Tenant/10/Table/104/1/{32-50}, /Tenant/10/Table/104/{1/51-2}} {8 /Tenant/10/Table/104/1/5{0-1}}]
    changefeed_dist_test.go:338: range counts: [2 2 4 8 16 31 0 1]
    changefeed_dist_test.go:370: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:370
            Error:          Should be true
            Test:           TestChangefeedWithNoDistributionStrategy
            Messages:       unexpected counts [2 2 4 8 16 31 0 1], partitions: [{1 /Tenant/10/Table/104/1{-/2}} {2 /Tenant/10/Table/104/1/{2-4}} {3 /Tenant/10/Table/104/1/{4-8}} {4 /Tenant/10/Table/104/1/{8-16}} {5 /Tenant/10/Table/104/1/{16-32}} {6 /Tenant/10/Table/104/1/{32-50}, /Tenant/10/Table/104/{1/51-2}} {8 /Tenant/10/Table/104/1/5{0-1}}]
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy954441901
--- FAIL: TestChangefeedWithNoDistributionStrategy (106.03s)

cockroach-teamcity commented 6 months ago

ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 0b6de9c809f8a4df2ba943a8c9dd023adb03b01d:

=== RUN   TestChangefeedWithNoDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy2760564705
    test_log_scope.go:81: use -show-logs to present logs inline
    changefeed_dist_test.go:229: found partitions: [{1 /Table/104/1{-/2}} {2 /Table/104/1/{2-4}} {3 /Table/104/1/{4-8}, /Table/104/{1/63-2}} {4 /Table/104/1/{8-16}} {5 /Table/104/1/{16-32}} {6 /Table/104/1/{32-63}}]
    changefeed_dist_test.go:338: range counts: [2 2 5 8 16 31 0 0]
    changefeed_dist_test.go:370: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:370
            Error:          Should be true
            Test:           TestChangefeedWithNoDistributionStrategy
            Messages:       unexpected counts [2 2 5 8 16 31 0 0], partitions: [{1 /Table/104/1{-/2}} {2 /Table/104/1/{2-4}} {3 /Table/104/1/{4-8}, /Table/104/{1/63-2}} {4 /Table/104/1/{8-16}} {5 /Table/104/1/{16-32}} {6 /Table/104/1/{32-63}}]
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy2760564705
--- FAIL: TestChangefeedWithNoDistributionStrategy (88.94s)

jayshrivastava commented 6 months ago

I wonder if computing log(0) has something to do with this :) https://github.com/cockroachdb/cockroach/blob/0039f06034104aeafa19d4f2e1e020e433663e1b/pkg/ccl/changefeedccl/changefeed_dist_test.go#L297-L299

jayshrivastava commented 6 months ago

Running ALTER TABLE x EXPERIMENTAL_RELOCATE VALUES (ARRAY[-9223372036854775807], 0) succeeds without an error. My guess is that we get these garbage values, which are ignored 99.9% of the time, so the range stays on node 1, its original node. The flakes are probably caused by nodeID := int(math.Floor(math.Log2(float64(i)))) + 1 occasionally returning an actual node ID.

wenyihu6 commented 6 months ago

Reopening. Jay and I discussed this issue further. The PR fix should still go in, but it doesn't fully explain the flake. Tagging current L2 @andyyang890.

cockroach-teamcity commented 6 months ago

ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 3060756a084cc437114d64265e79ace5720a7317:

=== RUN   TestChangefeedWithNoDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy348384707
    test_log_scope.go:81: use -show-logs to present logs inline
    changefeed_dist_test.go:345: found partitions: [{1 /Table/104/1{-/2} true 2} {2 /Table/104/1/{2-4} true 2} {3 /Table/104/1/{4-8}, /Table/104/1/5{4-5} true 5} {4 /Table/104/1/{8-16} true 8} {5 /Table/104/1/{16-32} true 16} {6 /Table/104/1/{32-54}, /Table/104/{1/55-2} true 31}]
    changefeed_dist_test.go:458: range counts: [2 2 5 8 16 31 0 0]
    changefeed_dist_test.go:490: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:490
            Error:          Should be true
            Test:           TestChangefeedWithNoDistributionStrategy
            Messages:       unexpected counts [2 2 5 8 16 31 0 0], partitions: [{1 /Table/104/1{-/2} true 2} {2 /Table/104/1/{2-4} true 2} {3 /Table/104/1/{4-8}, /Table/104/1/5{4-5} true 5} {4 /Table/104/1/{8-16} true 8} {5 /Table/104/1/{16-32} true 16} {6 /Table/104/1/{32-54}, /Table/104/{1/55-2} true 31}]
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy348384707
--- FAIL: TestChangefeedWithNoDistributionStrategy (89.27s)

cockroach-teamcity commented 6 months ago

ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ c994982a8be5af89f594e115e897dd6d62cf99d8:

=== RUN   TestChangefeedWithNoDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy1012625163
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected an external process virtual cluster under test; see comment at top of test_server_shim.go for details.
    changefeed_dist_test.go:345: found partitions: [{1 /Tenant/10/Table/104/1{-/2}, /Tenant/10/Table/104/1/2{1-2} true 3} {2 /Tenant/10/Table/104/1/{2-4}, /Tenant/10/Table/104/1/2{0-1} true 3} {3 /Tenant/10/Table/104/1/{4-8}, /Tenant/10/Table/104/1/1{6-7}, /Tenant/10/Table/104/1/3{0-1} true 6} {4 /Tenant/10/Table/104/1/{8-16}, /Tenant/10/Table/104/1/2{6-7}, /Tenant/10/Table/104/1/3{1-2} true 10} {8 /Tenant/10/Table/104/1/{17-20}, /Tenant/10/Table/104/1/2{3-5}, /Tenant/10/Table/104/1/2{7-9} true 7} {6 /Tenant/10/Table/104/1/2{2-3}, /Tenant/10/Table/104/1/{29-30}, /Tenant/10/Table/104/{1/32-2} true 34} {7 /Tenant/10/Table/104/1/2{5-6} true 1}]
    changefeed_dist_test.go:458: range counts: [3 3 6 10 0 34 1 7]
    changefeed_dist_test.go:491: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:491
            Error:          Should be true
            Test:           TestChangefeedWithNoDistributionStrategy
            Messages:       unexpected counts [3 3 6 10 0 34 1 7], partitions: [{1 /Tenant/10/Table/104/1{-/2}, /Tenant/10/Table/104/1/2{1-2} true 3} {2 /Tenant/10/Table/104/1/{2-4}, /Tenant/10/Table/104/1/2{0-1} true 3} {3 /Tenant/10/Table/104/1/{4-8}, /Tenant/10/Table/104/1/1{6-7}, /Tenant/10/Table/104/1/3{0-1} true 6} {4 /Tenant/10/Table/104/1/{8-16}, /Tenant/10/Table/104/1/2{6-7}, /Tenant/10/Table/104/1/3{1-2} true 10} {8 /Tenant/10/Table/104/1/{17-20}, /Tenant/10/Table/104/1/2{3-5}, /Tenant/10/Table/104/1/2{7-9} true 7} {6 /Tenant/10/Table/104/1/2{2-3}, /Tenant/10/Table/104/1/{29-30}, /Tenant/10/Table/104/{1/32-2} true 34} {7 /Tenant/10/Table/104/1/2{5-6} true 1}]
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy1012625163
--- FAIL: TestChangefeedWithNoDistributionStrategy (154.46s)

andyyang890 commented 6 months ago

My hypothesis for this test failure is that the cluster settings set in the test here https://github.com/cockroachdb/cockroach/blob/1c6c01c598774c12800ee2ff6023489ba978f19e/pkg/ccl/changefeedccl/changefeed_dist_test.go#L486-L487

sometimes do not finish propagating before we create the changefeed.

Reasons for my suspicions:

Next steps:

cockroach-teamcity commented 6 months ago

ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 7488e090daa588c4d7c0f828c8006bb9b13a90f6:

=== RUN   TestChangefeedWithNoDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy2362866231
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected an external process virtual cluster under test; see comment at top of test_server_shim.go for details.
    changefeed_dist_test.go:345: found partitions: [{2 /Tenant/10/Table/104/1{-/4} true 4} {3 /Tenant/10/Table/104/1/{4-8} true 4} {4 /Tenant/10/Table/104/1/{8-16} true 8} {5 /Tenant/10/Table/104/1/{16-32} true 16} {6 /Tenant/10/Table/104/{1/32-2} true 32}]
    changefeed_dist_test.go:458: range counts: [0 4 4 8 16 32 0 0]
    changefeed_dist_test.go:491: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:491
            Error:          Should be true
            Test:           TestChangefeedWithNoDistributionStrategy
            Messages:       unexpected counts [0 4 4 8 16 32 0 0], partitions: [{2 /Tenant/10/Table/104/1{-/4} true 4} {3 /Tenant/10/Table/104/1/{4-8} true 4} {4 /Tenant/10/Table/104/1/{8-16} true 8} {5 /Tenant/10/Table/104/1/{16-32} true 16} {6 /Tenant/10/Table/104/{1/32-2} true 32}]
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy2362866231
--- FAIL: TestChangefeedWithNoDistributionStrategy (115.24s)

rharding6373 commented 6 months ago

The last failure doesn't have your commit in it.

andyyang890 commented 6 months ago

> The last failure doesn't have your commit in it.

Yep, although I'm seeing evidence that the cluster setting propagation hypothesis cannot explain the test flakes. Both in last night's run and in a local test run where I was finally able to reproduce it, the SET CLUSTER SETTING takes place on the same node as the changefeed planning. Also, I ran the local test with some additional verbose logging, and there are no logs showing that the rebalancing or bulk oracle code paths are executed. We'll need to keep investigating.

cockroach-teamcity commented 6 months ago

ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ 3292e23e914f2b5a63e892fef1705ad476be4bd7:

=== RUN   TestChangefeedWithNoDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy294299260
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected an external process virtual cluster under test; see comment at top of test_server_shim.go for details.
    changefeed_dist_test.go:486: sql: database is closed
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy294299260
--- FAIL: TestChangefeedWithNoDistributionStrategy (151.89s)

andyyang890 commented 6 months ago

Last night's failure seems like some kind of infra flake but the previous failures might have something interesting in the logs. Maybe it's something to do with ranges being merged again and/or the ranges not being relocated like we expect them to be?
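One way to see which ranges ended up on an unexpected node is to diff each reported "range counts" line against a baseline. The sketch below assumes a baseline of [2 2 4 8 16 32 0 0]. That baseline is inferred from the logs in this thread (every reported vector sums to 64), not taken from the test source, and the diffCounts helper is hypothetical.

```go
package main

import "fmt"

// expectedCounts is an assumption inferred from the failure logs in this
// thread, not a value copied from the test. Index k is node k+1.
var expectedCounts = []int{2, 2, 4, 8, 16, 32, 0, 0}

// diffCounts returns observed[k] - expected[k] for every node. A positive
// entry means the node holds extra ranges; a negative entry means it is
// missing ranges.
func diffCounts(expected, observed []int) ([]int, error) {
	if len(expected) != len(observed) {
		return nil, fmt.Errorf("length mismatch: %d vs %d", len(expected), len(observed))
	}
	deltas := make([]int, len(expected))
	for k := range expected {
		deltas[k] = observed[k] - expected[k]
	}
	return deltas, nil
}

func main() {
	// "range counts" lines copied from the failures reported above.
	observed := [][]int{
		{3, 2, 4, 8, 16, 31, 0, 0},
		{2, 2, 4, 8, 16, 31, 0, 1},
		{2, 2, 5, 8, 16, 31, 0, 0},
		{3, 3, 6, 10, 0, 34, 1, 7},
		{0, 4, 4, 8, 16, 32, 0, 0},
	}
	for _, o := range observed {
		deltas, err := diffCounts(expectedCounts, o)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println(o, "->", deltas)
	}
}
```

Under that assumed baseline, most failures differ by a single range that left node 6, which would be consistent with one range being merged or not relocated as expected.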

rickystewart commented 6 months ago

This test is very flaky. Can it be skipped?

For example, here is a failure just from last night. There are several recent failures in TeamCity CI.

cockroach-teamcity commented 6 months ago

ccl/changefeedccl.TestChangefeedWithNoDistributionStrategy failed on master @ a3ac7ebf958f25201c2696a17f996c0b9f86830f:

=== RUN   TestChangefeedWithNoDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithNoDistributionStrategy2883556737
    test_log_scope.go:81: use -show-logs to present logs inline
    changefeed_dist_test.go:345: found partitions: [{1 /Table/104/1{-/2} true 2} {2 /Table/104/1/{2-4} true 2} {3 /Table/104/1/{4-8} true 4} {4 /Table/104/1/{8-16} true 8} {5 /Table/104/1/{16-32} true 16} {6 /Table/104/1/{32-56}, /Table/104/{1/57-2} true 31} {7 /Table/104/1/5{6-7} true 1}]
    changefeed_dist_test.go:458: range counts: [2 2 4 8 16 31 1 0]
    changefeed_dist_test.go:491: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:491
            Error:          Should be true
            Test:           TestChangefeedWithNoDistributionStrategy
            Messages:       unexpected counts [2 2 4 8 16 31 1 0], partitions: [{1 /Table/104/1{-/2} true 2} {2 /Table/104/1/{2-4} true 2} {3 /Table/104/1/{4-8} true 4} {4 /Table/104/1/{8-16} true 8} {5 /Table/104/1/{16-32} true 16} {6 /Table/104/1/{32-56}, /Table/104/{1/57-2} true 31} {7 /Table/104/1/5{6-7} true 1}]
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithNoDistributionStrategy2883556737
--- FAIL: TestChangefeedWithNoDistributionStrategy (67.73s)

Same failure on other branches

- #121338 ccl/changefeedccl: TestChangefeedWithNoDistributionStrategy failed [A-cdc C-test-failure O-robot T-cdc branch-release-24.1 release-blocker]

andyyang890 commented 6 months ago

> This test is very flaky. Can it be skipped?

Ack, I'll put up a PR skipping it.

rharding6373 commented 5 months ago

Note that when this is investigated and fixed, we'll need to unskip the test and backport it to 24.1.

andyyang890 commented 2 months ago

Reducing the priority of this test failure since it tests the distribution when the bulk oracle is off, but the bulk oracle is now on by default and generally recommended.