cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.08k stars 3.8k forks source link

ccl/multiregionccl: TestMultiRegionDataDriven failed #74783

Closed cockroach-teamcity closed 2 years ago

cockroach-teamcity commented 2 years ago

ccl/multiregionccl.TestMultiRegionDataDriven failed with artifacts on master @ c3d71ac887844bef174abb6dab2a4e1ce9270ab7:

=== RUN   TestMultiRegionDataDriven
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven853361253
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestMultiRegionDataDriven
    datadriven_test.go:390: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven853361253
--- FAIL: TestMultiRegionDataDriven (42.01s)
=== RUN   TestMultiRegionDataDriven/regional_by_table
    datadriven_test.go:122: 
        testdata/regional_by_table:226: SELECT * FROM db.rbt AS OF SYSTEM TIME follower_read_timestamp() WHERE k = 1
        expected:
        served locally: true
        served via follower read: true

        found:
        served locally: true
        served via follower read: false
    --- FAIL: TestMultiRegionDataDriven/regional_by_table (11.54s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM) Parameters in this failure: - GOFLAGS=-parallel=4

Same failure on other branches

- #73829 ccl/multiregionccl: TestMultiRegionDataDriven failed [C-test-failure O-robot branch-release-21.2]

/cc @cockroachdb/multiregion

This test on roachdash | Improve this report!

Jira issue: CRDB-12258

cockroach-teamcity commented 2 years ago

ccl/multiregionccl.TestMultiRegionDataDriven failed with artifacts on master @ 912964e02ddd951c77d4f71981ae18b3894e9084:

=== RUN   TestMultiRegionDataDriven
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven3462472544
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestMultiRegionDataDriven
    datadriven_test.go:390: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven3462472544
--- FAIL: TestMultiRegionDataDriven (97.05s)
=== RUN   TestMultiRegionDataDriven/regional_by_table
    datadriven_test.go:122: 
        testdata/regional_by_table:101: SELECT * FROM db.rbt WHERE k = 1
        expected:
        served locally: true
        served via follower read: false

        found:
        served locally: false
    --- FAIL: TestMultiRegionDataDriven/regional_by_table (11.06s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM) Parameters in this failure: - GOFLAGS=-parallel=4

Same failure on other branches

- #73829 ccl/multiregionccl: TestMultiRegionDataDriven failed [C-test-failure O-robot branch-release-21.2]

This test on roachdash | Improve this report!

cockroach-teamcity commented 2 years ago

ccl/multiregionccl.TestMultiRegionDataDriven failed with artifacts on master @ 9b0065ca2450205a34d4237be7b317c2d895658c:

=== RUN   TestMultiRegionDataDriven
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven2898876028
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestMultiRegionDataDriven
    datadriven_test.go:390: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven2898876028
--- FAIL: TestMultiRegionDataDriven (41.73s)
=== RUN   TestMultiRegionDataDriven/regional_by_table
    datadriven_test.go:122: 
        testdata/regional_by_table:154: SELECT * FROM db.rbt AS OF SYSTEM TIME follower_read_timestamp() WHERE k = 1
        expected:
        served locally: false

        found:
        served locally: true
        served via follower read: true
    --- FAIL: TestMultiRegionDataDriven/regional_by_table (12.23s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM) Parameters in this failure: - GOFLAGS=-parallel=4

Same failure on other branches

- #73829 ccl/multiregionccl: TestMultiRegionDataDriven failed [C-test-failure O-robot branch-release-21.2]

This test on roachdash | Improve this report!

cockroach-teamcity commented 2 years ago

ccl/multiregionccl.TestMultiRegionDataDriven failed with artifacts on master @ e1068d77afbd39b162978281c9da7cbea49c1c3a:

../../util/stop/stopper.go:327 stop.(*Stopper).RunTask { f(ctx) }
../../server/node.go:863 server.(*Node).writeNodeStatus { err = n.recorder.WriteNodeStatus(ctx, n.storeCfg.DB, *nodeStatus, mustExist) }
../../server/node.go:793 server.(*Node).startWriteNodeStatus { // node decommissioning, so we have to error out if we can't create it. }
../../server/server.go:1802 server.(*Server).PreStart { // Begin recording status summaries. }
../../server/server.go:1253 server.(*Server).Start { func (s *Server) Start(ctx context.Context) error { }
../../server/testserver.go:499 server.(*TestServer).Start { func (ts *TestServer) Start(ctx context.Context) error { }
../../testutils/testcluster/testcluster.go:485 testcluster.(*TestCluster).startServer { server := tc.Servers[idx] }
../../testutils/testcluster/testcluster.go:315 testcluster.(*TestCluster).Start { } else { }
../../testutils/testcluster/testcluster.go:185 testcluster.StartTestCluster { cluster.Start(t) }
datadriven_test.go:164 multiregionccl_test.TestMultiRegionDataDriven.func1.1 { }) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:321 datadriven.runDirective.func1 { actual := f(t, d) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:354 datadriven.runDirective { expectedLines := difflib.SplitLines(d.Expected) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:196 datadriven.runDirectiveOrSubTest { } }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:166 datadriven.runTestInternal { r := newTestDataReader(t, sourceName, reader, rewrite) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:136 datadriven.RunTest { rewriteData := runTestInternal(t, path, file, f, *rewriteTestFiles) }
datadriven_test.go:388 multiregionccl_test.TestMultiRegionDataDriven.func1 { }) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:413 datadriven.Walk { f(t, path) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:426 datadriven.Walk.func1 { Walk(t, filepath.Join(path, file.Name()), f) }

goroutine 38921 lock 0xc0040ecca8
../../util/tracing/crdbspan.go:360 tracing.(*crdbSpan).getVerboseRecording { // grandchildren there are. } <<<<<
../../util/tracing/crdbspan.go:357 tracing.(*crdbSpan).getVerboseRecording {  }
../../util/tracing/crdbspan.go:338 tracing.(*crdbSpan).getRecordingImpl { case RecordingVerbose: }
../../util/tracing/crdbspan.go:779 tracing.(*crdbSpan).childFinished { rec = child.GetRecording(RecordingVerbose, false /* finishing - the child is already finished */) }
../../util/tracing/crdbspan.go:321 tracing.(*crdbSpan).GetRecording { func (s *crdbSpan) GetRecording(recType RecordingType, finishing bool) Recording { }
../../util/tracing/crdbspan.go:242 tracing.(*crdbSpan).finish { parent.Span.i.crdb.childFinished(s) }
../../util/tracing/span_inner.go:91 tracing.(*spanInner).Finish {  }
../../util/tracing/span_inner.go:226 tracing.(*spanInner).Tracer { func (s *spanInner) Tracer() *Tracer { }
../../util/tracing/span.go:246 tracing.(*Span).Finish { sp.finishInternal() }
../../kv/kvserver/replica_range_lease.go:480 kvserver.(*pendingLeaseRequest).requestLeaseAsync.func2 { p.cancelLocked() }
../../util/stop/stopper.go:494 stop.(*Stopper).RunAsyncTaskEx.func2 { f(ctx) }

goroutine 38921 lock 0xc0040edba8
../../util/tracing/crdbspan.go:726 tracing.(*crdbSpan).childFinished { s.mu.Lock() } <<<<<
../../util/tracing/crdbspan.go:725 tracing.(*crdbSpan).childFinished { func (s *crdbSpan) childFinished(child *crdbSpan) { }
../../util/tracing/crdbspan.go:242 tracing.(*crdbSpan).finish { parent.Span.i.crdb.childFinished(s) }
../../util/tracing/span_inner.go:91 tracing.(*spanInner).Finish {  }
../../util/tracing/span_inner.go:226 tracing.(*spanInner).Tracer { func (s *spanInner) Tracer() *Tracer { }
../../util/tracing/span.go:246 tracing.(*Span).Finish { sp.finishInternal() }
../../kv/kvserver/replica_range_lease.go:480 kvserver.(*pendingLeaseRequest).requestLeaseAsync.func2 { p.cancelLocked() }
../../util/stop/stopper.go:494 stop.(*Stopper).RunAsyncTaskEx.func2 { f(ctx) }

ERROR: exit status 2

1 runs completed, 1 failures, over 1s
context canceled
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM) Parameters in this failure: - TAGS=deadlock - GOFLAGS=-parallel=4

Same failure on other branches

- #73829 ccl/multiregionccl: TestMultiRegionDataDriven failed [C-test-failure O-robot branch-release-21.2]

This test on roachdash | Improve this report!

irfansharif commented 2 years ago

(Ignoring the last tracing failure that Andrei fixed)

These tests have been failing for a while, even before #73876 (which certainly couldn't have helped with its additional asynchrony). Here's one from 21.2: #73829. +cc #75543 (a bazel dup?). Looking at the test directives, we seem make use of arbitrary sleeps, perhaps that's why?

https://github.com/cockroachdb/cockroach/blob/e49415f3b9f9cab3698bfa72da683b1c8f25e6dc/pkg/ccl/multiregionccl/datadriven_test.go#L124-L125

In #75304 we observed the need to sometimes to sometimes retry the query itself to accommodate stale distsender caches, our parallel here being this directive, so perhaps that's not as much of a problem here:

https://github.com/cockroachdb/cockroach/blob/e49415f3b9f9cab3698bfa72da683b1c8f25e6dc/pkg/ccl/multiregionccl/datadriven_test.go#L245

For the "wait-for-zone-config-changes" primitive, we can transfer the leaseholder by issuing a one-off transfer request, which I don't think guarantees that the lease would stay pinned to the destination. Likely a red-herring for these failures, but worth noting I think.

https://github.com/cockroachdb/cockroach/blob/e49415f3b9f9cab3698bfa72da683b1c8f25e6dc/pkg/ccl/multiregionccl/datadriven_test.go#L72-L88


I'm not planning on looking at these tests further unless pushed. +cc @arulajmani / @ajstorm / @nvanbenschoten: should we skip these tests on master? Do we think we'll have the time to polish them up? While they do provide good coverage, I do think it'd take time to make them a bit more robust, and in relation to #73876 at least, I'm not expecting any work here catch bugs as much as ensure this test is flake-proof.

otan commented 2 years ago

hasn't triggered for over a month

irfansharif commented 2 years ago

hasn't triggered for over a month

They have on release-22.1 which I guess Arul's currently assigned to (#77908). I suspect this will just fail again but happy connect the history back to this issue when it happens. This is a pretty long running test, 45s+ at times, if it's not flaking in CI it's also possible because it's just not getting stressed enough. I hope we can either rewrite this test to be faster or rip it out entirely.