Closed cockroach-teamcity closed 2 years ago
goroutine 16364430 [select, 593 minutes]:
net/http.(*persistConn).roundTrip(0xc002814360, 0xc000ced940)
GOROOT/src/net/http/transport.go:2614 +0x97d
net/http.(*Transport).roundTrip(0x9575040, 0xc00367db00)
GOROOT/src/net/http/transport.go:594 +0x7d1
net/http.(*Transport).RoundTrip(0x30, 0x65a6000)
GOROOT/src/net/http/roundtrip.go:18 +0x19
net/http.send(0xc00367db00, {0x65a6000, 0x9575040}, {0x503df00, 0x591c01, 0x0})
GOROOT/src/net/http/client.go:252 +0x5d8
net/http.(*Client).send(0xc003e17830, 0xc00367db00, {0x0, 0x10000000c, 0x0})
GOROOT/src/net/http/client.go:176 +0x9b
net/http.(*Client).do(0xc003e17830, 0xc00367db00)
GOROOT/src/net/http/client.go:725 +0x908
net/http.(*Client).Do(...)
GOROOT/src/net/http/client.go:593
github.com/cockroachdb/cockroach/pkg/util/httputil.doJSONRequest({{0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0}, 0xc00367db00, {0x66b9da0, 0xc003254918})
github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:109 +0x2d1
github.com/cockroachdb/cockroach/pkg/util/httputil.PostJSON({{0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0}, {0xc00273b500, 0x22}, {0x66b9d50, 0xc003e17740}, ...)
github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:74 +0x16f
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.verifyHighFollowerReadRatios({0x6648bb0, 0xc005a44b00}, {0x674bc28, 0xc0007328c0}, {0x676d3e8, 0xc003e24c80}, 0x1, {0x38b2fcc4, 0xed9cb8640, 0x0}, ...)
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:711 +0x7ef
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runFollowerReadsTest({0x6648bb0, 0xc005a44b00}, {0x674bc28, 0xc0007328c0}, {0x676d3e8, 0xc003e24c80}, {0x1, {0x510203d, 0x8}, {0x50f894b, ...}, ...}, ...)
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:373 +0x15ed
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerFollowerReads.func1.1({0x6648bb0, 0xc005a44b00}, {0x674bc28, 0xc0007328c0}, {0x676d3e8, 0xc003e24c80})
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:72 +0x3a6
It's hard to tell from the test setup (there's a lot going on), but a simple explanation would be that the timeseries query above hangs because we lost quorum: timeseries are also stored in the KV store, so there is no reason to expect them to stay queryable. Timeseries are 3x replicated and we're killing three out of six nodes. Unless the test is being clever about making sure the timeseries are in the surviving region, this kind of problem is expected.
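To make the quorum arithmetic concrete, here is a minimal sketch. The replica placement and the set of killed nodes are hypothetical; the thread only says that timeseries are 3x replicated and that three of six nodes are killed.

```go
package main

import "fmt"

// hasQuorum reports whether a range keeps a strict majority of its
// replicas after the nodes in dead are stopped. replicas lists the
// IDs of the nodes holding a replica of the range.
func hasQuorum(replicas []int, dead map[int]bool) bool {
	live := 0
	for _, n := range replicas {
		if !dead[n] {
			live++
		}
	}
	return live > len(replicas)/2
}

func main() {
	// Hypothetical placement: a 3x replicated range on n1, n2, n3.
	tsRange := []int{1, 2, 3}

	// Hypothetical kill set that takes out two of the three replicas.
	fmt.Println(hasQuorum(tsRange, map[int]bool{1: true, 2: true, 4: true})) // false

	// Killing three nodes is fine if only one of them holds a replica.
	fmt.Println(hasQuorum(tsRange, map[int]bool{3: true, 5: true, 6: true})) // true
}
```

Whether the range survives depends entirely on which three nodes are killed, which is why the placement of the timeseries replicas matters here.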
Replica circuit breakers should've let this test fail "gracefully" with a loss of quorum error. However, the SHA hadn't picked up https://github.com/cockroachdb/cockroach/pull/76146 yet, so breakers were disabled.
Going to toss this over to KV for further investigation.
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 834eaa0e83350486830867b5edd6e8809b52aa55:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
follower_reads.go:712,follower_reads.go:373,follower_reads.go:72,test_runner.go:875: Post "http://34.82.187.133:26258/ts/query": EOF
test_runner.go:1006,test_runner.go:905: test timed out (0s)
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
- #78444 roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] [C-test-failure O-roachtest O-robot T-kv branch-release-22.1 release-blocker]
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on master @ 29716850b181718594663889ddb5f479fef7a305:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
cluster.go:1868,follower_reads.go:64,test_runner.go:875: one or more parallel execution failure
(1) attached stack trace
-- stack trace:
| github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE
| github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2042
| github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Parallel
| github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1923
| github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start
| github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:167
| github.com/cockroachdb/cockroach/pkg/roachprod.Start
| github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:660
| main.(*clusterImpl).StartE
| main/pkg/cmd/roachtest/cluster.go:1826
| main.(*clusterImpl).Start
| main/pkg/cmd/roachtest/cluster.go:1867
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerFollowerReads.func1.1
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:64
| main.(*testRunner).runTest.func2
| main/pkg/cmd/roachtest/test_runner.go:875
| runtime.goexit
| GOROOT/src/runtime/asm_amd64.s:1581
Wraps: (2) one or more parallel execution failure
Error types: (1) *withstack.withStack (2) *errutil.leafError
- #78444 roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] [C-test-failure O-roachtest O-robot T-kv branch-release-22.1 release-blocker]
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 29716850b181718594663889ddb5f479fef7a305:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
cluster.go:1868,follower_reads.go:64,test_runner.go:875: one or more parallel execution failure
(1) attached stack trace
-- stack trace:
| github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE
| github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2042
| github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Parallel
| github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1923
| github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start
| github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:167
| github.com/cockroachdb/cockroach/pkg/roachprod.Start
| github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:660
| main.(*clusterImpl).StartE
| main/pkg/cmd/roachtest/cluster.go:1826
| main.(*clusterImpl).Start
| main/pkg/cmd/roachtest/cluster.go:1867
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerFollowerReads.func1.1
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:64
| main.(*testRunner).runTest.func2
| main/pkg/cmd/roachtest/test_runner.go:875
| runtime.goexit
| GOROOT/src/runtime/asm_amd64.s:1581
Wraps: (2) one or more parallel execution failure
Error types: (1) *withstack.withStack (2) *errutil.leafError
- #78444 roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] [C-test-failure O-roachtest O-robot T-kv branch-release-22.1 release-blocker]
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 63ea9139e2ca996e38b5fe7c7b43a97e625242f5:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
follower_reads.go:712,follower_reads.go:373,follower_reads.go:72,test_runner.go:875: Post "http://35.233.229.131:26258/ts/query": EOF
test_runner.go:1006,test_runner.go:905: test timed out (0s)
- #78444 roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] [C-test-failure O-roachtest O-robot T-kv branch-release-22.1 release-blocker]
> Unless the test is being clever about making sure the timeseries are in the surviving region, this kind of problem is expected.
We are attempting to be smart about this. See this code:
That logic should wait until all ranges other than the range in the database with ZONE survivability have upreplicated across regions.
But this isn't what we see in the logs you posted. Notice the range descriptor in r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4].
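As an illustration of why a descriptor with replicas only on n1, n2 and n3 is a problem, the sketch below checks whether a range keeps quorum when any single region is lost. The node-to-region mapping and region names are assumptions made for the example; the thread does not state how the six nodes are laid out.

```go
package main

import "fmt"

// survivesRegionLoss reports whether a range keeps a strict majority
// of its replicas no matter which single region fails.
func survivesRegionLoss(replicas []int, regionOf map[int]string) bool {
	perRegion := map[string]int{}
	for _, n := range replicas {
		perRegion[regionOf[n]]++
	}
	for _, lost := range perRegion {
		if len(replicas)-lost <= len(replicas)/2 {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical layout: two nodes in each of three regions.
	regionOf := map[int]string{
		1: "region-a", 2: "region-a",
		3: "region-b", 4: "region-b",
		5: "region-c", 6: "region-c",
	}

	// Replicas as in the r4 descriptor: n1, n2, n3.
	fmt.Println(survivesRegionLoss([]int{1, 2, 3}, regionOf)) // false

	// Five replicas spread so that no region holds more than two.
	fmt.Println(survivesRegionLoss([]int{1, 2, 3, 4, 5}, regionOf)) // true
}
```

Under this assumed layout, losing region-a leaves the three-replica range with one live replica out of three, which matches the unavailability seen in the test.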
I reproduced this and confirmed that the unavailable timeseries range never achieved region survivability:
Something must be going wrong here with the replication reports. I wonder if it's related to async span config reconciliation in some form.
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ cc07b5e7e670097560cb8412b380484773df1e96:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
follower_reads.go:718,follower_reads.go:379,follower_reads.go:73,test_runner.go:875: Post "http://34.82.76.44:26258/ts/query": EOF
test_runner.go:1006,test_runner.go:905: test timed out (0s)
- #78444 roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] [C-test-failure GA-blocker O-roachtest O-robot T-kv branch-release-22.1]
I'm pretty sure I've got this one. My suspicion is that it was caused by https://github.com/cockroachdb/cockroach/pull/76279, which meant that the system config was not immediately available when we first set the cluster setting to trigger the report, but that's just a hunch. I'm not totally clear on why it was okay to only look at one report before this change. Maybe report generation timing was such that the report was always there after one iteration of the retry loop, and now, for whatever reason, some timing issue involving the rangefeed means that we have to do more than one iteration.
What I do know is that when I added code to print out the table state for a bunch of the tables inside the code that checks the critical localities, before we actually did the scan to check on them, the test ran 60 times without failing, where previously I was seeing 1-2 failures per 20 runs. That led me to wonder if we just weren't waiting for the right thing. Indeed, it seems like we weren't. We were only waiting for one report to be written, but there are 3 reports in total and we write the critical localities report second.
I'm running it more; I'm at 55 successes with https://github.com/cockroachdb/cockroach/pull/79977 and it feels right to me. I've removed the release blocker label. My working theory is just that that change altered the timing and exposed the bug. I don't feel super eager to prove this out further right now, but I'm pretty happy with the answer for the moment.
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 10e0c5d92f8ef953d6b497b448893bb5044cdd31:
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
/cc @cockroachdb/kv-triage
Jira issue: CRDB-14044