cockroachdb / cockroach


roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] #78284

Closed cockroach-teamcity closed 2 years ago

cockroach-teamcity commented 2 years ago

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 10e0c5d92f8ef953d6b497b448893bb5044cdd31:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
    follower_reads.go:712,follower_reads.go:373,follower_reads.go:72,test_runner.go:875: Post "http://34.145.50.48:26258/ts/query": EOF

    test_runner.go:1006,test_runner.go:905: test timed out (0s)
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

/cc @cockroachdb/kv-triage


Jira issue: CRDB-14044

tbg commented 2 years ago

```
goroutine 16364430 [select, 593 minutes]:
net/http.(*persistConn).roundTrip(0xc002814360, 0xc000ced940)
    GOROOT/src/net/http/transport.go:2614 +0x97d
net/http.(*Transport).roundTrip(0x9575040, 0xc00367db00)
    GOROOT/src/net/http/transport.go:594 +0x7d1
net/http.(*Transport).RoundTrip(0x30, 0x65a6000)
    GOROOT/src/net/http/roundtrip.go:18 +0x19
net/http.send(0xc00367db00, {0x65a6000, 0x9575040}, {0x503df00, 0x591c01, 0x0})
    GOROOT/src/net/http/client.go:252 +0x5d8
net/http.(*Client).send(0xc003e17830, 0xc00367db00, {0x0, 0x10000000c, 0x0})
    GOROOT/src/net/http/client.go:176 +0x9b
net/http.(*Client).do(0xc003e17830, 0xc00367db00)
    GOROOT/src/net/http/client.go:725 +0x908
net/http.(*Client).Do(...)
    GOROOT/src/net/http/client.go:593
github.com/cockroachdb/cockroach/pkg/util/httputil.doJSONRequest({{0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0}, 0xc00367db00, {0x66b9da0, 0xc003254918})
    github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:109 +0x2d1
github.com/cockroachdb/cockroach/pkg/util/httputil.PostJSON({{0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0}, {0xc00273b500, 0x22}, {0x66b9d50, 0xc003e17740}, ...)
    github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:74 +0x16f
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.verifyHighFollowerReadRatios({0x6648bb0, 0xc005a44b00}, {0x674bc28, 0xc0007328c0}, {0x676d3e8, 0xc003e24c80}, 0x1, {0x38b2fcc4, 0xed9cb8640, 0x0}, ...)
    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:711 +0x7ef
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runFollowerReadsTest({0x6648bb0, 0xc005a44b00}, {0x674bc28, 0xc0007328c0}, {0x676d3e8, 0xc003e24c80}, {0x1, {0x510203d, 0x8}, {0x50f894b, ...}, ...}, ...)
    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:373 +0x15ed
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerFollowerReads.func1.1({0x6648bb0, 0xc005a44b00}, {0x674bc28, 0xc0007328c0}, {0x676d3e8, 0xc003e24c80})
    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:72 +0x3a6
```
tbg commented 2 years ago

It's hard to tell from the test setup (there's a lot going on), but a simple explanation would be that the timeseries query above hangs because we lost quorum: timeseries data is also stored in the KV store, so there's no reason to expect it to remain queryable. Timeseries are 3x replicated and we're killing three out of six nodes. Unless the test is being clever about making sure the timeseries are in the surviving region, this kind of problem is expected.
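
As a concrete way to check that hypothesis, one could ask a surviving node where the replicas of the timeseries range actually live before the region goes down. This is a minimal sketch, not code from the test; the `crdb_internal.ranges` column names and the connection string are assumptions for illustration:

```go
// Hypothetical spot-check, not part of the roachtest: before the region is
// killed, ask a surviving node where the timeseries range's replicas live.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

func main() {
	// Illustrative connection string; point it at any node that will survive.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// /System/tsd is where timeseries data lives; that's r4 in the logs below.
	rows, err := db.Query(`
		SELECT range_id, start_pretty, replicas, replica_localities
		FROM crdb_internal.ranges
		WHERE start_pretty LIKE '/System/tsd%'`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var rangeID int
		var start string
		var replicas, localities []byte // array-valued columns, printed raw
		if err := rows.Scan(&rangeID, &start, &replicas, &localities); err != nil {
			log.Fatal(err)
		}
		// If every replica locality sits in the region about to be killed,
		// the /ts/query POST in this test can only hang or fail.
		fmt.Printf("r%d %s replicas=%s localities=%s\n", rangeID, start, replicas, localities)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```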

We're definitely seeing unexpected request durations on those ranges, indicating that they are indeed unavailable:

```
$ cockroach debug merge-logs logs/*.unredacted | grep -E 'r4[^0-9]' | grep 'have been'
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:33.477331 10085 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1500 slow range RPC: have been waiting 60.14s (58 attempts) for RPC Get [‹/System/tsd/cr.node.sql.select.count/4/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.node.sql.select.count/4/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:33.682050 10086 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1502 slow range RPC: have been waiting 60.35s (59 attempts) for RPC Get [‹/System/tsd/cr.store.follower_reads.success_count/5/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.store.follower_reads.success_count/5/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: failed to connect to n1 at 10.142.0.194:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.142.0.194:26257: connect: connection refused"›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:34.254075 10089 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1504 slow range RPC: have been waiting 60.92s (59 attempts) for RPC Get [‹/System/tsd/cr.node.sql.select.count/6/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.node.sql.select.count/6/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:34.254099 10087 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1505 slow range RPC: have been waiting 60.92s (59 attempts) for RPC Get [‹/System/tsd/cr.node.sql.select.count/5/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.node.sql.select.count/5/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:34.430622 10084 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1506 slow range RPC: have been waiting 61.10s (59 attempts) for RPC Get [‹/System/tsd/cr.store.follower_reads.success_count/4/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.store.follower_reads.success_count/4/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:34.430677 10088 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1507 slow range RPC: have been waiting 61.09s (60 attempts) for RPC Get [‹/System/tsd/cr.store.follower_reads.success_count/6/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.store.follower_reads.success_count/6/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 18:55:23.194077 1049484 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,client=35.185.76.110:33742,user=root] 152262 slow range RPC: have been waiting 60.45s (59 attempts) for RPC CheckConsistency [‹/System/tsd›,‹/System/"tse"›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›
```

Replica circuit breakers should've let this test fail "gracefully" with a loss of quorum error. However, the SHA hadn't picked up https://github.com/cockroachdb/cockroach/pull/76146 yet, so breakers were disabled.
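
For reference, on a build that predates #76146 the breakers could still be switched on by hand. A minimal sketch, assuming the `kv.replica_circuit_breaker.slow_replication_threshold` cluster setting (where zero disables the breakers) and an illustrative connection string:

```go
// Minimal sketch of turning replica circuit breakers on by hand on a build
// that predates #76146. The setting name and the "0 disables" semantics are
// assumptions about this era of CockroachDB, not taken from the test.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// With a non-zero threshold, requests to an unavailable range should trip
	// the breaker and return a loss-of-quorum style error instead of hanging,
	// which is the "graceful" failure mode described above.
	if _, err := db.Exec(
		`SET CLUSTER SETTING kv.replica_circuit_breaker.slow_replication_threshold = '15s'`,
	); err != nil {
		log.Fatal(err)
	}
}
```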

Going to toss this over to KV for further investigation.

cockroach-teamcity commented 2 years ago

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 834eaa0e83350486830867b5edd6e8809b52aa55:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
    follower_reads.go:712,follower_reads.go:373,follower_reads.go:72,test_runner.go:875: Post "http://34.82.187.133:26258/ts/query": EOF

    test_runner.go:1006,test_runner.go:905: test timed out (0s)
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #78444 roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] [C-test-failure O-roachtest O-robot T-kv branch-release-22.1 release-blocker]


cockroach-teamcity commented 2 years ago

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on master @ 29716850b181718594663889ddb5f479fef7a305:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
    cluster.go:1868,follower_reads.go:64,test_runner.go:875: one or more parallel execution failure
        (1) attached stack trace
          -- stack trace:
          | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE
          |     github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2042
          | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Parallel
          |     github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1923
          | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start
          |     github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:167
          | github.com/cockroachdb/cockroach/pkg/roachprod.Start
          |     github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:660
          | main.(*clusterImpl).StartE
          |     main/pkg/cmd/roachtest/cluster.go:1826
          | main.(*clusterImpl).Start
          |     main/pkg/cmd/roachtest/cluster.go:1867
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerFollowerReads.func1.1
          |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:64
          | main.(*testRunner).runTest.func2
          |     main/pkg/cmd/roachtest/test_runner.go:875
          | runtime.goexit
          |     GOROOT/src/runtime/asm_amd64.s:1581
        Wraps: (2) one or more parallel execution failure
        Error types: (1) *withstack.withStack (2) *errutil.leafError
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #78444 roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] [C-test-failure O-roachtest O-robot T-kv branch-release-22.1 release-blocker]


cockroach-teamcity commented 2 years ago

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 29716850b181718594663889ddb5f479fef7a305:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
    cluster.go:1868,follower_reads.go:64,test_runner.go:875: one or more parallel execution failure
        (1) attached stack trace
          -- stack trace:
          | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE
          |     github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2042
          | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Parallel
          |     github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1923
          | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start
          |     github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:167
          | github.com/cockroachdb/cockroach/pkg/roachprod.Start
          |     github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:660
          | main.(*clusterImpl).StartE
          |     main/pkg/cmd/roachtest/cluster.go:1826
          | main.(*clusterImpl).Start
          |     main/pkg/cmd/roachtest/cluster.go:1867
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerFollowerReads.func1.1
          |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:64
          | main.(*testRunner).runTest.func2
          |     main/pkg/cmd/roachtest/test_runner.go:875
          | runtime.goexit
          |     GOROOT/src/runtime/asm_amd64.s:1581
        Wraps: (2) one or more parallel execution failure
        Error types: (1) *withstack.withStack (2) *errutil.leafError
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #78444 roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] [C-test-failure O-roachtest O-robot T-kv branch-release-22.1 release-blocker]


cockroach-teamcity commented 2 years ago

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 63ea9139e2ca996e38b5fe7c7b43a97e625242f5:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
    follower_reads.go:712,follower_reads.go:373,follower_reads.go:72,test_runner.go:875: Post "http://35.233.229.131:26258/ts/query": EOF

    test_runner.go:1006,test_runner.go:905: test timed out (0s)
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #78444 roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] [C-test-failure O-roachtest O-robot T-kv branch-release-22.1 release-blocker]


nvanbenschoten commented 2 years ago

> Unless the test is being clever about making sure the timeseries are in the surviving region, this kind of problem is expected.

We are attempting to be smart about this. See this code:

https://github.com/cockroachdb/cockroach/blob/7506753ab7b228c99534f539c06b0715ec560890/pkg/cmd/roachtest/tests/follower_reads.go#L529-L566

That logic should wait until all ranges, other than those in the database with ZONE survivability, have upreplicated across regions.
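
For concreteness, the overall shape of that wait looks roughly like the sketch below. This is not the test's code: it assumes the `system.replication_critical_localities` report schema and, unlike the real check linked above, does not carve out the ZONE-survivability database (whose ranges are intentionally left at risk of a region failure):

```go
// Rough sketch of a wait-for-upreplication loop against the critical
// localities replication report. The schema assumptions and polling cadence
// are illustrative, not taken from follower_reads.go.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func waitForNoCriticalLocalities(db *sql.DB) error {
	for {
		var atRisk int
		// Each row reports how many ranges would lose quorum if the given
		// locality disappeared; zero everywhere means region survivability.
		if err := db.QueryRow(`
			SELECT coalesce(sum(at_risk_ranges), 0)::INT
			FROM system.replication_critical_localities`).Scan(&atRisk); err != nil {
			return err
		}
		if atRisk == 0 {
			return nil
		}
		log.Printf("still %d at-risk ranges; waiting for upreplication", atRisk)
		time.Sleep(5 * time.Second) // polling cadence is illustrative
	}
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := waitForNoCriticalLocalities(db); err != nil {
		log.Fatal(err)
	}
}
```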

But this isn't what we see in the logs you posted. Notice the range descriptor in `r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]`.

I reproduced this and confirmed that the unavailable timeseries range never achieved region survivability:

[Two screenshots attached: Screen Shot 2022-04-12 at 3 25 27 PM, Screen Shot 2022-04-12 at 3 25 44 PM]

Something must be going wrong here with the replication reports. I wonder if it's related to async span config reconciliation in some form.

cockroach-teamcity commented 2 years ago

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ cc07b5e7e670097560cb8412b380484773df1e96:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
    follower_reads.go:718,follower_reads.go:379,follower_reads.go:73,test_runner.go:875: Post "http://34.82.76.44:26258/ts/query": EOF

    test_runner.go:1006,test_runner.go:905: test timed out (0s)
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #78444 roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] [C-test-failure GA-blocker O-roachtest O-robot T-kv branch-release-22.1]


ajwerner commented 2 years ago

I'm pretty sure I've got this one. My suspicion is that it was caused by https://github.com/cockroachdb/cockroach/pull/76279, which meant that the system config was not immediately available when we first set the cluster setting to trigger the report, but it's just a hunch really. I'm not totally clear on why it was okay to only look at one report before this change. Maybe the report generation timing was such that we always did one iteration of the retry loop and the report was there by then; now, for whatever reason, some timing change involving the rangefeed means we have to do more than one iteration.

What I do know is that when I added code to print out the table state for a bunch of the tables inside the critical-localities check, but before we actually did the scan to check on them, the test ran 60 times without failing, whereas in the past I was getting 1-2 failures per 20 runs. That led me to wonder whether we just weren't waiting for the right thing. Indeed it seems like we weren't: we were only waiting for one report to be written, but there are 3 reports in total, and the critical localities report is written second.
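
A sketch of the stricter wait this implies, assuming the `system.reports_meta` schema (one row per replication report with a `generated` timestamp) and the three-report count described above; the actual fix is #79977, this only illustrates the idea:

```go
// Sketch of waiting for *all* replication reports rather than just the first
// one. The schema assumptions are hypothetical illustrations, not the code
// from #79977.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

// waitForAllReports blocks until every replication report has been
// (re)generated after the given timestamp, e.g. the moment the cluster
// setting that triggers report generation was flipped.
func waitForAllReports(db *sql.DB, after time.Time) error {
	for {
		var fresh int
		if err := db.QueryRow(
			`SELECT count(*) FROM system.reports_meta WHERE generated > $1`, after,
		).Scan(&fresh); err != nil {
			return err
		}
		if fresh >= 3 { // critical localities is only the second report written
			return nil
		}
		time.Sleep(time.Second)
	}
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := waitForAllReports(db, time.Now().Add(-time.Minute)); err != nil {
		log.Fatal(err)
	}
}
```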

I'm running it more; I'm at 55 successes with https://github.com/cockroachdb/cockroach/pull/79977 and it feels right to me. I've removed the release blocker label. My working theory is just that the change above shifted the timing and exposed the bug. I don't feel super eager to prove this out further right now, but I'm pretty happy with the answer of the moment.