cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.15k stars 3.81k forks source link

roachtest: multitenant/distsql: test needs an audit #121260

Open cockroach-teamcity opened 7 months ago

cockroach-teamcity commented 7 months ago

roachtest.multitenant/distsql/instances=20/bundle=on/timeout=0 failed with artifacts on master @ 460707194644edbf4c6258d09b46d9877c8eefa6:

(assertions.go:333).Fail: 
    Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_distsql.go:125
                                main/pkg/cmd/roachtest/monitor.go:120
                                golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78
                                src/runtime/asm_amd64.s:1695
    Error:          Received unexpected error:
                    dial tcp 34.75.192.56:29001: connect: connection refused
    Test:           multitenant/distsql/instances=20/bundle=on/timeout=0
    Messages:       instance idx = 1, iter = 28
(require.go:1360).NoError: FailNow called
(assertions.go:333).Fail: 
    Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_distsql.go:210
                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_distsql.go:47
                                main/pkg/cmd/roachtest/test_runner.go:1161
                                src/runtime/asm_amd64.s:1695
    Error:          Received unexpected error:
                    failed to retrieve statement diagnostics bundles: failed to connect to ``host=34.23.163.214 user=roachprod database=``: dial error (dial tcp 34.23.163.214:29001: operation was canceled)
                    (1) attached stack trace
                      -- stack trace:
                      | github.com/cockroachdb/cockroach/pkg/cli/clisqlclient.StmtDiagListBundles
                      |     github.com/cockroachdb/cockroach/pkg/cli/clisqlclient/statement_diag.go:40
                      | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runMultiTenantDistSQL
                      |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_distsql.go:209
                      | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerMultiTenantDistSQL.func1
                      |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_distsql.go:47
                      | main.(*testRunner).runTest.func2
                      |     main/pkg/cmd/roachtest/test_runner.go:1161
                      | runtime.goexit
                      |     src/runtime/asm_amd64.s:1695
                    Wraps: (2) failed to retrieve statement diagnostics bundles
                    Wraps: (3) failed to connect to ``host=34.23.163.214 user=roachprod database=``: dial error (dial tcp 34.23.163.214:29001: operation was canceled)
                    Wraps: (4) dial tcp 34.23.163.214:29001
                    Wraps: (5) operation was canceled
                    Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *pgconn.connectError (4) *net.OpError (5) net.canceledError
    Test:           multitenant/distsql/instances=20/bundle=on/timeout=0
(require.go:1360).NoError: FailNow called
test artifacts and logs in: /artifacts/multitenant/distsql/instances=20/bundle=on/timeout=0/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-37158

yuzefovich commented 7 months ago

Around 05:59:27 we start seeing network connectivity issues. For example, from pod 14 to pod 8

W240328 05:59:40.327830 1161 sql/stats/automatic_stats.go:880 ⋮ [T3,Vtest-tenant,nsql14] 84  failed to create statistics on table 104: ‹create-stats›: failed to connect to n8 at ‹10.142.1.68:29002›: initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.1.68:29002: connect: connection refused"› [code 14/Unavailable]

Here are the last messages we have for pod 8 (logs-test-tenant-1/4.unredacted):

I240328 05:58:53.521336 120 kv/kvclient/kvtenant/setting_overrides.go:234 ⋮ [T3,Vtest-tenant,nsql8,tenant-connector] 76  received 3 setting overrides with precedence ‹TENANT_SPECIFIC_OVERRIDES› (incremental=false)
I240328 05:58:53.521374 120 kv/kvclient/kvtenant/setting_overrides.go:138 ⋮ [T3,Vtest-tenant,nsql8,tenant-connector] 77  received initial tenant settings
I240328 05:58:53.521410 141 server/settingswatcher/settings_watcher.go:514 ⋮ [T3,Vtest-tenant,nsql8] 78  updating storage cluster cached version from: 23.2-upgrading-to-24.1-step-024 to: 23.2-upgrading-to-24.1-step-024
W240328 05:59:27.471096 4160 kv/kvclient/kvcoord/dist_sender.go:2719 ⋮ [T3,Vtest-tenant,nsql8,client=34.86.0.62:45324,hostssl,user=‹roachprod›] 79  slow replica RPC: have been waiting 10.89s (0 attempts) for RPC ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5501›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5502›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5503›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5504›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5505›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5506›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5507›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5508›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5509›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5510›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5511›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5512›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5513›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5514›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5515›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5516›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5517›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5518›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5519›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5520›/‹0›],... 76 skipped ..., ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5597›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5598›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5599›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5600›/‹0›], EndTxn(parallel commit) [/Tenant/3/Table/104/1/‹7›/‹5501›/‹0›], [txn: 60e0c6c4], [can-forward-ts] to replica (n2,s2):3; resp: context canceled
E240328 05:59:28.172324 839 2@rpc/peer.go:577 ⋮ [T3,Vtest-tenant,nsql8,rnode=2,raddr=‹10.142.1.73:29001›,class=default,rpc] 80  disconnected (was healthy for 2m8.134s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
W240328 05:59:30.055258 180 sql/sqlliveness/slinstance/slinstance.go:326 ⋮ [T3,Vtest-tenant,nsql8] 81  exiting heartbeat loop
W240328 05:59:30.055289 180 sql/sqlliveness/slinstance/slinstance.go:313 ⋮ [T3,Vtest-tenant,nsql8] 82  exiting heartbeat loop with error: session record deleted
I240328 05:59:30.055323 295 cli/start.go:876 ⋮ [T3,Vtest-tenant,nsql8] 83  server requesting spontaneous shutdown: sql liveness session deleted
I240328 05:59:30.057686 1 1@cli/start.go:1035 ⋮ [T3,Vtest-tenant,nsql8] 84  initiating hard shutdown of server
I240328 05:59:30.057800 251 sql/stats/automatic_stats.go:608 ⋮ [T3,Vtest-tenant,nsql8] 85  quiescing stats garbage collector
I240328 05:59:30.057862 115 server/start_listen.go:110 ⋮ [T3,Vtest-tenant,nsql8] 86  server shutting down: instructing cmux to stop accepting
W240328 05:59:30.058007 119 kv/kvclient/kvtenant/connector.go:416 ⋮ [T3,Vtest-tenant,nsql8,tenant-connector] 87  error consuming GossipSubscription RPC: recv msg error: grpc: ‹context canceled› [code 1/Canceled]
W240328 05:59:30.058337 120 kv/kvclient/kvtenant/setting_overrides.go:66 ⋮ [T3,Vtest-tenant,nsql8,tenant-connector] 88  error consuming TenantSettings RPC: recv msg error: grpc: ‹error reading from server: read tcp 10.142.1.68:58156->34.75.192.56:29000: use of closed network connection› [code 14/Unavailable]
W240328 05:59:30.065986 4943 server/server_sql.go:1707 ⋮ [T3,Vtest-tenant,nsql8] 89  server shutdown without a prior graceful drain
I240328 05:59:30.066059 1 1@cli/start.go:1106 ⋮ [T3,Vtest-tenant,nsql8] 90  server drained and shutdown completed

Not sure what to make of it.

Logs are also full of messages like

teamcity-14594861-1711603819-09-n4cpu4-0002> I240328 05:59:16.431516 34627 kv/kvserver/replica_command.go:473 ⋮ [T1,Vsystem,n2,split,s2,r88/3:‹×›] 198 initiating a split of this range at key /Tenant/3/Table/104/1/‹×›/‹×› [r90] (303 MiB above threshold size 98 KiB)
teamcity-14594861-1711603819-09-n4cpu4-0002> I240328 05:59:16.730284 34614 kv/kvserver/replica_command.go:473 ⋮ [T1,Vsystem,n2,split,s2,r89/3:‹×›] 199 initiating a split of this range at key /Tenant/3/Table/104/1/‹×›/‹×› [r91] (304 MiB above threshold size 98 KiB)

This is because the test issues this command:

ALTER TABLE t CONFIGURE ZONE USING range_min_bytes = 1000,range_max_bytes = 100000;

It also spins up 20 SQL instances, and a separate goroutine for each SQL instance that continuously INSERTs a row roughly 8KiB in size.

I wonder whether the storage cluster simply got overwhelmed with the sheer number of ranges (there are 8373 messages matching "above threshold size" pattern in the logs). I think that the test need to be audited to not produce this many ranges and stop INSERT goroutines at some point.

yuzefovich commented 7 months ago

For now, I'll move it to the backlog since I don't think we've seen a similar reproduction before.

yuzefovich commented 4 months ago

Also see https://github.com/cockroachdb/cockroach/issues/126313#issuecomment-2229464193 for more insight.