Open cockroach-teamcity opened 7 months ago
Around 05:59:27 we start seeing network connectivity issues. For example, from pod 14 to pod 8
W240328 05:59:40.327830 1161 sql/stats/automatic_stats.go:880 ⋮ [T3,Vtest-tenant,nsql14] 84 failed to create statistics on table 104: ‹create-stats›: failed to connect to n8 at ‹10.142.1.68:29002›: initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.1.68:29002: connect: connection refused"› [code 14/Unavailable]
Here are the last messages we have for pod 8 (logs-test-tenant-1/4.unredacted
):
I240328 05:58:53.521336 120 kv/kvclient/kvtenant/setting_overrides.go:234 ⋮ [T3,Vtest-tenant,nsql8,tenant-connector] 76 received 3 setting overrides with precedence ‹TENANT_SPECIFIC_OVERRIDES› (incremental=false)
I240328 05:58:53.521374 120 kv/kvclient/kvtenant/setting_overrides.go:138 ⋮ [T3,Vtest-tenant,nsql8,tenant-connector] 77 received initial tenant settings
I240328 05:58:53.521410 141 server/settingswatcher/settings_watcher.go:514 ⋮ [T3,Vtest-tenant,nsql8] 78 updating storage cluster cached version from: 23.2-upgrading-to-24.1-step-024 to: 23.2-upgrading-to-24.1-step-024
W240328 05:59:27.471096 4160 kv/kvclient/kvcoord/dist_sender.go:2719 ⋮ [T3,Vtest-tenant,nsql8,client=34.86.0.62:45324,hostssl,user=‹roachprod›] 79 slow replica RPC: have been waiting 10.89s (0 attempts) for RPC ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5501›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5502›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5503›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5504›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5505›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5506›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5507›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5508›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5509›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5510›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5511›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5512›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5513›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5514›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5515›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5516›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5517›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5518›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5519›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5520›/‹0›],... 76 skipped ..., ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5597›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5598›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5599›/‹0›], ConditionalPut [/Tenant/3/Table/104/1/‹7›/‹5600›/‹0›], EndTxn(parallel commit) [/Tenant/3/Table/104/1/‹7›/‹5501›/‹0›], [txn: 60e0c6c4], [can-forward-ts] to replica (n2,s2):3; resp: context canceled
E240328 05:59:28.172324 839 2@rpc/peer.go:577 ⋮ [T3,Vtest-tenant,nsql8,rnode=2,raddr=‹10.142.1.73:29001›,class=default,rpc] 80 disconnected (was healthy for 2m8.134s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
W240328 05:59:30.055258 180 sql/sqlliveness/slinstance/slinstance.go:326 ⋮ [T3,Vtest-tenant,nsql8] 81 exiting heartbeat loop
W240328 05:59:30.055289 180 sql/sqlliveness/slinstance/slinstance.go:313 ⋮ [T3,Vtest-tenant,nsql8] 82 exiting heartbeat loop with error: session record deleted
I240328 05:59:30.055323 295 cli/start.go:876 ⋮ [T3,Vtest-tenant,nsql8] 83 server requesting spontaneous shutdown: sql liveness session deleted
I240328 05:59:30.057686 1 1@cli/start.go:1035 ⋮ [T3,Vtest-tenant,nsql8] 84 initiating hard shutdown of server
I240328 05:59:30.057800 251 sql/stats/automatic_stats.go:608 ⋮ [T3,Vtest-tenant,nsql8] 85 quiescing stats garbage collector
I240328 05:59:30.057862 115 server/start_listen.go:110 ⋮ [T3,Vtest-tenant,nsql8] 86 server shutting down: instructing cmux to stop accepting
W240328 05:59:30.058007 119 kv/kvclient/kvtenant/connector.go:416 ⋮ [T3,Vtest-tenant,nsql8,tenant-connector] 87 error consuming GossipSubscription RPC: recv msg error: grpc: ‹context canceled› [code 1/Canceled]
W240328 05:59:30.058337 120 kv/kvclient/kvtenant/setting_overrides.go:66 ⋮ [T3,Vtest-tenant,nsql8,tenant-connector] 88 error consuming TenantSettings RPC: recv msg error: grpc: ‹error reading from server: read tcp 10.142.1.68:58156->34.75.192.56:29000: use of closed network connection› [code 14/Unavailable]
W240328 05:59:30.065986 4943 server/server_sql.go:1707 ⋮ [T3,Vtest-tenant,nsql8] 89 server shutdown without a prior graceful drain
I240328 05:59:30.066059 1 1@cli/start.go:1106 ⋮ [T3,Vtest-tenant,nsql8] 90 server drained and shutdown completed
Not sure what to make of it.
Logs are also full of messages like
teamcity-14594861-1711603819-09-n4cpu4-0002> I240328 05:59:16.431516 34627 kv/kvserver/replica_command.go:473 ⋮ [T1,Vsystem,n2,split,s2,r88/3:‹×›] 198 initiating a split of this range at key /Tenant/3/Table/104/1/‹×›/‹×› [r90] (303 MiB above threshold size 98 KiB)
teamcity-14594861-1711603819-09-n4cpu4-0002> I240328 05:59:16.730284 34614 kv/kvserver/replica_command.go:473 ⋮ [T1,Vsystem,n2,split,s2,r89/3:‹×›] 199 initiating a split of this range at key /Tenant/3/Table/104/1/‹×›/‹×› [r91] (304 MiB above threshold size 98 KiB)
This is because the test issues this command:
ALTER TABLE t CONFIGURE ZONE USING range_min_bytes = 1000,range_max_bytes = 100000;
It also spins up 20 SQL instances, and a separate goroutine for each SQL instance that continuously INSERTs a row roughly 8KiB in size.
I wonder whether the storage cluster simply got overwhelmed with the sheer number of ranges (there are 8373 messages matching "above threshold size" pattern in the logs). I think that the test need to be audited to not produce this many ranges and stop INSERT goroutines at some point.
For now, I'll move it to the backlog since I don't think we've seen a similar reproduction before.
Also see https://github.com/cockroachdb/cockroach/issues/126313#issuecomment-2229464193 for more insight.
roachtest.multitenant/distsql/instances=20/bundle=on/timeout=0 failed with artifacts on master @ 460707194644edbf4c6258d09b46d9877c8eefa6:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=4
ROACHTEST_encrypted=false
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
/cc @cockroachdb/sql-queriesThis test on roachdash | Improve this report!
Jira issue: CRDB-37158