cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: acceptance/multitenant failed #132068

Open cockroach-teamcity opened 1 week ago

cockroach-teamcity commented 1 week ago

roachtest.acceptance/multitenant failed with artifacts on release-23.2.13-rc @ 74520af5a4280d723e7434684cca170c6c7b9d5a:

(cluster.go:2110).StartServiceForVirtualCluster: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/acceptance/multitenant/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-42814

blathers-crl[bot] commented 1 week ago

cc @cockroachdb/test-eng

renatolabs commented 1 week ago

This is on a 23.2 release branch, and this test hasn't been updated on this branch. Start failed here because the tenant "randomly" decided to shut down:

I241007 07:23:28.329466 49 1@server/tenant.go:298 ⋮ [nsql?] 11  server starting for tenant "2"
I241007 07:23:28.330835 50 util/cidr/cidr.go:304 ⋮ [T2,nsql?] 12  CIDR lookup updated with 0 destinations
I241007 07:23:28.446907 1 1@cli/start.go:997 ⋮ [T2,nsql?] 13  initiating hard shutdown of server
I241007 07:23:28.446964 1 1@cli/start.go:1075 ⋮ [T2,nsql?] 14  too early to drain; used hard shutdown instead

I don't see any errors in the logs (system or tenant), so I don't quite understand what happened.

I'm not sure Test Eng can do anything tangible with this. @cockroachdb/disaster-recovery @cockroachdb/multi-tenant @cockroachdb/server Is there a way to find out why the tenant decided to shut down shortly after it started?

renatolabs commented 1 week ago

Interestingly, these tests also failed on 23.2 (not the rc branch) with the same symptoms: #132067, #132066, #132065, #132064. These failures also all happened on Azure 🤔.

This points to some kind of weird cloud-related issue, but I still find it confusing that there are no logs for the error that caused the tenant to decide to shut down.

msbutler commented 4 days ago

I'm throwing this over to dbserver to assess if there is additional logging we could add to understand what happened.