Closed cockroach-teamcity closed 5 months ago
This failed last week here, which I chalked up to a flake.
Given the newest failures, I'm going to give this a closer look.
I was able to reproduce locally under --stress --race
.
From what I can tell, the test cluster/server is receiving signals to shutdown before the test completes. The specific call that's failing alone does not seem to be problematic.
It seems that this query is failing to finish executing. When we attempt to read through the result set, it.Next()
returns the error:
‹admin-show-grants›: failed to read query result: query execution canceled
Looking at the timestamps in the logs, quite a bit of time passes from the time the test cluster is initialized, until the TableDetails endpoint is invoked. However, the amount of time spent within the TableDetails endpoint before the test cluster begins shutting down is only ~10 seconds, which should not trigger the request timeout.
admin-show-grants
queryadmin-show-grants
query returnsSeems like the endpoint itself is responding in an acceptable amount of time. However, the amount of time it takes for the test to reach the point where it makes a request seems to be too long.
Plenty of things happen between the test's first interaction with the test cluster and the timeout during the specific call to TableDetails
. I'm going to see if I can pinpoint the portion of the test that's taking so long.
I added some timers to various parts of the test.
api_v2_sql_schema_test.go:85: Table & DB creation time - start: 18:42:56.787315, duration 22.221347291s
api_v2_sql_schema_test.go:97: /databases/ call - start: 18:43:37.780579, duration 1.719976542s
api_v2_sql_schema_test.go:112: /databases/testdb/ call - start: 18:43:39.50957, duration 3.060485459s
api_v2_sql_schema_test.go:125: /databases/testdb/grants/ call - start: 18:43:42.642032, duration 1.945669334s
api_v2_sql_schema_test.go:139: /databases/testdb/tables/ call - start: 18:43:44.588601, duration 2.728289625s
api_v2_sql_schema_test.go:156: /databases/testdb2/tables/ call (should 404) - start: 18:43:47.317876, duration 515.812625ms
...and the cluster log indicating that the cluster has received a signal to shut down:
I240520 18:43:57.957227 51 testutils/testcluster/testcluster.go:148 ⋮ [-] 1191 TestCluster quiescing nodes
22 seconds for creation of DBs/tables/users/roles required by the test feels a bit long, but is that the true problem?
From the first moment the test begins to interact with the test cluster until the test cluster receives a signal to shut down, only around ~1 minute passes.
I'm going to look into the default timeouts and see if this is expected, and/or if we need to increase the test timeout.
@abarganier I think it's significant that the test fails right after a test that checks for a 404 Not Found response. Is it possible that the helper we use to retrieve table data leaks resources in the error case? That could cause this odd error result.
This is the handler which looks innocuous: https://github.com/cockroachdb/cockroach/blob/c5e3cf49403c017c370d10d31071244835117c75/pkg/server/api_v2_sql_schema.go#L477-L497
But the tableDetailsHelper
is quite hefty and has all sorts of stuff going on.
server.TestDatabasesTablesV2 failed on master @ 93ad913106b6f0f6ec98bc2cfa788ff6d8085bd4:
Parameters:
attempt=1
race=true
run=2
shard=5
Help
See also: How To Investigate a Go Test Failure (internal)
/cc @cockroachdb/obs-prs @cockroachdb/serverThis test on roachdash | Improve this report!
Jira issue: CRDB-38868