cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.05k stars 3.8k forks source link

server: TestDatabasesTablesV2 failed #124395

Closed cockroach-teamcity closed 5 months ago

cockroach-teamcity commented 5 months ago

server.TestDatabasesTablesV2 failed on master @ 93ad913106b6f0f6ec98bc2cfa788ff6d8085bd4:

=== RUN   TestDatabasesTablesV2
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestDatabasesTablesV23023245521
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:144: cluster virtualization disabled in global scope due to issue: #76378 (expected label: C-bug)
    api_v2_sql_schema_test.go:148: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/server/api_v2_sql_schema_test.go:148
            Error:          Received unexpected error:
                            Get "https://127.0.0.1:33161/api/v2/databases/testdb/tables/public.testtable/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
            Test:           TestDatabasesTablesV2
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestDatabasesTablesV23023245521
--- FAIL: TestDatabasesTablesV2 (54.20s)

Parameters:

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/obs-prs @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-38868

abarganier commented 5 months ago

This failed last week here, which I chalked up to a flake.

Given the newest failures, I'm going to give this a closer look.

abarganier commented 5 months ago

I was able to reproduce locally under --stress --race.

From what I can tell, the test cluster/server is receiving signals to shutdown before the test completes. The specific call that's failing alone does not seem to be problematic.

It seems that this query is failing to finish executing. When we attempt to read through the result set, it.Next() returns the error:

‹admin-show-grants›: failed to read query result: query execution canceled

Looking at the timestamps in the logs, quite a bit of time passes from the time the test cluster is initialized, until the TableDetails endpoint is invoked. However, the amount of time spent within the TableDetails endpoint before the test cluster begins shutting down is only ~10 seconds, which should not trigger the request timeout.

Seems like the endpoint itself is responding in an acceptable amount of time. However, the amount of time it takes for the test to reach the point where it makes a request seems to be too long.

Plenty of things happen between the test's first interaction with the test cluster and the timeout during the specific call to TableDetails. I'm going to see if I can pinpoint the portion of the test that's taking so long.

abarganier commented 5 months ago

I added some timers to various parts of the test.

api_v2_sql_schema_test.go:85: Table & DB creation time - start: 18:42:56.787315, duration 22.221347291s
api_v2_sql_schema_test.go:97: /databases/ call - start: 18:43:37.780579, duration 1.719976542s
api_v2_sql_schema_test.go:112: /databases/testdb/ call - start: 18:43:39.50957, duration 3.060485459s
api_v2_sql_schema_test.go:125: /databases/testdb/grants/ call - start: 18:43:42.642032, duration 1.945669334s
api_v2_sql_schema_test.go:139: /databases/testdb/tables/ call - start: 18:43:44.588601, duration 2.728289625s
api_v2_sql_schema_test.go:156: /databases/testdb2/tables/ call (should 404) - start: 18:43:47.317876, duration 515.812625ms

...and the cluster log indicating that the cluster has received a signal to shut down:

I240520 18:43:57.957227 51 testutils/testcluster/testcluster.go:148 ⋮ [-] 1191  TestCluster quiescing nodes

22 seconds for creation of DBs/tables/users/roles required by the test feels a bit long, but is that the true problem?

From the first moment the test begins to interact with the test cluster until the test cluster receives a signal to shut down, only around ~1 minute passes.

I'm going to look into the default timeouts and see if this is expected, and/or if we need to increase the test timeout.

dhartunian commented 5 months ago

@abarganier I think it's significant that the test fails right after a test that checks for a 404 Not Found response. Is it possible that the helper we use to retrieve table data leaks resources in the error case? That could cause this odd error result.

This is the handler which looks innocuous: https://github.com/cockroachdb/cockroach/blob/c5e3cf49403c017c370d10d31071244835117c75/pkg/server/api_v2_sql_schema.go#L477-L497

But the tableDetailsHelper is quite hefty and has all sorts of stuff going on.