cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com
Other
29.64k stars 3.71k forks source link

sql/instancestorage: TestRefreshSession tries to use logic that is not copacetic #109410

Open knz opened 11 months ago

knz commented 11 months ago

Describe the problem

When pointing TestRefreshSession to a secondary tenant, we get the following error:

    sql_runner.go:322: error executing 'SELECT count(*) FROM system.sql_instances WHERE id = 1 AND session_id <> decode('010180bafc8a4b9cfa441db9fee7401e7791f6', 'hex')': pq: no healthy sql instances available for planning

Which is hardly surprising, given the test removes the rows just before.

It's unclear what the test is trying to do -- removing the rows in that table will block SQL planning, which will prevent further SQL.

Expected behavior

The test works equally well when pointed to a secondary tenant.

cc @dt

Jira issue: CRDB-30910

Epic CRDB-26687

rafiss commented 11 months ago

@dt do you have additional context for this test to help us understand what we should do here?

dt commented 11 months ago

I think this test is supposed to observe the instance row be recreated by the sql server after a lease hiccup.

I thought that distsql was supposed to fallback to a local flow instead of returning no healthy sql instances available for planning?

rafiss commented 11 months ago

I don't know - @yuzefovich you might be a good person to answer the above question.

dt commented 11 months ago

For the purposes of this test though, which isn't really about whether queries run or not with or without populated instances, but rather whether instance rows are populated at all, maybe we should just wrap the query in a SucceedsSoon() ?

knz commented 11 months ago

NB: yahor is on PTO for this week and beginning of next.

rafiss commented 11 months ago

@yuzefovich friendly ping on this now that you have returned. i think the SucceedsSoon change would make sense, but just checking if you thoughts on the above questions.

as an aside, i was looking for context on this and found some draft sqlliveness cleanups that we may want to incorporate still: https://github.com/cockroachdb/cockroach/pull/71620

yuzefovich commented 11 months ago

I thought that distsql was supposed to fallback to a local flow instead of returning no healthy sql instances available for planning?

This was an oversight, and I made some improvements in #110221 (which could be extended to retry "no healthy sql instances available for planning" errors but currently doesn't) and #110222 (which removes that error altogether), but the test still fails with

    instancestorage_test.go:354: condition failed to evaluate within 45s: sql: database is closed

(I added SucceedsSoon). This error seems to suggest that sqlDB connection is closed?