brandur / sorg

A Go-based static site generator that compiles brandur.org.
MIT License

Feature Casualties of Large Databases: test suite for checking indices #285

Open mitar opened 2 years ago

mitar commented 2 years ago

In Feature Casualties of Large Databases you write:

A communication protocol that allows the query to signal out-of-band with a query’s results that it didn’t run particularly efficiently, say that it got results but wasn’t able to make use of an index. This would allow a test suite to fail early by signaling the problem to a developer instead of finding out about it in production.

But is this really necessary? I mean, if it's used only inside a test suite, then you could instrument your tests so that EXPLAIN is first called on every query, asserting which indices it will use? Or assert that only index scans are used. Then your test assertions fail like any other assertion when the plan changes (for example, because the query is changed in the future)?

So I think this is mostly a tooling issue: test frameworks just do not expose such assertions readily.
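To make the idea concrete, here is a minimal sketch of such a test assertion. It uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in for Postgres's `EXPLAIN` (the table, column, and index names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX users_email_idx ON users (email)")

def assert_uses_index(conn, query, index_name):
    # EXPLAIN QUERY PLAN returns rows whose last column describes each step,
    # e.g. "SEARCH users USING INDEX users_email_idx (email=?)".
    plan = "\n".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query))
    assert index_name in plan, f"expected {index_name} in plan:\n{plan}"

# Passes: the equality predicate on email can use the index.
assert_uses_index(conn,
                  "SELECT * FROM users WHERE email = 'a@example.com'",
                  "users_email_idx")
```

A real Postgres version would run `EXPLAIN (FORMAT JSON)` and walk the plan tree instead of string-matching, but the shape of the test is the same.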

brandur commented 2 years ago

I'm not 100% sure that it's the right answer really, but wouldn't a broader solution be kind of nice so that not every database driver in every language has to bake in a post-query EXPLAIN?

It doesn't really matter anyway though since even if it were a good idea, such a feature is vanishingly unlikely to ever happen :)

mitar commented 2 years ago

Sure, support for this from the database would be great. Or even being able to configure the database (maybe per connection/session) so that queries which do not use indices simply fail. Then you could enable that at the beginning of the connection/session (in testing or in production). So maybe there is no need to signal anything to the driver; you simply enable the behavior.
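No mainstream database exposes such a per-session "fail on non-index queries" setting today, but it can be emulated client-side. A toy sqlite3 sketch of what that session option might feel like (the wrapper class and schema are invented for illustration):

```python
import sqlite3

class IndexEnforcingConnection:
    """Wraps a connection and rejects SELECTs whose plan contains a full SCAN.
    A client-side stand-in for the hypothetical per-session database setting."""

    def __init__(self, conn):
        self._conn = conn

    def execute(self, sql, params=()):
        if sql.lstrip().upper().startswith("SELECT"):
            for row in self._conn.execute("EXPLAIN QUERY PLAN " + sql, params):
                # The plan detail starts with "SCAN" for a full table scan.
                if row[-1].startswith("SCAN"):
                    raise RuntimeError(f"query would not use an index: {row[-1]}")
        return self._conn.execute(sql, params)

raw = sqlite3.connect(":memory:")
raw.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
raw.execute("CREATE INDEX events_kind_idx ON events (kind)")
conn = IndexEnforcingConnection(raw)

conn.execute("SELECT * FROM events WHERE kind = ?", ("click",))  # ok: uses the index
try:
    conn.execute("SELECT * FROM events WHERE id + 0 = 1")  # expression defeats the index
except RuntimeError as e:
    print("rejected:", e)
```

Enabling the wrapper only in the test environment gives roughly the "fail early" behavior described above without any driver changes.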

But I was thinking more about what can be done at the project level, if one wants this. And sqlc seems well positioned for something like that: it already parses queries, so it could also extract all the queries inside a function call and pass them through EXPLAIN during testing, asserting properties beyond just typing.

brandur commented 2 years ago

But I was thinking more about what can be done at the project level, if one wants this. And sqlc seems well positioned for something like that: it already parses queries, so it could also extract all the queries inside a function call and pass them through EXPLAIN during testing, asserting properties beyond just typing.

Yeah, it might be kind of neat if you could have a plugin system that would allow you to gather the extra EXPLAIN statistics post-query.

Another issue of course is that the planner often behaves differently when tables are close to empty, which is the likely state of your development or testing environment, so you'd almost have to run things against production to get the right explain results, or artificially fill a database to make it more production-y. Definitely possible, but makes the problem harder again.
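The empty-table effect is easy to see directly. A small sqlite3 sketch (schema invented for illustration): the same query is explained before and after seeding production-like data and running ANALYZE. Whether the plan actually changes depends on the engine and version, which is exactly the point: EXPLAIN assertions are only trustworthy against realistic data and statistics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT, payload TEXT)")
conn.execute("CREATE INDEX jobs_state_idx ON jobs (state)")

def plan(query):
    # Concatenate the EXPLAIN QUERY PLAN detail column into one string.
    return " | ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + query))

query = "SELECT * FROM jobs WHERE state = 'done'"
before = plan(query)  # empty table, no statistics: the planner trusts the index

# Seed a distribution where the predicate matches every row, then refresh
# the planner's statistics -- now the index buys nothing for this query.
conn.executemany("INSERT INTO jobs (state, payload) VALUES (?, ?)",
                 [("done", "x")] * 10_000)
conn.execute("ANALYZE")
after = plan(query)

print("before:", before)
print("after: ", after)
```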

mitar commented 2 years ago

In general I think that we need a Babel for SQL, which would allow transforming SQL before it hits the database, through a series of plugins. For example, a plugin which adds/ensures access checks by transforming the SQL to add all those conditions. Or a plugin which makes SQL case sensitive, so that you do not have to escape identifiers with quotes all the time if you prefer camelCase identifiers. Or a plugin which automatically converts SQL query results into JSON, so you do not have to add all the array_agg calls. I tried to start this once, but the state of independent SQL parsers/deparsers is pretty sad. SQLite uses a custom grammar tool, and PostgreSQL now has libpg_query, but because it really uses the PostgreSQL parser it, for example, loses case sensitivity already at parse time.
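A toy sketch of the plugin pipeline idea: each plugin is a function from SQL to SQL, applied in order before the query reaches the database. This one works on strings purely for brevity; a real "Babel for SQL" would of course transform a parse tree. All names (the ACL plugin, the `docs` schema) are invented for illustration:

```python
import sqlite3

def acl_plugin(tenant_id):
    # Hypothetical plugin: constrain every query to one tenant's rows.
    def transform(sql):
        sep = " AND " if " WHERE " in sql.upper() else " WHERE "
        return sql + sep + f"tenant_id = {tenant_id}"
    return transform

def transform_sql(sql, plugins):
    # Apply each plugin in order, like a compiler pass pipeline.
    for plugin in plugins:
        sql = plugin(sql)
    return sql

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, tenant_id INTEGER, title TEXT)")
conn.executemany("INSERT INTO docs (tenant_id, title) VALUES (?, ?)",
                 [(1, "a"), (1, "b"), (2, "c")])

sql = transform_sql("SELECT title FROM docs", [acl_plugin(1)])
rows = conn.execute(sql).fetchall()
print(sql)    # SELECT title FROM docs WHERE tenant_id = 1
print(rows)   # [('a',), ('b',)]
```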

I think this is why we have such a wide proliferation of custom query languages. We would not need GraphQL if you could just write SQL on the client and have a parser which parses it, gives you typing information, and makes it easy to transform it, stitch it with another query, and so on. Then GraphQL would just be SQL on the client plus a plugin on the server which forbids SELECT * (by parsing queries). :-)

brandur commented 2 years ago

Yeah, a generalized SQL parser would be pretty neat — I feel like libpg_query took us really far in the right direction, but overall I'm sure there's still room for improvement.

There might be some performance implications to giving users direct access to something as powerful as SQL, but yeah, that'd be the dream. A single custom SQL statement is expressive enough to get data that would otherwise only be available by fetching thousands of different resources across collections in a REST API.

mitar commented 2 years ago

There might be some performance implications to giving users direct access to something as powerful as SQL

GraphQL has exactly the same problem, and you then have to employ limits on the complexity you want to allow. There are libraries for this, too. Or this approach. Really, GraphQL is reinventing SQL. :-)

Anyway, what I am trying to say is that if you have an SQL parser, then you can also compute the complexity of the SQL and restrict it. I mean, the query planner is already doing that.
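As a toy illustration, one crude complexity proxy that needs no separate parser: the number of bytecode ops SQLite compiles a query into (via plain `EXPLAIN`). A real gatekeeper would score the parse tree or the planner's cost estimate instead; the helper names here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)")

def complexity(conn, sql):
    # Crude proxy: count the bytecode ops SQLite compiles the query into.
    return sum(1 for _ in conn.execute("EXPLAIN " + sql))

def run_limited(conn, sql, max_ops=50):
    # Refuse to run anything over the complexity budget.
    ops = complexity(conn, sql)
    if ops > max_ops:
        raise RuntimeError(f"query too complex: {ops} ops > {max_ops}")
    return conn.execute(sql)

simple = complexity(conn, "SELECT v FROM t WHERE id = 1")
joined = complexity(conn,
    "SELECT a.v FROM t a JOIN t b ON a.v = b.v JOIN t c ON b.v = c.v")
print(simple, joined)  # the triple self-join compiles to more ops
```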

A single custom SQL statement is expressive enough to get data that would otherwise only be available by fetching thousands of different resources across collections in a REST API.

And you could then have a plugin which converts your regular SQL query into one which returns related tables as embedded objects/JSON, so that the result is the nicely nested structure you would expect (and not a table with many repeated rows).
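Here is roughly what the output of such a nesting plugin might look like, sketched by hand in sqlite3 (assuming the JSON1 functions are available, which they are in modern builds; the schema is invented): the joined child rows are aggregated into an embedded JSON array instead of repeating the parent row.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ana'), (2, 'bo');
    INSERT INTO posts VALUES (1, 1, 'first'), (2, 1, 'second'), (3, 2, 'third');
""")

# One row per author, with that author's posts embedded as a JSON array --
# what a nesting plugin might rewrite "SELECT ... JOIN ..." into.
sql = """
    SELECT a.name,
           json_group_array(p.title) AS posts
    FROM authors a
    JOIN posts p ON p.author_id = a.id
    GROUP BY a.id
    ORDER BY a.id
"""
for name, posts in conn.execute(sql):
    print(name, json.loads(posts))
```

Postgres's `json_agg`/`json_build_object` support the same shape, which is what tools like PostgREST and Hasura lean on under the hood.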

What I dream of is an SQL transformation engine where an API endpoint is really just an exposed prepared statement which has been transformed to do ACL checking, mapping of parameters to SQL placeholders (PostgreSQL has many nice parsing functions), and conversion of results into JSON, which is then just returned to the client.

And then you could also have a free-form endpoint against which you can send arbitrary SQL (like you can send arbitrary GraphQL). The API provider can then decide that some commonly used queries against the free-form endpoint should be promoted to prepared statements.
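Stripped of the transformation machinery, the "endpoint = exposed prepared statement" idea is small enough to sketch. Everything here (the registry, the endpoint name, the schema) is hypothetical; the transformation passes described above would run once at registration time, not per request:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, org_id INTEGER);
    INSERT INTO users VALUES (1, 'ana', 10), (2, 'bo', 20);
""")

# An "endpoint" is just a named, parameterized statement. Imagine the
# org_id condition was injected by an ACL plugin at registration time.
ENDPOINTS = {
    "user_by_id": "SELECT id, name FROM users WHERE id = :id AND org_id = :org_id",
}

def call_endpoint(name, **params):
    # Execute the named statement and serialize the rows to JSON.
    cur = conn.execute(ENDPOINTS[name], params)
    cols = [c[0] for c in cur.description]
    return json.dumps([dict(zip(cols, row)) for row in cur.fetchall()])

print(call_endpoint("user_by_id", id=1, org_id=10))  # [{"id": 1, "name": "ana"}]
```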

I mean, all of that has already been done by Hasura.

The only difference is that they take GraphQL as input rather than raw SQL. But that is just because they lack a good SQL parser. :-)