chaisql / chai

Modern embedded SQL database
MIT License
1.56k stars 95 forks source link

SELECT: If the schema is strict and a projected field doesn't exist, we should return an error #465

Open asdine opened 2 years ago

asdine commented 2 years ago
genji> create table foo(a int);
genji> insert into foo (a) values (1);
genji> select a, b from foo;
{
  "a": 1,
  "b": null
}

An error should be returned instead

The detection must be done during the preparation phase, before executing the query.

tzzed commented 2 years ago

This implies the WHERE clause too.

SELECT a from foo WHERE b > 10; 

should also returns an error.

asdine commented 2 years ago

This implies the WHERE clause too.

Good point, anywhere an expression is used in the SELECT statement (WHERE, GROUP BY, etc.)

Darkclainer commented 2 years ago

I was tackling issue #464 and found that it is more than related to this issue.

Consider we have changed behavior how document.Path (or expr.Path that is the same type) evaluated: now we treat non existing paths as errors. Studying broken tests I noticed next issues:

  1. There is at least one test that affected by the discussed issue. It uses next statement:

    SELECT COUNT(a) FROM test WHERE a < ? GROUP BY b ORDER BY a DESC LIMIT 5

    Subtle thing that there is no a in context of ORDER BY, but only count(a). PR #477 can detect error (not certain what error in particular, because query like SELECT COUNT(a) AS cnt FROM test WHERE a < 3 GROUP BY b ORDER BY cnt DESC LIMIT 5 also emit error)

  2. GROUP BY with complex expression become broken. I didn't examine why.

  3. Legit case where we refer to a non-existent field in the non-strict schema doesn't work (unsurprisingly).

So, the way how KVPairs evaluated is deeply tighten with how statements are evaluated too.

Probably we should think about these issue not like "forbid some behavior in this case", but more like "forbid using undefined identificators everywhere except where non-strict schema used".

asdine commented 2 years ago

I think it might be easier and less expensive to do a static check only once during the preparation phase, right after the query has been parsed, rather than during the evaluation phase of KVPairs and other affected expressions.

Lifecycle of a query

First, here is a simplistic overview of the lifecycle of a query:

query string (ex: SELECT a FROM foo WHERE b > 1) -> parsing (generates a statement object) -> preparation phase (generates a stream object) -> stream optimization -> stream execution

During the preparation phase, we have a full view of the query right after it has been parsed, and we also have access to the table information (what is the schema, is it strict, etc.). The other advantage is that any check done here is done only once, whereas the during the execution phase we might do checks for every document read from the stream.

Proposal

To do this we need to modify this SelectStmt and SelectCoreStmt structures (they are both in the same file): https://github.com/genjidb/genji/blob/487d4b4eec9331c86b4ead7d30aa4eb2e8b1e7f1/internal/query/statement/select.go#L146 https://github.com/genjidb/genji/blob/487d4b4eec9331c86b4ead7d30aa4eb2e8b1e7f1/internal/query/statement/select.go#L15

(to better understand the relationship between these two structures, see this schema in the documentation)

We then simply need to read the context received in the Prepare method and access the table information. If the schema is strict, we use the expr.Walk function to walk over all the expressions of the statement, one by one.

Then we do the same to all other statements (UpdateStmt, DeleteStmt, etc.)

Darkclainer commented 1 year ago

Hello, @asdine! Thanks for great explanation you provided earlier, that was a lot of help to me. We discussed this feature a while ago, but finally I have enough time to give it enough time for some completion.

I made some early draft only for select core case #504 and realized there are a lot of design questions that can affect the scope of PR (one or many) and because I'm by any means no maintainer I want to address them here and then continue to work on this feature, so if you are fine with this, there my thought on this feature.

  1. What should we do about named expressions? I have the feeling that they do not work at all and probably should be done first, but this issue can provide some backbone.

    UPDATE: I had felling that SQL have this feature, but seems I was very wrong, and WHERE not supposed to support aliased expressions. Interestingly this seems relatively easy to implement in Genji.

    CREATE TABLE t (a INT);
    INSERT INTO t (a) VALUES (1);
    
    # currently returns nothing, so this is probably correct, but current
    # validation can generate error to indicate that this is not allowed.
    SELECT a + 1 as b FROM t WHERE b > 1;
    # this works as expected
    SELECT a FROM t WHERE a + 1 > 1;

    Of course this generates a lot of interesting cases, for example with anonymous types:

    CREATE TABLE t ( a ( b INT ));
    INSERT INTO t (a) VALUES ({"b": 1});
    # this works
    SELECT a.b FROM t WHERE a.b > 0;
    # this don't
    SELECT a.b AS e FROM t WHERE e > 0;
  2. Should we do type validation? It certainly possible, but I feel like this is way out of scope and can be safely done later. I mean cases like this:

    CREATE TABEL t (a INT);
    
    # can generate error
    SELECT a == "string" from t;
    
    # also can generate error
    SELECT a FROM t WHERE a == "string";
  3. Probably validation error can be recognized in autocompletion? I didn't examine what feature provide autocompletion, but it can break it?