Some initial interceptor implementations

waynexia commented 1 year ago

Describe This Problem

"Interceptor" is a hook point in sqlness, which allows people to write tests with extra functionality. E.g., provide per-case config, parameterized SQL or resource limitation etc.

Proposal

Some interceptable points I can think of are:

before query execution
after query execution
before one .sql file start

And I plan to add one simple interceptor implementation for each point as an example, and to see if this mechanism actually works.

Interceptor Name	Stage	Description
parameter	before execution	Allow to pass K-V like parameters as execution context to DB. Some pre-query setting can be done in this way, like query mode (e.g., whether oracle-compatible in MySQL) or current user.
replace-result	after execution	Allow to modify the query result with sed-like grammar. This is useful when you only care about some parts of the result, or your query will generate some random values.
time-limit	file level	Limit the total execution time for a `.sql` case

Additional Context

No response

jiacai2050 commented 1 year ago

before query execution

In datafusion, it support set statement to do this kinds of job.

https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sqllogictests/test_files/information_schema.slt#L30

As for other two cases, they seems not very useful in practice.

statement error Error during planning: SHOW TABLES is not supported unless information_schema is enabled
SHOW TABLES

Also I find this error declare is useful when our sql files have lot of cases, and we need to check which SQL will throw error.

waynexia commented 1 year ago

In datafusion, it support set statement to do this kinds of job.

Yes, parameter is to provide something like this. DataFusion does support this, and as well as MySQL, PostgreSQL. But I'm doubting if this is a standard SQL grammar and, to my opinion, it's not a conflict for us (a test framework) to support this as well. Consider these scenarios:

My database is distributed on A, B and C, I set data distribution rules for my table and I want to verify it. I execute a distributed insert, and want to connect to A, B, and C to do a non-distributed query. I want to know what query will be sent to which instance.
My database implementation simply does not support SET, but I want to change my configuration via APIs other than SQL like HTTP.

Those are two practical examples I come up with, and are relying on test context. The parameter are the way to provide context to each SQL.

As for other two cases, they seems not very useful in practice.

Post-process like replace-result is widely used in MySQL's integration test. It's very normal for a query to results in random values. What if you query random(), current_time(), or system table for current memory consumption? I'm not care about the actually value, but just want to make sure they works, or can match some fixed pattern.

Also I find this error declare is useful when our sql files have lot of cases, and we need to check which SQL will throw error.

If I read it correctly, the error message will present in .result file next to the query resulting it?

jiacai2050 commented 1 year ago

My database is distributed on A, B and C, I set data distribution rules for my table and I want to verify it. I execute a distributed insert, and want to connect to A, B, and C to do a non-distributed query. I want to know what query will be sent to which instance.

This example make sense to me, as for other aspect, I think we can wait until some real world issue arise .

The first principle I obey is to stick with SQL, if SQL can fix it, then we don't have to, it bring little value to this project IMO.

If I read it correctly, the error message will present in .result file next to the query resulting it?

I usually check SQL file to see how many cases have been added, but I can't tell the bad cases from the good one, I don't care what this SQL will output, I only want to know which SQL will throw error.

Maybe we can add a ERR: <reason> special syntax to declare it(this may belong to your first proposal parameter).

-- ERR: don't support XX type now
create table t(a XX);

waynexia commented 1 year ago

My database is distributed on A, B and C, I set data distribution rules for my table and I want to verify it. I execute a distributed insert, and want to connect to A, B, and C to do a non-distributed query. I want to know what query will be sent to which instance.

This example make sense to me, as for other aspect, I think we can wait until some real world issue arise .

Nice, I'll start working on this one.

The first principle I obey is to stick with SQL, if SQL can fix it, then we don't have to, it bring little value to this project IMO.

I agree. But how do I test a random() function or results contains timestamp. I think this is also a necessary part to accomplish SQL's functionality.

If I read it correctly, the error message will present in .result file next to the query resulting it?

I usually check SQL file to see how many cases have been added, but I can't tell the bad cases from the good one, I don't care what this SQL will output, I only want to know which SQL will throw error.

Maybe we can add a ERR: <reason> special syntax to declare it(this may belong to your first proposal parameter).
-- ERR: don't support XX type now
create table t(a XX);

That makes sense. We should preserve the comment in the result file. (maybe comment is enough? Is it necessary to add new syntax?)

jiacai2050 commented 1 year ago

I agree. But how do I test a random() function or results contains timestamp. I think this is also a necessary part to accomplish SQL's functionality.

It's possible to remove random value in this way

select count(random())

That makes sense. We should preserve the comment in the result file. (maybe comment is enough? Is it necessary to add new syntax?)

I prefer hard requirement, comment is optional.

waynexia commented 1 year ago

I agree. But how do I test a random() function or results contains timestamp. I think this is also a necessary part to accomplish SQL's functionality.

It's possible to remove random value in this way
select count(random())

Then I'd ask as a user, why this framework doesn't support SQL like random() and force me to write redundant SQL?

And the second use case, how do I test a system table query? E.g., I expect to see the node info, but want to ignore the memory usage to each node because it changes time by time. Do I have to project out that column? Then what about if my query contains timestamp column? I don't think this way is a good practice. "fix it by SQL" should be "I want to project some column, and the SQL does support projection", but not "I don't want to project column, but the framework forces me to make projection". The SQL query result is naturally unstable in most cases. That is the point we should stick to and add support for.

Require every SQLs to be stable reproducible because we are a text comparison based framework. That doesn't look good 😥

That makes sense. We should preserve the comment in the result file. (maybe comment is enough? Is it necessary to add new syntax?)

I prefer hard requirement, comment is optional.

What do you mean by hard-requirement? Isn't the original goal to make the case readable?

jiacai2050 commented 1 year ago

Then I'd ask as a user, why this framework doesn't support SQL like random() and force me to write redundant SQL?

I'm not very strong about this, I just think we should keep result as 'static' as possible, replace-result interceptor indeed has its value, and wait to see your implementation.

What do you mean by hard-requirement? Isn't the original goal to make the case readable?

hard requirement means we must add -- ERR: xx prefix for bad sql, otherwise next SQL should not throw error.

I think it's not conflict with readability, instead it make more clear what the SQL will do.

waynexia commented 1 year ago

hard requirement means if we must add -- ERR: xx prefix for bad sql, otherwise next SQL should not throw error.

I think it's not conflict with readability, instead it make more clear what the SQL will do.

Got it. This is already covered by now? If a query returns an error, it result would be something like ERR: xxx (depending on how the user formats their result). And re-run it should also return that error.

jiacai2050 commented 1 year ago

Got it. This is already covered by now?

Not exactly, we have to check result file to ensure this, what I want is to declare this in original SQL

statement error DataFusion error: Error during planning: table 'datafusion.information_schema.tables' not found
SELECT * from information_schema.tables

Like this in sqllogcitest.

waynexia commented 1 year ago

Oh, like "should_panic" or "statement err", right? We can have a try. But I'm not sure if we should also take xxx into consideration, as it's duplicated with the follow-up full-text match. Maybe left the reason part to only be a comment?

jiacai2050 commented 1 year ago

Oh, like "should_panic" or "statement err", right?

Yes.

Maybe left the reason part to only be a comment?

Comment seems OK to me.

CeresDB / sqlness