Closed alamb closed 1 year ago
I am marking this as "good first issue" as I think it is fairly self contained (the runner should only have to interact with DataFusion as a user) so internal knowledge isn't needed, but you would get a lot of experience with writing a rust tool, CI, etc)
And you would get huge props from the community I think
I am willing to help shepherd this project / review PRs / help with contributions
Hey @alamb, I'm trying to get more involved in the project and this looks like something more substantial that I can take on!
Let me know if that works with you and I can start poking around in the duckdb and influxdb approaches to data-driven tests.
That would be awesome @mvanschellebeeck -- I think the idea of poking around ("background research" 😆 ) is a great way to start
Also note there may be others who are interested but are not online due to their timezones so I think it would be good to ensure they can collaborate as well
Nice to see the issue, thanks @alamb !
Test completeness is important for a database in the early stages of development, not only to ensure that new features can be tested completely but also to ensure that existing features are not broken in the rapid iteration process.
During our involvement in databend development, logicaltest brought a lot of benefits and made us feel confident about our query results. (FYI: https://databend.rs/doc/contributing/rfcs/new_sql_logic_test_framework.)
Currently, databend has leveraged crdb, ydb, and duckdb test cases. Thanks for the open-source community!
I also opened a related issue last year (https://github.com/apache/arrow-datafusion/issues/1453).
For arrow-datafusion, I think we can use rust to build our logictest framework, sqllogictest-rs may be a good candidate. After the framework is finished, we can leverage community's power to rich our logictest cases from other databases :)
@xudong963 -- using https://github.com/risinglightdb/sqllogictest-rs (either directly or forking it) sounds like a great idea!
Thank you for the write up
Thanks @xudong963. sqllogictest-rs looks an awesome starting point!
Looks like DuckDB runs an extended version of SQLLgic tests:
For testing plain SQL we use an extended version of the SQL logic test suite, adopted from SQLite
We could develop a pretty solid testing foundation with sqllogictest-rs as the parser and runner.
We could migrate a large chunk of tests with a clever regex that captures on the query and result variables and manually handle some hard-to-automate edge cases like this one (which creates a few tables) and this one (that expects a failure).
On using sqllogictest-rs vs forking it - using it directly seems like a good first step to get the test suite up and running. Though I can imagine some datafusion-specific queries (that @alamb noted) will eventually force us to move onto a fork in the future. As an example - Materialize created their own sqllogictest driver, mz_sqllogictest to deal with this.
@xudong963 @alamb
tests/sqllogictests
folder at the root of arrow-datafusionsqllogitest-rs
and datafusion-cli
datafusion-cli
Hi, I'm one of the maintainers in sqllogictest-rs. I'm very happy that arrow-datafusion and databend are interested in the project.
Currently, the project is only used by risinglight and risingwave, while the maintainers for these three projects overlap highly. However, we really hope to increase the diversity of the sqllogictest-rs community.
There are several components in sqllogictest-rs:
AsyncDB
.Currently, the parser part is generally stabilized and we add very few extensions to the original syntax, the most important thing is the include
macro, which is useful for reusing some test codes.
Currently, the trait is easy and general enough, it only accepts a SQL string and returns a concatenated result. This means that it's compatible with different SQL syntaxes (Postgres, MySQL, or even Spanner), but know very few things about the data type information. We even can't check whether the result types are correct for queries.
We'll improve that, but I'm not sure if should we generalize the data types between different databases, it looks like hard work.
A simple and general part.
We have several ways of working together.
AsyncDB
and the Postgres
implementations better.We would like to donate the project to some org if it would help our collaboration.
Thanks for your interest in sqllogictest-rs!
As a v0, how does this approach sound?
@mvanschellebeeck -- I think it sounds like a great idea and I can't wait to see it. Thank you so much
@TennyZhuang -- thank you very much for the writeup and explanation -- it sounds great.
Use the same project, and work together to make the AsyncDB and the Postgres implementations better.
I think this sounds like a great way to start. We can try to implement AsyncDB for DataFusion and see how it goes and propose extensions to AsyncDB
if needed 🤔
Make the project even more general, and open to every database interface. (Likely not a short-term goal).
I love this idea, though agree it will be much more work;
We would like to donate the project to some org if it would help our collaboration.
From my perspective, given that sqllogictest-rs is apache licensed using it directly in DataFusion for testing would be totally fine as the only project affected would be DataFusion itself. I think additional considerations apply to new dependencies of DataFusion itself which not only affect DataFusion but all downstream projects that use DataFusion
4. Migrate some tests from sql_integration
Others are good to me, but for this, I think maybe we don't migrate them first, I can directly add new tests from other DBs.
Others are good to me, but for this, I think maybe we don't migrate them first, I can directly add new tests from other DBs.
That would also be fine. One of the benfits of migrating a few of the tests from sql_integration would be to demonstrate that the harness could (eventually) subsume them all.
Hi @mvanschellebeeck, are you doing it?
Hi @mvanschellebeeck, are you doing it?
Yep! I can get out a draft PR today/tomorrow if that works with you?
@mvanschellebeeck Nice,no hurry, keep your step.
hey @xudong963, @alamb,
I've got a draft PR up here: https://github.com/apache/arrow-datafusion/pull/4395
In general, sqllogictests operate on a workflow where the .slt
file creates tables, inserts values and then queries the values for expected results. From what I can tell, CREATE TABLE and INSERT INTO are not currently supported in datafusion so I raised #4396 and #4397 respectively. After they're implemented we'll be able to incorporate a lot of the existing sqllogictests.
I've started migrating some of the tests from aggregate.rs into an sqllogictest file, aggregate.slt. The queries here require a specific datafusion setup which I moved into setup.rs.
Some thoughts as I worked through this:
format_batches
and run_query
at the bottom of this filedatafusion-cli wasn't that easy to use as a library without Command wizardry so I resorted to datafusion-core functions with format_batches and run_query at the bottom of this file
I think hooking it up as you have it is the right way to go. We can look into testing datafusion-cli specific stuff as a follow on project
sqllogictests can not test the output column names (that are currently testing in integration tests)
I think perhaps we could leave a few high level rust #test
style tests for validating the column names and move the rest over
still figuring out a few TODOS in the aggregate.slt file
👍
I think now that we have merged https://github.com/apache/arrow-datafusion/pull/4395 we should close this PR and I will write a follow on ticket / epic to track follow on improvements. Thanks again @xudong963 @mvanschellebeeck and everyone else who chimed in
Filed https://github.com/apache/arrow-datafusion/issues/4460 for follow on work
Is your feature request related to a problem or challenge? Please describe what you are trying to do. I would like to ensure that DataFusion gets the correct answers for SQL queries (especially in tricky corner cases like the one described in https://github.com/apache/arrow-datafusion/issues/4211)
From experience, both in DataFusion and in prior jobs, the effort required to maintain tests (both to add new tests as well as update existing tests) is substantial. Making it easier to add new tests and maintain existing ones will help us keep up velocity.
Right now, we have two sql integration style tests:
integration
test from @Jimexist 🦾 https://github.com/apache/arrow-datafusion/tree/master/integration-tests: Runs a limited number of queries against data in both postgres and datafusion and compares the resultssql_integration
test https://github.com/apache/arrow-datafusion/tree/master/datafusion/core/tests/sql: test setup, execution, and verification is written in rust.The challenge with sql_integration test is that to add new tests or update existing ones, we need to change rust code and recompile, which takes a loong time
Likewise, the integration test requires that the results are exactly the same as postgres which is not possible in all cases (like when testing for unsigned types, which postgres doesn't support, or testing some DataFusion specific thing)
Describe the solution you'd like I would like some sort of data driven test to replace sql_integration
You can see this style of test in duckdb: https://github.com/duckdb/duckdb/blob/master/test/sql/join/empty_joins.test
My ideal solution would be to implement a runner (ideally the same as SQLLogicTests from DuckDB)
I implemented a impler version of this approach in https://github.com/influxdata/influxdb_iox/blob/main/query_tests/README.md which runs sql queries from a file and compares the result to known output. I think the duckdb way is superior
Describe alternatives you've considered Leave things the same
Additional context Add any other context or screenshots about the feature request here.