apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.94k stars 1.13k forks source link

Make a data driven SQL testing tool (so we can reuse duckdb test suite, example) #4248

Closed alamb closed 1 year ago

alamb commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. I would like to ensure that DataFusion gets the correct answers for SQL queries (especially in tricky corner cases like the one described in https://github.com/apache/arrow-datafusion/issues/4211)

From experience, both in DataFusion and in prior jobs, the effort required to maintain tests (both to add new tests as well as update existing tests) is substantial. Making it easier to add new tests and maintain existing ones will help us keep up velocity.

Right now, we have two sql integration style tests:

  1. the integration test from @Jimexist 🦾 https://github.com/apache/arrow-datafusion/tree/master/integration-tests: Runs a limited number of queries against data in both postgres and datafusion and compares the results

The challenge with sql_integration test is that to add new tests or update existing ones, we need to change rust code and recompile, which takes a loong time

Likewise, the integration test requires that the results are exactly the same as postgres which is not possible in all cases (like when testing for unsigned types, which postgres doesn't support, or testing some DataFusion specific thing)

Describe the solution you'd like I would like some sort of data driven test to replace sql_integration

You can see this style of test in duckdb: https://github.com/duckdb/duckdb/blob/master/test/sql/join/empty_joins.test

My ideal solution would be to implement a runner (ideally the same as SQLLogicTests from DuckDB)

  1. Using the same data file format as duckdb (will mean we could reuse their tests without much modification)
  2. Start porting as many of the tests in sql_integration over to this new format as possible)

I implemented a impler version of this approach in https://github.com/influxdata/influxdb_iox/blob/main/query_tests/README.md which runs sql queries from a file and compares the result to known output. I think the duckdb way is superior

Describe alternatives you've considered Leave things the same

Additional context Add any other context or screenshots about the feature request here.

alamb commented 1 year ago

I am marking this as "good first issue" as I think it is fairly self contained (the runner should only have to interact with DataFusion as a user) so internal knowledge isn't needed, but you would get a lot of experience with writing a rust tool, CI, etc)

And you would get huge props from the community I think

I am willing to help shepherd this project / review PRs / help with contributions

mvanschellebeeck commented 1 year ago

Hey @alamb, I'm trying to get more involved in the project and this looks like something more substantial that I can take on!

Let me know if that works with you and I can start poking around in the duckdb and influxdb approaches to data-driven tests.

alamb commented 1 year ago

That would be awesome @mvanschellebeeck -- I think the idea of poking around ("background research" 😆 ) is a great way to start

Also note there may be others who are interested but are not online due to their timezones so I think it would be good to ensure they can collaborate as well

xudong963 commented 1 year ago

Nice to see the issue, thanks @alamb !

Test completeness is important for a database in the early stages of development, not only to ensure that new features can be tested completely but also to ensure that existing features are not broken in the rapid iteration process.

During our involvement in databend development, logicaltest brought a lot of benefits and made us feel confident about our query results. (FYI: https://databend.rs/doc/contributing/rfcs/new_sql_logic_test_framework.)

Currently, databend has leveraged crdb, ydb, and duckdb test cases. Thanks for the open-source community!

I also opened a related issue last year (https://github.com/apache/arrow-datafusion/issues/1453).

For arrow-datafusion, I think we can use rust to build our logictest framework, sqllogictest-rs may be a good candidate. After the framework is finished, we can leverage community's power to rich our logictest cases from other databases :)

alamb commented 1 year ago

@xudong963 -- using https://github.com/risinglightdb/sqllogictest-rs (either directly or forking it) sounds like a great idea!

Thank you for the write up

mvanschellebeeck commented 1 year ago

Thanks @xudong963. sqllogictest-rs looks an awesome starting point!

Looks like DuckDB runs an extended version of SQLLgic tests:

For testing plain SQL we use an extended version of the SQL logic test suite, adopted from SQLite

But with:

  1. A port of the tests from sql_integration
  2. And the utilisation of the huge set existing SQLLite sqllogictests (https://www.sqlite.org/sqllogictest/dir?ci=tip&name=test)

We could develop a pretty solid testing foundation with sqllogictest-rs as the parser and runner.

On Current Test Migration

We could migrate a large chunk of tests with a clever regex that captures on the query and result variables and manually handle some hard-to-automate edge cases like this one (which creates a few tables) and this one (that expects a failure).

On Forking vs Directly using sqllogictest-rs

On using sqllogictest-rs vs forking it - using it directly seems like a good first step to get the test suite up and running. Though I can imagine some datafusion-specific queries (that @alamb noted) will eventually force us to move onto a fork in the future. As an example - Materialize created their own sqllogictest driver, mz_sqllogictest to deal with this.

As a v0, how does this approach sound?

@xudong963 @alamb

  1. Add a tests/sqllogictests folder at the root of arrow-datafusion
  2. Use sqllogitest-rs and datafusion-cli
  3. Implement the sqllogictest-rs runner using datafusion-cli
  4. Migrate some tests from sql_integration
  5. Setup a simple CI workflow to run the tests
TennyZhuang commented 1 year ago

Hi, I'm one of the maintainers in sqllogictest-rs. I'm very happy that arrow-datafusion and databend are interested in the project.

Currently, the project is only used by risinglight and risingwave, while the maintainers for these three projects overlap highly. However, we really hope to increase the diversity of the sqllogictest-rs community.

Components

There are several components in sqllogictest-rs:

  1. The core slt file parser.
  2. The driver trait AsyncDB.
  3. A cli tool to visit files by glob paths, parse, run and compare results in the different logical namespaces.
  4. (WIP/Experimental) A subprocess-based plugin framework to test the compatibility for non-rust drivers, e.g. JDBC.

Parser

Currently, the parser part is generally stabilized and we add very few extensions to the original syntax, the most important thing is the include macro, which is useful for reusing some test codes.

AsyncDB trait

Currently, the trait is easy and general enough, it only accepts a SQL string and returns a concatenated result. This means that it's compatible with different SQL syntaxes (Postgres, MySQL, or even Spanner), but know very few things about the data type information. We even can't check whether the result types are correct for queries.

We'll improve that, but I'm not sure if should we generalize the data types between different databases, it looks like hard work.

Runner

A simple and general part.

Conclusion

We have several ways of working together.

  1. Just fork the project, customize them by yourselves and contributions are always welcome!
  2. Use the sqllogictest-rs parser and customize other things by yourselves.
  3. Use the same project, and work together to make the AsyncDB and the Postgres implementations better.
  4. Make the project even more general, and open to every database interface. (Likely not a short-term goal).

We would like to donate the project to some org if it would help our collaboration.

Thanks for your interest in sqllogictest-rs!

alamb commented 1 year ago

As a v0, how does this approach sound?

@mvanschellebeeck -- I think it sounds like a great idea and I can't wait to see it. Thank you so much

alamb commented 1 year ago

@TennyZhuang -- thank you very much for the writeup and explanation -- it sounds great.

Use the same project, and work together to make the AsyncDB and the Postgres implementations better.

I think this sounds like a great way to start. We can try to implement AsyncDB for DataFusion and see how it goes and propose extensions to AsyncDB if needed 🤔

Make the project even more general, and open to every database interface. (Likely not a short-term goal).

I love this idea, though agree it will be much more work;

We would like to donate the project to some org if it would help our collaboration.

From my perspective, given that sqllogictest-rs is apache licensed using it directly in DataFusion for testing would be totally fine as the only project affected would be DataFusion itself. I think additional considerations apply to new dependencies of DataFusion itself which not only affect DataFusion but all downstream projects that use DataFusion

xudong963 commented 1 year ago

4. Migrate some tests from sql_integration

Others are good to me, but for this, I think maybe we don't migrate them first, I can directly add new tests from other DBs.

alamb commented 1 year ago

Others are good to me, but for this, I think maybe we don't migrate them first, I can directly add new tests from other DBs.

That would also be fine. One of the benfits of migrating a few of the tests from sql_integration would be to demonstrate that the harness could (eventually) subsume them all.

xudong963 commented 1 year ago

Hi @mvanschellebeeck, are you doing it?

mvanschellebeeck commented 1 year ago

Hi @mvanschellebeeck, are you doing it?

Yep! I can get out a draft PR today/tomorrow if that works with you?

xudong963 commented 1 year ago

@mvanschellebeeck Nice,no hurry, keep your step.

mvanschellebeeck commented 1 year ago

hey @xudong963, @alamb,

I've got a draft PR up here: https://github.com/apache/arrow-datafusion/pull/4395

In general, sqllogictests operate on a workflow where the .slt file creates tables, inserts values and then queries the values for expected results. From what I can tell, CREATE TABLE and INSERT INTO are not currently supported in datafusion so I raised #4396 and #4397 respectively. After they're implemented we'll be able to incorporate a lot of the existing sqllogictests.

I've started migrating some of the tests from aggregate.rs into an sqllogictest file, aggregate.slt. The queries here require a specific datafusion setup which I moved into setup.rs.

Some thoughts as I worked through this:

alamb commented 1 year ago

datafusion-cli wasn't that easy to use as a library without Command wizardry so I resorted to datafusion-core functions with format_batches and run_query at the bottom of this file

I think hooking it up as you have it is the right way to go. We can look into testing datafusion-cli specific stuff as a follow on project

sqllogictests can not test the output column names (that are currently testing in integration tests)

I think perhaps we could leave a few high level rust #test style tests for validating the column names and move the rest over

still figuring out a few TODOS in the aggregate.slt file

👍

alamb commented 1 year ago

I think now that we have merged https://github.com/apache/arrow-datafusion/pull/4395 we should close this PR and I will write a follow on ticket / epic to track follow on improvements. Thanks again @xudong963 @mvanschellebeeck and everyone else who chimed in

alamb commented 1 year ago

Filed https://github.com/apache/arrow-datafusion/issues/4460 for follow on work