apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Add fuzzing / random SQL testing #913

Closed alamb closed 2 weeks ago

alamb commented 3 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In a past project, we had a harness that could generate random SQL queries and it found many bugs -- such tests are a wonderful way to help database software mature. Applying such technology to DataFusion would be very cool.

On https://github.com/sqlparser-rs/sqlparser-rs/pull/312#discussion_r693298728, @PsiACE points to a good blog post from CockroachDB that describes such testing: https://www.cockroachlabs.com/blog/sqlsmith-randomized-sql-testing/

One of the tools mentioned is https://github.com/anse1/sqlsmith

Describe the solution you'd like

Add a script / way to run SQLSmith against DataFusion. As described in the blog post, this might require modifying SQLSmith to restrict itself to the subset of PostgreSQL that DataFusion supports.

I would suggest we don't put this in CI initially until someone has the bandwidth to review the results, but getting scripts set up that could be run would be a great first step.
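To make the idea of restricting a generator to a supported subset concrete, here is a minimal sketch of a random query generator in Python. It is not SQLSmith and does not reflect DataFusion's actual supported grammar: the schema, the operator list, and the aggregate list are all hypothetical placeholders that someone would replace with whatever the engine under test accepts.

```python
import random

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = {"t1": ["a", "b", "c"], "t2": ["x", "y"]}
# Assumed "supported subset": only these operators and aggregates are emitted.
COMPARISONS = ["=", "<>", "<", "<=", ">", ">="]
AGGREGATES = ["COUNT", "MIN", "MAX", "SUM"]


def random_predicate(rng, columns, depth=0):
    """Build a random boolean expression over the given columns."""
    if depth < 2 and rng.random() < 0.3:
        # Occasionally nest two sub-predicates, with a bounded depth.
        left = random_predicate(rng, columns, depth + 1)
        right = random_predicate(rng, columns, depth + 1)
        return f"({left} {rng.choice(['AND', 'OR'])} {right})"
    return f"{rng.choice(columns)} {rng.choice(COMPARISONS)} {rng.randint(-10, 10)}"


def random_query(rng):
    """Generate one random SELECT statement from the restricted grammar."""
    table = rng.choice(sorted(SCHEMA))
    columns = SCHEMA[table]
    if rng.random() < 0.5:
        # Aggregate query grouped by a single column.
        key = rng.choice(columns)
        agg = f"{rng.choice(AGGREGATES)}({rng.choice(columns)})"
        sql = f"SELECT {key}, {agg} FROM {table}"
        tail = f" GROUP BY {key}"
    else:
        # Plain projection of a random non-empty subset of columns.
        projection = rng.sample(columns, rng.randint(1, len(columns)))
        sql = f"SELECT {', '.join(projection)} FROM {table}"
        tail = ""
    if rng.random() < 0.7:
        sql += f" WHERE {random_predicate(rng, columns)}"
    return sql + tail


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a failing query can be reproduced
    for _ in range(5):
        print(random_query(rng))
```

Seeding the generator explicitly matters more than the grammar itself: a query that triggers a crash or a wrong result is only useful if it can be regenerated later.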

Describe alternatives you've considered

I haven't done research into alternatives to SQLSmith.

Additional context

https://github.com/sqlparser-rs/sqlparser-rs/pull/312

andygrove commented 3 years ago

This paper would be worth a read too for anyone interested to learn how Databricks uses query fuzzing with Spark.

I have been doing some query fuzzing myself in my day job, to compare Spark with Spark on GPU (using the RAPIDS Accelerator for Apache Spark). My approach there was to generate logical query plans directly (via Spark's DataFrame API).
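The comparison side of this kind of fuzzing (run the same query on two engines and flag any difference) can be sketched roughly as follows. This is not the harness described above: two in-memory SQLite databases stand in for the reference and candidate engines, and the function names and table layout are assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def make_engine(rows):
    """Create a stand-in engine: an in-memory SQLite database with one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?)", rows)
    return conn


def run(conn, sql):
    """Run a query and return its rows as a multiset, ignoring row order."""
    return Counter(conn.execute(sql).fetchall())


def differential_check(reference, candidate, queries):
    """Return the queries whose results differ between the two engines."""
    mismatches = []
    for sql in queries:
        if run(reference, sql) != run(candidate, sql):
            mismatches.append(sql)
    return mismatches


if __name__ == "__main__":
    rows = [(1, 2), (3, 4), (1, 2)]
    reference = make_engine(rows)
    candidate = make_engine(list(reversed(rows)))  # same data, different order
    print(differential_check(reference, candidate, ["SELECT a, b FROM t1"]))
```

Results are compared as multisets because a query without ORDER BY gives no ordering guarantee, so two correct engines may legitimately return rows in different orders.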

I had been contemplating doing something similar with DataFusion/Ballista: generating random plans in Rust, encoding them to protobuf using the Ballista serde module, and then writing Scala code to read these protobuf files and translate them to Spark plans. I have an old proof-of-concept of some of this already in my How Query Engines Work repo.

With the new Arrow Compute IR proposal, an approach along these lines would be useful for having fuzzing tools that work across Arrow implementations as well.

andygrove commented 2 years ago

I have started work on fuzzing SQL and data using https://github.com/andygrove/sqlfuzz. I plan to eventually add tests to this project, but for now I am doing this separately. It has already been effective in finding bugs.

alamb commented 2 weeks ago

Given https://github.com/apache/datafusion/issues/11030 / https://github.com/datafusion-contrib/datafusion-sqlancer from @2010YOUY01, I think I am going to claim this issue is closed.

Thanks again @2010YOUY01