apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

[EPIC] A collection of items to improve DataFusion stability (reduce effort required to upgrade) #13648

Open alamb opened 9 hours ago

alamb commented 9 hours ago

Is your feature request related to a problem or challenge?

This is broken out from a more general ticket here

🥳 In my opinion DataFusion is now good enough (performance and feature wise) for many people to have built real systems and products.

However, as more people build "real" systems using DataFusion, our historic "move fast and break things and hope you can keep up" mentality likely needs to adjust to a more mature "move as fast as possible, but minimize breakages" type response.

My summary of the discussion on https://github.com/apache/datafusion/issues/13525 from @findepi @scsmithr @waynexia @timsaucer @Rachelint @Omega359 @jonmmease @Dandandan and @andygrove was that many existing heavy users of DataFusion spend a lot of time during upgrades from one DataFusion release to another

Specifically, I think the core challenge I heard was NOT the mechanical API changes required, but the effort required to diagnose more subtle issues such as:

Describe the solution you'd like

I would like to improve the ease of upgrading DataFusion versions

There are many ways to do so and I would like to use this ticket to capture / organize the work in this area

Related Items

Additional testing

More Context:

jonathanc-n commented 9 hours ago

Yes, I think one of the comments in that discussion mentioned that certain changes that would cause breakage should be mentioned in every release. So before release, we should list out the possible changes that would need to be made if an upgrade were to happen during the development process.

alamb commented 9 hours ago

So before release, we should list out the possible changes that would need to be made if an upgrade were to happen during the development process.

I think it is a great idea. The challenge will be identifying such changes, I think.

comphead commented 8 hours ago

MariaDB has an interesting approach: they generate queries with different syntaxes to find regressions. Basically we can take their main.sql file, which is 7MB of different queries including join queries, and adapt it to DF.

There is no answer check, just a smoke test that each query runs successfully.

The example can be found at https://github.com/mariadb-corporation/mariadb-qa/tree/master/pquery

@alamb WDYT? It looks like low-hanging fruit: we can take the file and run it in the latest DataFusion CLI as part of CI or the major release verification process.
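A minimal sketch of what such a smoke test could look like, assuming a query corpus stored as one semicolon-separated SQL script. The names `split_statements`, `smoke_test`, and `run_query` are hypothetical and are not part of DataFusion; `run_query` stands in for whatever actually executes a statement (for example a call into the DataFusion CLI or a session object), and the splitter assumes no semicolons occur inside string literals or comments.

```python
from typing import Callable, Iterable, List, Tuple


def split_statements(sql_text: str) -> List[str]:
    """Naively split a SQL script on ';' and drop empty fragments.

    Assumption: no semicolons appear inside string literals or comments.
    """
    return [stmt.strip() for stmt in sql_text.split(";") if stmt.strip()]


def smoke_test(
    statements: Iterable[str],
    run_query: Callable[[str], None],
) -> List[Tuple[str, str]]:
    """Run every statement and collect (statement, error) pairs for failures.

    Results are never compared against expected answers: a statement passes
    if run_query returns without raising.
    """
    failures = []
    for stmt in statements:
        try:
            run_query(stmt)
        except Exception as exc:  # any error counts as a smoke-test failure
            failures.append((stmt, str(exc)))
    return failures


if __name__ == "__main__":
    # Hypothetical stand-in for a real engine call: it rejects one keyword.
    def fake_engine(sql: str) -> None:
        if "BOGUS" in sql.upper():
            raise ValueError("unsupported syntax")

    script = "SELECT 1; SELECT a FROM t JOIN u ON t.id = u.id; SELECT BOGUS(1);"
    for stmt, err in smoke_test(split_statements(script), fake_engine):
        print(f"FAILED: {stmt!r}: {err}")
```

Collecting failures instead of stopping at the first one means a single run can report every statement that stopped working between two releases, which is the kind of regression signal the smoke test is meant to provide.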

alamb commented 8 hours ago

@alamb WDYT? It looks like low-hanging fruit: we can take the file and run it in the latest DataFusion CLI as part of CI or the major release verification process.

I think in general the more testing we have the better. This idea sounds good to me -- I think more fully leveraging @2010YOUY01 's integration into sqlancer is also quite interesting.

Let's try and write some tickets to capture these ideas too - I can spend some time working on this over the next day or two

Omega359 commented 8 hours ago

MariaDB has an interesting approach: they generate queries with different syntaxes to find regressions. Basically we can take their main.sql file, which is 7MB of different queries including join queries, and adapt it to DF.

There is no answer check, just a smoke test that each query runs successfully.

The example can be found at https://github.com/mariadb-corporation/mariadb-qa/tree/master/pquery

@alamb WDYT? It looks like low-hanging fruit: we can take the file and run it in the latest DataFusion CLI as part of CI or the major release verification process.

Almost absolutely NOT. https://github.com/mariadb-corporation/mariadb-qa/blob/master/LICENSE.md

https://www.apache.org/legal/resolved.html#category-x

comphead commented 8 hours ago

That's frustrating. Let's see if sqlancer can generate something similar.