apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.92k stars 1.12k forks source link

Eliminate Unsatisfiable Boolean Expressions #1716

Open tustvold opened 2 years ago

tustvold commented 2 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I noticed Datafusion is not able to simplify the expression

A ^ !A

Whilst it is unlikely for a user to write such an expression, it is not uncommon for boolean expressions to reduce to something unsatisfiable. Where this is the case, substituting in false may allow for further simplification.

Describe the solution you'd like

It would be awesome if Datafusion was able to detect and eliminate unsatisfiable boolean expressions. I believe this is an NP-hard problem in the general case, but should be tractable for a low number of variables.

tustvold commented 2 years ago

On a related note it is unable to simplify expressions of the form

A ^ (!A v B)
Dandandan commented 2 years ago

Related to this, I think the PoC Tokomak optimizer I created is especially suited for these kind of optimizations. https://github.com/Dandandan/datafusion-tokomak

@pjmore extended this optimizer to be able to optimize both expressions and logical plan nodes. See here for more: https://github.com/apache/arrow-datafusion/pull/1485

tustvold commented 2 years ago

I'm not at all familiar with e-graphs, but I would perhaps naively expect something specialised to boolean algebra to significantly out-perform a method based on arbitrary expression rewriting? I guess I'm thinking something like BDDs, or even just recursively applying shannon expansion and relying on const propagation to fix the mess...

This isn't an area I'm hugely familiar with, but I can't help feeling it should just be the case of picking an off-the-shelf boolean expression minimizer and plugging it in :sweat_smile:

FWIW a quick google search found this - https://docs.rs/boolean_expression/latest/boolean_expression/index.html

alamb commented 2 years ago

I do wonder if https://docs.rs/boolean_expression/latest/boolean_expression/index.html handles Nulls 🤔 which is an important niche case for database / sql

alamb commented 2 years ago

FWIW (A ^ !A) is NOT always false because of .... 🥁 NULL

alamb=# select null AND (not null);
 ?column? 
----------

(1 row)
alamb commented 2 years ago

We can do this rewrite

A ^ !A = false (iff A is not nullable)
tustvold commented 2 years ago

Does something similar also hold for things like A AND false? Just wondering if the existing logic handles this correctly... 🤔

alamb commented 2 years ago

Does something similar also hold for things like A AND false? Just wondering if the existing logic handles this correctly... 🤔

Yes. In fact @NGA-TRAN fixed a bug related to that in https://github.com/apache/arrow-datafusion/pull/1245 -- I think we have it sufficiently tested now.

See some of the rules here https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/simplify_expressions.rs#L568

It actually turn out that A AND false is always false (even if A is NULL)

However, A OR false could be NULL or false

🤯

tustvold commented 2 years ago

Aah yes, the SQL interpretation of NULL as "unknown" vs perhaps more intuitive notions of it... :sweat_smile:

null == null --> null will never cease to make no sense to me...

jackwener commented 2 years ago

It's depend on #2254