apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

SimplifyExpressions should not consider volatile expressions equal for rewrites #13060

Open eejbyfeldt opened 4 days ago

eejbyfeldt commented 4 days ago

Describe the bug

Currently we do not consider the volatility of expressions in SimplifyExpressions. This leads to rewrites that can change query results and cause unexpected behavior.

To Reproduce

Consider the following query:

> explain select * from VALUES (1), (2) where random() = 0 OR (column1 = 2 AND random() = 0);
+---------------+---------------------------------------------+
| plan_type     | plan                                        |
+---------------+---------------------------------------------+
| logical_plan  | Filter: random() = Float64(0)               |
|               |   Values: (Int64(1)), (Int64(2))            |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
|               |   FilterExec: random() = 0                  |
|               |     ValuesExec                              |
|               |                                             |
+---------------+---------------------------------------------+
2 row(s) fetched. 
Elapsed 0.013 seconds.

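The predicate gets simplified to random() = 0, even though the two random() calls are separate evaluations that can return different values, so the rewritten filter is not equivalent to the original.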
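To make the difference concrete, here is a minimal sketch that models the two predicates outside of DataFusion. The names original, simplified, and scripted are hypothetical helpers for illustration only, and the scripted closure is an assumed stand-in for random() that replays fixed values so the outcome is reproducible.

```rust
/// Original predicate: random() = 0 OR (column1 = 2 AND random() = 0).
/// Every call to `draw` stands in for one separate evaluation of random().
fn original<F: FnMut() -> f64>(column1: i64, mut draw: F) -> bool {
    draw() == 0.0 || (column1 == 2 && draw() == 0.0)
}

/// Predicate after the rewrite reported above: random() = 0.
/// Only one draw is left, and column1 no longer matters.
fn simplified<F: FnMut() -> f64>(_column1: i64, mut draw: F) -> bool {
    draw() == 0.0
}

/// Hypothetical stand-in for random(): replays the given values in order.
fn scripted(values: Vec<f64>) -> impl FnMut() -> f64 {
    let mut it = values.into_iter();
    move || it.next().expect("script exhausted")
}
```

For the row with column1 = 2 and the draws 0.5 followed by 0.0, original(2, scripted(vec![0.5, 0.0])) is true because the second draw satisfies the right-hand branch, while simplified(2, scripted(vec![0.5, 0.0])) is false because it only sees the first draw.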

Expected behavior

The predicate should not be simplified in a way that deduplicates the volatile expressions. Both random() calls should remain in the plan:

> explain select * from VALUES (1), (2) where random() = 0 OR (column1 = 2 AND random() = 0);
+---------------+----------------------------------------------------------------------------------+
| plan_type     | plan                                                                             |
+---------------+----------------------------------------------------------------------------------+
| logical_plan  | Filter: random() = Float64(0) OR column1 = Int64(2) AND random() = Float64(0)    |
|               |   Values: (Int64(1)), (Int64(2))                                                 |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                      |
|               |   FilterExec: random() = 0 OR column1@0 = 2 AND random() = 0                     |
|               |     ValuesExec                                                                   |
|               |                                                                                  |
+---------------+----------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.013 seconds.

Additional context

We cannot exclude volatile expressions from simplification outright, because we would still like to simplify, for example, the following predicate:

> explain select * from VALUES (1), (2) where column1 = 2 OR (column1 = 2 AND random() = 0);
+---------------+---------------------------------------------+
| plan_type     | plan                                        |
+---------------+---------------------------------------------+
| logical_plan  | Filter: column1 = Int64(2)                  |
|               |   Values: (Int64(1)), (Int64(2))            |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
|               |   FilterExec: column1@0 = 2                 |
|               |     ValuesExec                              |
|               |                                             |
+---------------+---------------------------------------------+
2 row(s) fetched. 
Elapsed 0.015 seconds.

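This rewrite is safe because it does not change the result: the duplicated term column1 = 2 is not volatile, so A OR (A AND B) still reduces to A no matter what the volatile term B evaluates to.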
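One possible shape for the rule, as a rough sketch only: apply the absorption rewrite A OR (A AND B) => A when the repeated term A is not volatile, and leave the expression alone otherwise. The Expr enum, the helper constructors, and simplify_or below are a toy model written for this issue. They are assumptions for illustration and not DataFusion's actual types or simplifier code.

```rust
/// Toy expression tree (hypothetical, not DataFusion's Expr).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    /// A function call; `volatile` marks functions such as random().
    Call { name: String, volatile: bool },
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// An expression is volatile if any sub-expression is volatile.
    fn is_volatile(&self) -> bool {
        match self {
            Expr::Column(_) | Expr::Literal(_) => false,
            Expr::Call { volatile, .. } => *volatile,
            Expr::Eq(l, r) | Expr::And(l, r) | Expr::Or(l, r) => {
                l.is_volatile() || r.is_volatile()
            }
        }
    }
}

// Small constructors to keep the examples readable.
fn col(name: &str) -> Expr { Expr::Column(name.to_string()) }
fn lit(v: i64) -> Expr { Expr::Literal(v) }
fn random() -> Expr { Expr::Call { name: "random".to_string(), volatile: true } }
fn eq(l: Expr, r: Expr) -> Expr { Expr::Eq(Box::new(l), Box::new(r)) }
fn and(l: Expr, r: Expr) -> Expr { Expr::And(Box::new(l), Box::new(r)) }
fn or(l: Expr, r: Expr) -> Expr { Expr::Or(Box::new(l), Box::new(r)) }

/// Absorption: A OR (A AND B) => A, applied only when A is not volatile.
/// Two structurally equal volatile terms are treated as different values.
fn simplify_or(expr: Expr) -> Expr {
    match expr {
        Expr::Or(a, rhs) => match *rhs {
            Expr::And(l, r) if !a.is_volatile() && (*l == *a || *r == *a) => *a,
            other => Expr::Or(a, Box::new(other)),
        },
        other => other,
    }
}
```

With this guard, column1 = 2 OR (column1 = 2 AND random() = 0) still collapses to column1 = 2, while random() = 0 OR (column1 = 2 AND random() = 0) is returned unchanged. The sketch only checks the left-hand operand of OR and a single level of nesting; a real rule would need to handle the mirrored and nested cases as well.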

Lordworms commented 10 hours ago

take