apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.04k stars 1.14k forks source link

Order of Interval Addition Should Affect Final Output #11055

Open vbarua opened 3 months ago

vbarua commented 3 months ago

Describe the bug

In various engines, the order in which intervals are added to dates can affect the final value. This is especially noticeable with leap years.

Datafusion appears to constant fold these intervals, which throws away the operation order.

> EXPLAIN SELECT
         DATE '2019-02-28' + INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB,
         DATE '2019-02-28' + INTERVAL '1 DAY' + INTERVAL '1 YEAR' AS MAR;
+---------------+----------------------------------------------------------------------+
| plan_type     | plan                                                                 |
+---------------+----------------------------------------------------------------------+
| logical_plan  | Projection: Date32("2020-02-29") AS feb, Date32("2020-02-29") AS mar |
|               |   EmptyRelation                                                      |
| physical_plan | ProjectionExec: expr=[2020-02-29 as feb, 2020-02-29 as mar]          |
|               |   PlaceholderRowExec                                                 |
|               |                                                                      |
+---------------+----------------------------------------------------------------------+

To Reproduce

Testing via datafusion-cli

> SELECT
  DATE '2019-02-28' + INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB,
  DATE '2019-02-28' + INTERVAL '1 DAY' + INTERVAL '1 YEAR' AS MAR;
+------------+------------+
| feb        | mar        |
+------------+------------+
| 2020-02-29 | 2020-02-29 |
+------------+------------+

Expected behavior

Due to leap year shenanigans, adding the year before the day results in a different date than adding the day before the year.

Trino emits

trino> SELECT
    ->   DATE '2019-02-28' + INTERVAL '1' YEAR + INTERVAL '1' DAY AS FEB,
    ->   DATE '2019-02-28' + INTERVAL '1' DAY + INTERVAL '1' YEAR AS MAR;
    FEB     |    MAR     
------------+------------
 2020-02-29 | 2020-03-01 
(1 row)

as does Postgres and Snowflake (based on their documentation for interval types where this example came from)

Additional context

No response

alamb commented 3 months ago

🤔 maybe we could turn of constant folding for adding intervals

Lordworms commented 3 months ago

take

Lordworms commented 3 months ago

This is caused by the adding ordering since the sqlparser would always generate a AST like

image

by only disabling constant folding for Interval would not help here since when we evaluate the PhysicalExpr, it would still do the calculation of the later two components first. I think a change in sqlparser would help? (like we generate the BinaryOP AST with first two components as left leaf and the last one as right leaf) not sure the best way to do it.

alamb commented 3 months ago

I agree with @Lordworms the root cause is related to the precidence rules in sqlparser which control if expressions like

  DATE '2019-02-28' + INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB,

are parsed like this

  ((DATE '2019-02-28' + INTERVAL '1 YEAR') + INTERVAL '1 DAY' AS FEB),

Or like

  (DATE '2019-02-28' + (INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB)),

Users who want to control the order of expression evaluation can use ( and ) to explicitly control the evaluation order.

Lordworms commented 3 months ago

I agree with @Lordworms the root cause is related to the precidence rules in sqlparser which control if expressions like

  DATE '2019-02-28' + INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB,

are parsed like this

  ((DATE '2019-02-28' + INTERVAL '1 YEAR') + INTERVAL '1 DAY' AS FEB),

Or like

  (DATE '2019-02-28' + (INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB)),

Users who want to control the order of expression evaluation can use ( and ) to explicitly control the evaluation order.

Thanks so much. Should I fix it in sqlparser?

alamb commented 3 months ago

Thanks so much. Should I fix it in sqlparser?

I think that is probably the best way -- though it is likely pretty tricky (likely related to operator precidence). I think the correct first step would be to do some research and see how such arithmetic is parsed by other databases