apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.89k stars 1.11k forks source link

binary_op should be handled by sql_expr_to_logical_expr.. This was likely caused by a bug in DataFusion's code #10310

Closed Abdullahsab3 closed 1 day ago

Abdullahsab3 commented 4 months ago

Describe the bug

When casting a string to a timestamp with time zone as an argument, I get the following error:


Error while planning query: Internal error: binary_op should be handled by sql_expr_to_logical_expr..
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker:

To Reproduce

The query that I used:

select 
    date_bin(interval '1 hour', time at time zone 'Europe/Brussels') 
from 
    raw_data 
where 
    time >= '2019-03-27T22:00:00.000Z'::timestamp at time zone 'Europe/Brussels'
and 
    time <= '2025-03-26T00:00:00.000Z'::timestamp at time zone 'Europe/Brussels'

Expected behavior

It should be casted correctly with no issues

Additional context

No response

alamb commented 4 months ago

Thank you for the report -- this definitely looks like a bug to me

jayzhan211 commented 4 months ago

An example to reproduce.

statement ok
create table t1 as values (
    date_bin(interval '1 hour', '2022-08-03 14:38:50Z' at time zone 'Europe/Brussels'), 
    date_bin(interval '1 hour', '2022-08-03 14:38:50Z' at time zone 'Europe/Brussels'));

query error
select * from t1 where column1 >= '2019-03-27T22:00:00.000Z'::timestamp at time zone 'Europe/Brussels';
----
DataFusion error: Internal error: binary_op should be handled by sql_expr_to_logical_expr..
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

sql we got in sql_expr_to_logical_expr

sql: AtTimeZone { timestamp: BinaryOp { left: Identifier(Ident { value: "column1", quote_style: None }), op: GtEq, right: Cast { expr: Value(SingleQuotedString("2019-03-27T22:00:00.000Z")), data_type: Timestamp(None, None), format: None } }, time_zone: "Europe/Brussels" }

We need to support AtTimeZone, and maybe timezone casting 🤔 ?

Related code is in https://github.com/apache/datafusion/blob/286eb3445d1b78661c32fea4c14ac5c59559f82d/datafusion/sql/src/expr/mod.rs#L69-L98

Abdullahsab3 commented 4 months ago

Thanks for the deep dive @jayzhan211! Could it be that the expression is not getting correctly parsed in this instance? Executing this query for example:

select * from t1 where column1 >= ('2019-03-27T22:00:00.000Z'::timestamp at time zone 'Europe/Brussels');

seems to be working fine with no issues

jayzhan211 commented 4 months ago

Yes, you are right!

This is one with ()

sql: BinaryOp { left: Identifier(Ident { value: "column1", quote_style: None }), op: GtEq, right: Nested(AtTimeZone { timestamp: Cast { expr: Value(SingleQuotedString("2019-03-27T22:00:00.000Z")), data_type: Timestamp(None, None), format: None }, time_zone: "Europe/Brussels" }) }
sql: AtTimeZone { timestamp: Cast { expr: Value(SingleQuotedString("2019-03-27T22:00:00.000Z")), data_type: Timestamp(None, None), format: None }, time_zone: "Europe/Brussels" }

This one is without ()

sql: AtTimeZone { timestamp: BinaryOp { left: Identifier(Ident { value: "column1", quote_style: None }), op: GtEq, right: Cast { expr: Value(SingleQuotedString("2019-03-27T22:00:00.000Z")), data_type: Timestamp(None, None), format: None } }, time_zone: "Europe/Brussels" }

I think it is parsed as (column1 >= '2019-03-27T22:00:00.000Z'::timestamp) at time zone 'Europe/Brussels' which ideally should be parsed as column1 >= ('2019-03-27T22:00:00.000Z'::timestamp at time zone 'Europe/Brussels')

dmitrybugakov commented 4 months ago

@jayzhan211

I understand that the issue runs a bit deeper as we employ sqlparser for converting SQL queries into statements.

"SELECT c FROM t WHERE c >= '2019-03-27T22:00:00.000Z'::timestamp at time zone 'Europe/Brussels'"

Processing stmts from tokenizer:

[Statement(Query(Query { with: None, body: Select(Select { distinct: None, top: None, projection: [UnnamedExpr(Identifier(Ident { value: "c", quote_style: None }))], into: None, from: [TableWithJoins { relation: Table { name: ObjectName([Ident { value: "t", quote_style: None }]), alias: None, args: None, with_hints: [], version: None, partitions: [] }, joins: [] }], lateral_views: [], selection: Some(AtTimeZone { timestamp: BinaryOp { left: Identifier(Ident { value: "c", quote_style: None }), op: GtEq, right: Cast { expr: Value(SingleQuotedString("2019-03-27T22:00:00.000Z")), data_type: Timestamp(None, None), format: None } }, time_zone: "Europe/Brussels" }), group_by: Expressions([]), cluster_by: [], distribute_by: [], sort_by: [], having: None, named_window: [], qualify: None, value_table_mode: None }), order_by: [], limit: None, limit_by: [], offset: None, fetch: None, locks: [], for_clause: None }))]

https://github.com/apache/datafusion/blob/dac2a7edc9eefc8b5c5ddf7320a8fbade432a1dd/datafusion/sql/src/parser.rs#L315

jayzhan211 commented 4 months ago

@jayzhan211

I understand that the issue runs a bit deeper as we employ sqlparser for converting SQL queries into statements.

"SELECT c FROM t WHERE c >= '2019-03-27T22:00:00.000Z'::timestamp at time zone 'Europe/Brussels'"

Processing stmts from tokenizer:

[Statement(Query(Query { with: None, body: Select(Select { distinct: None, top: None, projection: [UnnamedExpr(Identifier(Ident { value: "c", quote_style: None }))], into: None, from: [TableWithJoins { relation: Table { name: ObjectName([Ident { value: "t", quote_style: None }]), alias: None, args: None, with_hints: [], version: None, partitions: [] }, joins: [] }], lateral_views: [], selection: Some(AtTimeZone { timestamp: BinaryOp { left: Identifier(Ident { value: "c", quote_style: None }), op: GtEq, right: Cast { expr: Value(SingleQuotedString("2019-03-27T22:00:00.000Z")), data_type: Timestamp(None, None), format: None } }, time_zone: "Europe/Brussels" }), group_by: Expressions([]), cluster_by: [], distribute_by: [], sort_by: [], having: None, named_window: [], qualify: None, value_table_mode: None }), order_by: [], limit: None, limit_by: [], offset: None, fetch: None, locks: [], for_clause: None }))]

https://github.com/apache/datafusion/blob/dac2a7edc9eefc8b5c5ddf7320a8fbade432a1dd/datafusion/sql/src/parser.rs#L315

yes, I think we should fix sqlparser to get ('2019-03-27T22:00:00.000Z'::timestamp at time zone 'Europe/Brussels') as the right hand side of binary operation

Abdullahsab3 commented 1 day ago

I think this issue is fixed now. To reproduce:

> create or replace table t AS
VALUES
  ('2024-04-30T21:30:00'),
  ('2024-04-30T22:30:00'),
  ('2024-04-30T23:30:00'),
  ('2024-05-01T00:00:00'),
  ('2024-05-01T00:30:00'),
  ('2024-05-01T10:30:00'),
  ('2024-05-01T20:30:00')
;
> select * from t where column1 >= '2019-03-27T22:00:00.000Z'::timestamp  at time zone 'UTC' at time zone 'Europe/Brussels';
+---------------------+
| column1             |
+---------------------+
| 2024-04-30T21:30:00 |
| 2024-04-30T22:30:00 |
| 2024-04-30T23:30:00 |
| 2024-05-01T00:00:00 |
| 2024-05-01T00:30:00 |
| 2024-05-01T10:30:00 |
| 2024-05-01T20:30:00 |
+---------------------+
7 row(s) fetched.
Elapsed 0.002 seconds.

> select * from t where column1 >= '2019-03-27T22:00:00.000Z'::timestamp at time zone 'Europe/Brussels';
+---------------------+
| column1             |
+---------------------+
| 2024-04-30T21:30:00 |
| 2024-04-30T22:30:00 |
| 2024-04-30T23:30:00 |
| 2024-05-01T00:00:00 |
| 2024-05-01T00:30:00 |
| 2024-05-01T10:30:00 |
| 2024-05-01T20:30:00 |
+---------------------+
7 row(s) fetched.
Elapsed 0.021 seconds.

Thank you @dmitrybugakov and @jayzhan211 !