apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.48k stars 1.01k forks source link

Avoid cast numbers to string in filter #9671

Open waynexia opened 3 months ago

waynexia commented 3 months ago

Is your feature request related to a problem or challenge?

It's not always an expected behavior that casts a number to string in places like WHERE. E.g.:

Create a table with an integer and a string columns.

create table t as (values (1,'1'), (2,'2'),(3,'3'));

Behavior for queries like the following is clear, as the provided constant has the same type with column:

select * from t where column1 > 2;
select * from t where column2 > '2';

But when comparing column1 against a string, like '2' or 'a', type_coercion will add a cast to the column, resulting a plan like this:

❯ explain select * from t where column1 > '2';
+---------------+---------------------------------------------------+
| plan_type     | plan                                              |
+---------------+---------------------------------------------------+
| logical_plan  | Filter: CAST(t.column1 AS Utf8) > Utf8("2")       |
|               |   TableScan: t projection=[column1, column2]      |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192       |
|               |   FilterExec: CAST(column1@0 AS Utf8) > 2         |
|               |     MemoryExec: partitions=1, partition_sizes=[1] |
|               |                                                   |
+---------------+---------------------------------------------------+

This behavior is a little confusing. It will not only slow the execution (filter pushdown in parquet won't work), but the result will also be surprising (12 < '2' because we are comparing two strings).

Describe the solution you'd like

Do not coerce everything to string when comparing another type with it. We should be careful when coercion happens across two type families (like integer to unsigned integer, or float to string etc.).

My proposal in this case is to consider the type of column first. E.g., try to coerce the string literal into the type of column. And throw an error this coercion fails. This can prevent us from executing an incorrect plan in silence. As user can always add their own specific CAST if they want to do string comparison.

This is also closer to PG's behavior:

postgres=# select 12 > '2';
 ?column? 
----------
 t
(1 row)

postgres=# select 12 > 'a';
ERROR:  invalid input syntax for type integer: "a"
LINE 1: select 12 > 'a';

Describe alternatives you've considered

MySQL would also try to cast string to integer. But it won't throw an error when this cast fails. Instead, it would failover the string value to the default number 0 which is more strange...

An alternative is when the cast fails, we fallback to upcast. I.e., select 12 > '2' would treat '2' as integer and select 12 > 'a' would cast 12 into '12'. This can reduce some breaking cases.

Additional context

String is the "biggest" type in coercion, but in this case I realize it's not always acceptable to cast something to string. I would start working on type_coercion rule. I'm not sure if there are any other places that have a similar issue.

And BTW this would be a breaking change.

waynexia commented 3 months ago

PG won't do the cast even if the column type is string and literal is number:

postgres=# select * from t where column2 > 1;
ERROR:  operator does not exist: text > integer
LINE 1: select * from t where column2 > 1;
                                      ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
waynexia commented 3 months ago

Exprs like SELECT '1' IN ('a','b',1) from datafusion/sqllogictest/test_files/expr.slt would also fail

alamb commented 3 months ago

Given I suspect there is no one agreed upon "standard" for this particular behavior, perhaps we can add a configuration option to control if strings are coerced to numeric types 🤔

waynexia commented 3 months ago

Yes, I'm afraid we have to provide both behaviors controlled by a configuration. Otherwise this breaks lots of things.