apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.2k stars 1.17k forks source link

Evaluates COALESCE arguments past first non-NULL value #8910

Open tv42 opened 9 months ago

tv42 commented 9 months ago

Describe the bug

Datafusion evaluates COALESCE(a, b, c) expressions even past the first non-NULL value, and bubbles up runtime errors from them.

This is wrong because the COALESCE could be explicitly protecting against e.g. the divide-by-zero case, and differs from other sql engines.

To Reproduce

❯ SELECT COALESCE(1, 2/0);
Optimizer rule 'simplify_expressions' failed
caused by
Arrow error: Divide by zero error

Expected behavior

COALESCE to return the first non-NULL value, not bubble up errors from the rest.

Compare to SQLite:

sqlite> SELECT COALESCE(1, 2/0);
1

Compare to Postgres:

postgres=# SELECT COALESCE(1, 2/0);
 coalesce
----------
        1
(1 row)

Additional context

No response

tv42 commented 9 months ago

This might share the root cause with #8909.

alamb commented 9 months ago

This may be fixed by https://github.com/apache/arrow-datafusion/pull/8928

jackwener commented 9 months ago

This may be fixed by https://github.com/apache/arrow-datafusion/pull/8928

This issue is related with #8928 but is different with it.

Let me have explain this problem: It's cause by expression eval.

When we try to fold expression, we will eval expression. When we eval expression, it may occur Runtime Error. But short-circuit expression can be short-circuit in Runtime, so I think we should skip some error and then use original (a possible solution) expression when fold short-circuit expression.

8928 avoid short-circuit expression to extract common expression, but can't avoid fold expression

how do you think about it? @alamb @liukun4515

BTW, Are @haohuaijin interested in this? If yes, when you meet some problem or need review, please feel free to @me.

haohuaijin commented 9 months ago

Thank you @jackwener. I am interested in it. I have two solutions.

First

We can first fold the expression whenever, and if we encounter an error, return the origin expression in the below code, and the error will occur in runtime if it is a real error(not the short-circuit problem). https://github.com/apache/arrow-datafusion/blob/2b218be67a6c412629530b812836a6cec76efc32/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L282-L287

Second

don't simplify short-circuit expressions' all sub-expressions.


I think this issue is also related to https://github.com/apache/arrow-datafusion/issues/8927, because the scalar function first evaluates all args and then execution the actual function, so even though we don't fold the expression, the query in this issue will also throw error. I comment out SimplifyExpressions in the main branch and got the following result.

❯ SELECT COALESCE(1, 2/0);
Arrow error: Divide by zero error
❯ explain SELECT COALESCE(1, 2/0);
+---------------+-------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                |
+---------------+-------------------------------------------------------------------------------------+
| logical_plan  | Projection: coalesce(Int64(1), Int64(2) / Int64(0))                                 |
|               |   EmptyRelation                                                                     |
| physical_plan | ProjectionExec: expr=[coalesce(1, 2 / 0) as coalesce(Int64(1),Int64(2) / Int64(0))] |
|               |   PlaceholderRowExec                                                                |
|               |                                                                                     |
+---------------+-------------------------------------------------------------------------------------+
2 rows in set. Query took 0.007 seconds.
alamb commented 9 months ago

We can first fold the expression whenever, and if we encounter an error, return the origin expression in the below code, and the error will occur in runtime if it is a real error(not the short-circuit problem).

This makes sense to me @haohuaijin and @jackwener - I also played around with posgres and this seems consistent with what it does.