Open jiangzhx opened 1 year ago
i'm not sure,does this pr https://github.com/apache/arrow-datafusion/pull/5419 cause this issue.
I agree that https://github.com/apache/arrow-datafusion/pull/5419 likely caused this issue.
I think the fix is to simply ignore such errors when creating pruning predicates (aka when the predicate is not supported).
What do you think @crepererum ?
To me this looks like a bug. Who's trying to push down / apply a sub-query predicate to a parquet file read? Shouldn't the logical optimizer remove these kind of expressions?
@crepererum I think you're right, just your PR let this issue expose.
these Exists, InSubquery, ScalarSubquery Expr did not complete the corresponding processing in the create_physical_expr method of planner.rs. https://github.com/apache/arrow-datafusion/blob/a1c60a1ba98e089d7551637f2a78663e66772d88/datafusion/physical-expr/src/planner.rs#L501-L503
Before your PR takes effect, it also does not actually handle the subquery scene at datasource https://github.com/apache/arrow-datafusion/blob/a1c60a1ba98e089d7551637f2a78663e66772d88/datafusion/core/src/datasource/file_format/mod.rs#L83-L89
so, i think the way is to optimize non-correlated subquery at decorrelate_where_exists optimizer https://github.com/apache/arrow-datafusion/blob/a1c60a1ba98e089d7551637f2a78663e66772d88/datafusion/optimizer/src/decorrelate_where_exists.rs#L185-L191
the current decorrelate_where_exists only optimize
SELECT t1.id FROM t1
WHERE exists
(
SELECT t2.id FROM t2 WHERE t1.id = t2.id
)
/// and optimizes it into:
SELECT t1.id
FROM t1 LEFT SEMI
JOIN t2
ON t1.id = t2.id
we need process ScalarSubquery in the create_physical_expr and then rewrite non-correlated exists subquery to ScalarSubquery
WHERE EXISTS (SELECT A FROM TABLE B WHERE COL1 > 10)
///will be rewritten to
WHERE (SELECT 1 FROM (SELECT A FROM TABLE B WHERE COL1 > 10) LIMIT 1) IS NOT NULL
For the record this still happens:
(venv-310) andrewlamb@Andrews-MacBook-Pro:~/Downloads$ datafusion-cli
DataFusion CLI v35.0.0
❯ CREATE EXTERNAL TABLE t1 (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
0 rows in set. Query took 0.030 seconds.
❯ CREATE EXTERNAL TABLE t2 (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
0 rows in set. Query took 0.001 seconds.
❯ SELECT a, b FROM t1 WHERE EXISTS (SELECT count(*) FROM t2);
This feature is not implemented: Physical plan does not support logical expression Exists(Exists { subquery: <subquery>, negated: false })
❯
Describe the bug
DataFusion v21.0.0
return: This feature is not implemented: Physical plan does not support logical expression EXISTS ()
worked branch 19.0.0
To Reproduce
create data.csv
echo "1,2" > data.csv
use datafusion-cli
Expected behavior
No response
Additional context
No response