SQL case, This feature is not implemented: Physical plan does not support logical expression EXISTS (<subquery>)

jiangzhx commented 1 year ago

Describe the bug

DataFusion v21.0.0

CREATE EXTERNAL TABLE t1 (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
CREATE EXTERNAL TABLE t2 (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
SELECT a, b FROM t1 WHERE EXISTS (SELECT count(*) FROM t2);

return: This feature is not implemented: Physical plan does not support logical expression EXISTS ()

worked branch 19.0.0

To Reproduce

create data.csv

echo "1,2" > data.csv

use datafusion-cli


CREATE EXTERNAL TABLE t1 (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
CREATE EXTERNAL TABLE t2 (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
SELECT a, b FROM t1 WHERE EXISTS (SELECT count(*) FROM t2);

Expected behavior

No response

Additional context

No response

jiangzhx commented 1 year ago

i'm not sure,does this pr https://github.com/apache/arrow-datafusion/pull/5419 cause this issue.

alamb commented 1 year ago

I agree that https://github.com/apache/arrow-datafusion/pull/5419 likely caused this issue.

I think the fix is to simply ignore such errors when creating pruning predicates (aka when the predicate is not supported).

What do you think @crepererum ?

crepererum commented 1 year ago

To me this looks like a bug. Who's trying to push down / apply a sub-query predicate to a parquet file read? Shouldn't the logical optimizer remove these kind of expressions?

jiangzhx commented 1 year ago

@crepererum I think you're right, just your PR let this issue expose.

these Exists, InSubquery, ScalarSubquery Expr did not complete the corresponding processing in the create_physical_expr method of planner.rs. https://github.com/apache/arrow-datafusion/blob/a1c60a1ba98e089d7551637f2a78663e66772d88/datafusion/physical-expr/src/planner.rs#L501-L503

Before your PR takes effect, it also does not actually handle the subquery scene at datasource https://github.com/apache/arrow-datafusion/blob/a1c60a1ba98e089d7551637f2a78663e66772d88/datafusion/core/src/datasource/file_format/mod.rs#L83-L89

so, i think the way is to optimize non-correlated subquery at decorrelate_where_exists optimizer https://github.com/apache/arrow-datafusion/blob/a1c60a1ba98e089d7551637f2a78663e66772d88/datafusion/optimizer/src/decorrelate_where_exists.rs#L185-L191

the current decorrelate_where_exists only optimize

SELECT t1.id FROM t1
WHERE exists
(
   SELECT t2.id FROM t2 WHERE t1.id = t2.id
)
/// and optimizes it into:
SELECT t1.id
FROM t1 LEFT SEMI
JOIN t2
ON t1.id = t2.id

we need process ScalarSubquery in the create_physical_expr and then rewrite non-correlated exists subquery to ScalarSubquery

WHERE EXISTS (SELECT A FROM TABLE B WHERE COL1 > 10)
///will be rewritten to
WHERE (SELECT 1 FROM (SELECT A FROM TABLE B WHERE COL1 > 10) LIMIT 1) IS NOT NULL

alamb commented 5 months ago

For the record this still happens:

(venv-310) andrewlamb@Andrews-MacBook-Pro:~/Downloads$ datafusion-cli
DataFusion CLI v35.0.0
❯ CREATE EXTERNAL TABLE t1 (a INT, b INT) STORED AS CSV LOCATION 'data.csv';

0 rows in set. Query took 0.030 seconds.

❯ CREATE EXTERNAL TABLE t2 (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
0 rows in set. Query took 0.001 seconds.

❯ SELECT a, b FROM t1 WHERE EXISTS (SELECT count(*) FROM t2);

This feature is not implemented: Physical plan does not support logical expression Exists(Exists { subquery: <subquery>, negated: false })
❯

apache / datafusion