apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.28k stars 1.19k forks source link

Build physical filter eagerly while constructing ParquetExec operator #2059

Open yjshen opened 2 years ago

yjshen commented 2 years ago

We should avoid leaking logical expression to physical operators.

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_optimizer/pruning.rs#L93-L101

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/file_format/parquet.rs#L77-L86

yjshen commented 2 years ago

@alamb could you please provide some context on why we are using Expr instead of PhyscialExpr for ParquetExec in the first place?

alamb commented 2 years ago

@alamb could you please provide some context on why we are using Expr instead of PhyscialExpr for ParquetExec in the first place?

I don't know of any original rationale -- I think @yordan-pavlov 's initial implementation was in terms of Expr

I think it would be fine to use physical exprs instead (as we can always translate from Expr --> PhysicalExpr)

yordan-pavlov commented 2 years ago

@yjshen when I first introduced parquet predicate push-down for Data Fusion in https://github.com/apache/arrow/pull/9064 I was focused on reusing existing code for evaluating dynamically generated expressions. I don't remember any particular reason for using logical instead of physical expressions. If you think using physical expressions (instead of logical) would make the code better, that's fine with me.

yjshen commented 2 years ago

Thanks @yordan-pavlov for the context! I am asking here to make sure this proposal won't break any hypothesis I'm not aware of.