apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Config the length of list when using In_list on parquet, rather than a const of 20. #8609

Open karellincoln opened 9 months ago

karellincoln commented 9 months ago

Is your feature request related to a problem or challenge?

When I use an In_list Expr with a list of length 19, the query takes 6 ms, but when the length grows to 20, it takes 200 ms.

Describe the solution you'd like

In build_predicate_expression, the list expression is only pushed down for pruning when in_list.list().len() < 20.

I would like to be able to configure this value.

Describe alternatives you've considered

I think we could add a config option to ParquetOptions and ParquetExec,

but that also seems ugly to me. Is there a more elegant implementation?

Additional context

No response

alamb commented 9 months ago

I think a more elegant solution would be to implement direct support in pruning for large IN lists -- the parameter you refer to is effectively rewriting such predicates into OR chains so the existing min/max based evaluation can work on them.
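To illustrate the rewrite described above, here is a minimal sketch in plain Rust. It does not use DataFusion's types: `MinMax`, `may_contain_eq`, `prune_in_list`, and the `max_list_len` parameter are hypothetical names made up for this example, and values are simplified to `i64`. It shows `col IN (v1, v2, ...)` being treated as `col = v1 OR col = v2 OR ...`, with each equality checked against a container's min/max, and no pruning at all once the list reaches the length limit.

```rust
/// Min/max statistics for one container (for example, a Parquet row group).
/// Hypothetical type for illustration only, not a DataFusion API.
#[derive(Debug, Clone, Copy)]
struct MinMax {
    min: i64,
    max: i64,
}

/// `col = value` can only match inside the container if `min <= value <= max`.
fn may_contain_eq(stats: MinMax, value: i64) -> bool {
    stats.min <= value && value <= stats.max
}

/// Evaluates `col IN (list)` as an OR chain of equality checks.
///
/// Returns `None` when the list has `max_list_len` or more entries: in that
/// case no pruning predicate is built and the container must be scanned.
/// Returns `Some(false)` when the container can be skipped, `Some(true)` otherwise.
fn prune_in_list(stats: MinMax, list: &[i64], max_list_len: usize) -> Option<bool> {
    if list.len() >= max_list_len {
        return None;
    }
    // OR chain: keep the container if any single value could be present.
    Some(list.iter().any(|&v| may_contain_eq(stats, v)))
}
```

With a limit of 20, a 19-element list can still rule out containers, while a 20-element list falls through to a full scan, which would be consistent with the jump in latency reported above.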

A config parameter is probably fine for the near term.

We have been recently improving the code in this area -- see https://github.com/apache/arrow-datafusion/pull/8440 for example. Maybe we can update the PruningPredicate logic to use the contained api more to rule out containers based on their min/max values

Specifically, we could figure out the min and max values in the list for contains and then compare the actual min/max values in the columns 🤔
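A rough sketch of that idea, again in plain Rust rather than DataFusion's actual API (`can_skip_container` is a hypothetical helper, and values are simplified to `i64`): compute the smallest and largest values in the IN list, then skip a container whenever that range does not overlap the column's min/max range.

```rust
/// Returns true when a container can be skipped for `col IN (list)`,
/// based only on the list's min/max and the column's min/max.
/// Hypothetical helper for illustration; not part of DataFusion.
fn can_skip_container(col_min: i64, col_max: i64, list: &[i64]) -> bool {
    match (list.iter().min(), list.iter().max()) {
        // The ranges [list_min, list_max] and [col_min, col_max] are disjoint,
        // so no list value can appear in this container.
        (Some(&list_min), Some(&list_max)) => list_max < col_min || list_min > col_max,
        // An empty list matches no rows at all.
        _ => true,
    }
}
```

The list bounds only need to be computed once per query, so the per-container cost no longer grows with the list length. The trade-off in this sketch is precision: for a list such as `[1, 100]` and a column range of `[10, 20]`, the ranges overlap and the container is kept, even though neither value can actually be present.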

karellincoln commented 9 months ago

Thanks for your advice. I'm looking forward to #8440.

alamb commented 9 months ago

> We have been recently improving the code in this area -- see https://github.com/apache/arrow-datafusion/pull/8440 for example. Maybe we can update the PruningPredicate logic to use the contained api more to rule out containers based on their min/max values

FYI I think @yahoNanJing is in the process of implementing this feature https://github.com/apache/arrow-datafusion/pull/8669