Open karellincoln opened 9 months ago
I think a more elegant solution would be to implement direct support in pruning for large IN
lists -- the parameter you refer to is effectively rewriting such predicates into OR chains so the existing min/max based evaluation can work on them.
A config parameter is probably fine for the near term.
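To make the OR-chain idea concrete, here is a minimal standalone sketch, not DataFusion's actual code. It assumes an integer column and a hypothetical `ContainerStats` type holding the min/max of one container (for example a row group). The names `ContainerStats`, `eq_may_match`, and `in_list_may_match` are made up for illustration.

```rust
/// Hypothetical min/max statistics for one container (e.g. a row group).
#[derive(Debug, Clone, Copy)]
struct ContainerStats {
    min: i64,
    max: i64,
}

/// `col = v` can only be true somewhere in the container if min <= v <= max.
fn eq_may_match(stats: ContainerStats, v: i64) -> bool {
    stats.min <= v && v <= stats.max
}

/// `col IN (v1, v2, ...)` treated as `col = v1 OR col = v2 OR ...`:
/// the container must be kept if any single equality may match.
fn in_list_may_match(stats: ContainerStats, list: &[i64]) -> bool {
    list.iter().any(|&v| eq_may_match(stats, v))
}
```

The cost of this form grows with the list length, since every value adds one more comparison per container, which is presumably why the rewrite is only applied to short lists.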
We have recently been improving the code in this area -- see https://github.com/apache/arrow-datafusion/pull/8440 for example. Maybe we can update the PruningPredicate logic to use the contained
API more, or to rule out containers based on their min/max values.
Specifically, we could figure out the min and max values in the list and then compare them against the actual min/max values of the columns 🤔
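A rough sketch of that min/max comparison, again standalone and not taken from DataFusion. It assumes an integer column, and `list_bounds` and `can_skip_container` are hypothetical helper names. Note that this check is coarser than testing each value: a list whose range straddles the container is kept even if none of its values falls inside.

```rust
/// Smallest and largest value in the IN list, or None for an empty list.
fn list_bounds(list: &[i64]) -> Option<(i64, i64)> {
    let min = *list.iter().min()?;
    let max = *list.iter().max()?;
    Some((min, max))
}

/// A container with column range [col_min, col_max] can be skipped when the
/// whole list lies strictly below or strictly above that range.
fn can_skip_container(col_min: i64, col_max: i64, list: &[i64]) -> bool {
    match list_bounds(list) {
        // No bounds to compare against: conservatively keep the container.
        None => false,
        Some((lo, hi)) => hi < col_min || lo > col_max,
    }
}
```

The appeal is that the per-container work is two comparisons regardless of list length, once the list bounds have been computed.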
Thanks for your advice. Can't wait for #8440.
FYI I think @yahoNanJing is in the process of implementing this feature https://github.com/apache/arrow-datafusion/pull/8669
Is your feature request related to a problem or challenge?
When I use an In_list Expr with a list of length 19, the query takes 6 ms, but when the length grows to 20, it takes 200 ms.
Describe the solution you'd like
In build_predicate_expression, the in-list expression is only pushed down for pruning when
in_list.list().len() < 20
I would like this threshold to be configurable.
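As an illustration of what is being asked for, here is a minimal sketch of the threshold as an option. `InListPruningOptions`, `max_in_list_len`, and `should_push_down` are hypothetical names and not existing DataFusion settings. The default of 20 mirrors the hard-coded check above.

```rust
/// Hypothetical option struct; not an existing DataFusion setting.
#[derive(Debug, Clone, Copy)]
struct InListPruningOptions {
    /// IN lists with fewer items than this are used for pruning.
    max_in_list_len: usize,
}

impl Default for InListPruningOptions {
    fn default() -> Self {
        // Same cutoff as the hard-coded `in_list.list().len() < 20` check.
        Self { max_in_list_len: 20 }
    }
}

/// Decide whether an IN list of `list_len` items should be used for pruning.
fn should_push_down(list_len: usize, opts: InListPruningOptions) -> bool {
    list_len < opts.max_in_list_len
}
```

With the default, a 19-item list is pushed down and a 20-item list is not. Raising `max_in_list_len` would let longer lists take part in pruning.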
Describe alternatives you've considered
I think we could add a config option in ParquetOptions and ParquetExec,
but I also think that is ugly. Is there a more elegant implementation?
Additional context
No response