Open domodwyer opened 1 year ago
I agree this would be a nice improvement to the pruning logic.
I am not sure how easy this will be to implement with the current implementation of PruningPredicate
-- It may need something more substantial like https://github.com/apache/arrow-datafusion/issues/7887
Yeah I took a brief look and came to a similar conclusion - it seems to be a significant rework to enable this behaviour as-is. #7887 sounds like a great idea!
I think this also affects scanning parquet files with "evolved" schemas -- namely where missing columns are replaced by NULL
For example, in a parquet table that has two files:
file1
: Columns a
and b
file2
: Columns b
(does not have column a
If there is a query with a predicate on a
, like WHERE a = 5
the current logic in the parquet reader will rewrite this to NULL = a
when scanning file2
but is not then smart enough to understand the expression can thus never be true and should skip the file. The same reasoning can be applied to many more complicated expressions that can never evaluate to true for file2
such as
WHERE a = 5 AND b = 'foo;
WHERE CASE WHEN a > 5 THEN b = 'foo' WHEN a < 5 THEN b = 'bar' ELSE false END
WHERE a IN (1,2,3)
Here is an idea on how to extend PruningPredicate to handle this case
PruningPredicate
can't be told about columns that are known to contain only NULL
. It can be told which columns have no nulls (via the PruningStatistics::null_counts()
).
I think we could teach PruningPredicate
about all null colums like this:
PruningStatistics::row_counts()
to get the total row counts in each container. PruningStatistics::row_counts()
and PruningStatistics::null_counts()
to determine containers where columns are entirely NULLNULL
with a NULL
literal and try to simplify the expressions (e.g. a = 5
--> NULL = 5
--> NULL
)For the example in this ticket's description with predicate col_a != A AND col_b='bananas'
where col_b
is not known and the relevant container had 100
rows,
PruningStatistics
would return col_b: {null_count = 100, row_count = 100}
PruningPredicate::prune
would determine col_b
was entirely null, and would rewrite the predicate to be col_a != A AND NULL = 'bananas'
.col_b
and thus could be proven to be not true.I plan to work on this this week
Next steps: @appletreeisyellow and I will write up a proposal and share it around
We will post a proposal sometime this week
Update is that @appletreeisyellow and I have been working on a design for this feature, and we hope to have a proposal later this week
After quite a bit of study, @appletreeisyellow and I have realized that this ticket describes something different than knowing a column is ALL null. This ticket describes handling predicates with only known information and
The predicate can reference columns that are not currently in memory or known at query time.
In the usecase we have in InfluxDB this means getting a predicate that references columns that nothing is known about at all (neither the schema nor the values are known).
I am not sure how common this usecase is across implementations, and I can not come up with any good solution.
Thus, I filed a separate ticket to track handling columns whose types are known, and are known to be all NULL
(which is also a usecase we have in InfluxDB), which I do think is a common usecase that is of general use: https://github.com/apache/arrow-datafusion/issues/9171
Let's focus on https://github.com/apache/arrow-datafusion/issues/9171 and then come back to this idea
Update here is that @appletreeisyellow has much improved the ability of DataFusion to prune when a column is known to be entirely null (aka https://github.com/apache/datafusion/issues/9171)
What remains is the ability to prune when nothing (including not knowing schema) is known about the column which exists in certain places in our InfluxDB 3.0 codebase.
From my perspective, given the potential complexity of implementing this feature in DataFusion we may decide not to work on it ever and pursue other ways of achieving the same end. I don't think we have decided
I think I've gotten this working. I pass in a schema that only has the columns I have stats for. Other columns seem to be ignored. I also wrap the predicate with (<predicate>) IS NOT FALSE
which unless I'm getting my tri state logic wrong means that any nulls in any column means we can't exclude that row/page/row group, allowing for schema evolution in collected statistics.
Is your feature request related to a problem or challenge?
At query time, our use case requires that we evaluate predicates against in-memory data that may have a schema that is a subset of the table schema. The predicate can reference columns that are not currently in memory or known at query time.
For example, given the following in-memory data:
We may have to evaluate a predicate such as
col_a != A AND col_b=bananas
. Wherecol_b
is not present in the in-memory schema / unknown at pruning time, but is a valid column for the table in the system as a whole.Because at query time we have a limited subset of the schema, the schema and statistics provided when constructing the
PruningPredicate
covers onlycol_a, value
.However the
col_a != A
portion of the predicate can be proven FALSE irrespective ofcol_b
. Unfortunately constructing thePruningPredicate
eagerly validates the presence of statistics for all columns in the predicate, and errors stating that there are no fields namedcol_b
before attempting to evaluate any portion of the predicate.Describe the solution you'd like
Attempt to evaluate the predicate based on the available statistics, and return FALSE if possible. If the predicate cannot be proven FALSE, return a "missing column" error as it does today.
For the example above, ideally pruning should return FALSE as it can be proven that
col_a != A
is FALSE even thoughcol_b
is unknown at pruning time.Describe alternatives you've considered
Inserting NULL statistics into the pruning schema to satisfy the presence check - this works around the issue, but unfortunately requires extra processing to prevent the missing field error.
Additional context
This change in behaviour might need sticking behind a flag/option to opt into, rather than being the default.