Don't error on unknown column when pruning if predicate can still be proven false

domodwyer commented 1 year ago

Is your feature request related to a problem or challenge?

At query time, our use case requires that we evaluate predicates against in-memory data that may have a schema that is a subset of the table schema. The predicate can reference columns that are not currently in memory or known at query time.

For example, given the following in-memory data:

col_a	value
A	42

We may have to evaluate a predicate such as col_a != A AND col_b=bananas. Where col_b is not present in the in-memory schema / unknown at pruning time, but is a valid column for the table in the system as a whole.

Because at query time we have a limited subset of the schema, the schema and statistics provided when constructing the PruningPredicate covers only col_a, value.

However the col_a != A portion of the predicate can be proven FALSE irrespective of col_b. Unfortunately constructing the PruningPredicate eagerly validates the presence of statistics for all columns in the predicate, and errors stating that there are no fields named col_b before attempting to evaluate any portion of the predicate.

Describe the solution you'd like

Attempt to evaluate the predicate based on the available statistics, and return FALSE if possible. If the predicate cannot be proven FALSE, return a "missing column" error as it does today.

For the example above, ideally pruning should return FALSE as it can be proven that col_a != A is FALSE even though col_b is unknown at pruning time.

Describe alternatives you've considered

Inserting NULL statistics into the pruning schema to satisfy the presence check - this works around the issue, but unfortunately requires extra processing to prevent the missing field error.

Additional context

This change in behaviour might need sticking behind a flag/option to opt into, rather than being the default.

alamb commented 1 year ago

I agree this would be a nice improvement to the pruning logic.

alamb commented 1 year ago

I am not sure how easy this will be to implement with the current implementation of PruningPredicate -- It may need something more substantial like https://github.com/apache/arrow-datafusion/issues/7887

domodwyer commented 1 year ago

Yeah I took a brief look and came to a similar conclusion - it seems to be a significant rework to enable this behaviour as-is. #7887 sounds like a great idea!

alamb commented 10 months ago

I think this also affects scanning parquet files with "evolved" schemas -- namely where missing columns are replaced by NULL

For example, in a parquet table that has two files:

file1: Columns a and b
file2: Columns b (does not have column a

If there is a query with a predicate on a, like WHERE a = 5 the current logic in the parquet reader will rewrite this to NULL = a when scanning file2 but is not then smart enough to understand the expression can thus never be true and should skip the file. The same reasoning can be applied to many more complicated expressions that can never evaluate to true for file2 such as

WHERE a = 5 AND b = 'foo;
WHERE CASE WHEN a > 5 THEN b = 'foo' WHEN a < 5 THEN b = 'bar' ELSE false END
WHERE a IN (1,2,3)

alamb commented 10 months ago

Here is an idea on how to extend PruningPredicate to handle this case

Problem:

PruningPredicate can't be told about columns that are known to contain only NULL. It can be told which columns have no nulls (via the PruningStatistics::null_counts()).

I think we could teach PruningPredicate about all null colums like this:

Add a new method PruningStatistics::row_counts() to get the total row counts in each container.
Use the information from PruningStatistics::row_counts() and PruningStatistics::null_counts() to determine containers where columns are entirely NULL
Rewrite the predicate, replacing references to columns known to be NULL with a NULL literal and try to simplify the expressions (e.g. a = 5 --> NULL = 5 --> NULL)

For the example in this ticket's description with predicate col_a != A AND col_b='bananas' where col_b is not known and the relevant container had 100 rows,

the relevant PruningStatistics would return col_b: {null_count = 100, row_count = 100}
PruningPredicate::prune would determine col_b was entirely null, and would rewrite the predicate to be col_a != A AND NULL = 'bananas'.
The pruning rewrite would happen again, and this time would not try to fetch min/max statistics for col_b and thus could be proven to be not true.

alamb commented 10 months ago

I plan to work on this this week

alamb commented 10 months ago

Next steps: @appletreeisyellow and I will write up a proposal and share it around

appletreeisyellow commented 10 months ago

We will post a proposal sometime this week

alamb commented 9 months ago

Update is that @appletreeisyellow and I have been working on a design for this feature, and we hope to have a proposal later this week

alamb commented 9 months ago

After quite a bit of study, @appletreeisyellow and I have realized that this ticket describes something different than knowing a column is ALL null. This ticket describes handling predicates with only known information and

The predicate can reference columns that are not currently in memory or known at query time.

In the usecase we have in InfluxDB this means getting a predicate that references columns that nothing is known about at all (neither the schema nor the values are known).

I am not sure how common this usecase is across implementations, and I can not come up with any good solution.

Thus, I filed a separate ticket to track handling columns whose types are known, and are known to be all NULL (which is also a usecase we have in InfluxDB), which I do think is a common usecase that is of general use: https://github.com/apache/arrow-datafusion/issues/9171

alamb commented 9 months ago

Let's focus on https://github.com/apache/arrow-datafusion/issues/9171 and then come back to this idea

alamb commented 6 months ago

Update here is that @appletreeisyellow has much improved the ability of DataFusion to prune when a column is known to be entirely null (aka https://github.com/apache/datafusion/issues/9171)

What remains is the ability to prune when nothing (including not knowing schema) is known about the column which exists in certain places in our InfluxDB 3.0 codebase.

From my perspective, given the potential complexity of implementing this feature in DataFusion we may decide not to work on it ever and pursue other ways of achieving the same end. I don't think we have decided

adriangb commented 1 month ago

I think I've gotten this working. I pass in a schema that only has the columns I have stats for. Other columns seem to be ignored. I also wrap the predicate with (<predicate>) IS NOT FALSE which unless I'm getting my tri state logic wrong means that any nulls in any column means we can't exclude that row/page/row group, allowing for schema evolution in collected statistics.

apache / datafusion