apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] If a projected_schema is not supplied but a bound projection expression is then we should use that to infer the projected_schema #31176

Closed. asfimport closed this issue 2 years ago.

asfimport commented 2 years ago

The following query should read a single column from the target parquet file.


open_dataset("lineitem.parquet") %>% select(l_tax) %>% filter(l_tax < 0.01) %>% collect()

Furthermore, it should apply a pushdown filter to the source node allowing parquet row groups to potentially filter out target data.
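To illustrate what row-group pushdown buys you, here is a minimal, hypothetical sketch (plain Python, not Arrow's actual API; `RowGroup` and `can_skip` are made-up names) of how per-row-group min/max statistics let a Parquet reader skip row groups that cannot match the predicate `l_tax < 0.01`:

```python
from dataclasses import dataclass

# Hypothetical stand-in for Parquet row-group column statistics.
@dataclass
class RowGroup:
    l_tax_min: float
    l_tax_max: float

def can_skip(rg: RowGroup, threshold: float) -> bool:
    # For the predicate l_tax < threshold, a row group whose minimum
    # l_tax value is already >= threshold cannot contain any matches.
    return rg.l_tax_min >= threshold

row_groups = [RowGroup(0.00, 0.02), RowGroup(0.05, 0.08), RowGroup(0.01, 0.03)]
to_read = [i for i, rg in enumerate(row_groups) if not can_skip(rg, 0.01)]
print(to_read)  # → [0]: only the first row group may contain l_tax < 0.01
```

Without the pushdown filter on the source node, every row group would be read and the rows discarded later by the FilterNode.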

At the moment it creates the following exec plan:


3:SinkNode{}
  2:ProjectNode{projection=[l_tax]}
    1:FilterNode{filter=(l_tax < 0.01)}
      0:SourceNode{}

There is no projection or filter in the source node. As a result we end up reading much more data from disk (the entire file) than we need to (at most a single column).

This could be fixed via heuristics in the dplyr code. However, it may quickly get complex. For example, the project comes after the filter, so you need to make sure the pushed-down projection includes both the columns accessed by the filter and the columns accessed by the projection. Or: can you push a projection down through a join (yes, you can), and how do you know which columns to apply to which source node?
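The first heuristic above (the scan must cover the columns used by both the filter and the final projection, since the project sits above the filter) can be sketched as a simple union. This is a hypothetical illustration, not code from the arrow package:

```python
def required_columns(projected_cols, filter_cols):
    """Columns the scan must materialize: the union of what the final
    projection selects and what the filter references, in first-seen order."""
    seen, out = set(), []
    for c in list(projected_cols) + list(filter_cols):
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out

# select(l_tax) %>% filter(l_tax < 0.01): both reference only l_tax,
# so the scan needs just one column.
print(required_columns(["l_tax"], ["l_tax"]))  # → ['l_tax']

# filter(l_shipdate < d) %>% select(l_tax): the filter column must also
# be scanned even though it is dropped from the final output.
print(required_columns(["l_tax"], ["l_shipdate"]))  # → ['l_tax', 'l_shipdate']
```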

A more complete fix would be to call into some kind of third-party optimizer (e.g. Calcite) after the plan has been created by dplyr but before it is passed to the execution engine.

Reporter: Weston Pace / @westonpace Assignee: Weston Pace / @westonpace

PRs and other links:

Note: This issue was originally created as ARROW-15726. Please see the migration documentation for further details.

asfimport commented 2 years ago

Neal Richardson / @nealrichardson: We pass filter and columns for predicate pushdown to create the scan node: https://github.com/apache/arrow/blob/master/r/R/query-engine.R#L92

https://github.com/apache/arrow/blob/master/r/src/compute-exec.cpp#L142-L158

So what's not working correctly here?

asfimport commented 2 years ago

Jonathan Keane / @jonkeane: This might be a coincidence (though I suspect not...): our dataset benchmarks are suddenly failing, with at least some of the failures caused by attempting to read data that should not be read [1].

Elena has helped narrow down the range of possible commits that this could have happened in:

The earliest commit this might have happened in is https://github.com/apache/arrow/commit/a935c81b595d24179e115d64cda944efa93aa0e0, and it definitely happens by https://github.com/apache/arrow/commit/afaa92e7e4289d6e4f302cc91810368794e8092b, so the regression was introduced in that commit or earlier.

Here's an example buildkite log: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/145#ebd7ea7a-3fad-4865-9e73-49200d89ddd6/6-3230

[1] This is a bit in the weeds, so bear with me: the dataset we use for these benchmarks includes data in an early year that causes a schema failure (Unsupported cast from string to null using function cast_null). The benchmarks we wrote deliberately avoid selecting anything from that first year, so when filter pushdown is working we never hit the error. This had been working for a while now, even with exec nodes, but about three days ago it stopped, and the benchmarks started failing.

asfimport commented 2 years ago

Weston Pace / @westonpace: Ok. The projection problem is my fault, due to a change I introduced in https://github.com/apache/arrow/commit/a935c81b595d24179e115d64cda944efa93aa0e0 to make scanning easier. The change created a default behavior if the user did not specify a projection or projected_schema. Unfortunately, if the user supplies a bound projection expression but not a projected_schema (which is what R does), we could (and previously did) infer the projected_schema from the fields referenced by the projection expression, and the new default no longer did that.
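The inference described above can be sketched with a toy expression tree (plain Python, not Arrow's C++ classes; `FieldRef` and `Call` are hypothetical stand-ins for Arrow's expression nodes): walk the bound projection expression and collect the fields it references, and those fields form the inferred projected schema.

```python
from dataclasses import dataclass

@dataclass
class FieldRef:
    name: str

@dataclass
class Call:
    function: str
    args: list

def referenced_fields(expr, out=None):
    """Collect field names referenced anywhere in the expression tree,
    in first-seen order and without duplicates."""
    if out is None:
        out = []
    if isinstance(expr, FieldRef):
        if expr.name not in out:
            out.append(expr.name)
    elif isinstance(expr, Call):
        for arg in expr.args:
            referenced_fields(arg, out)
    return out

# A projection like make_struct(l_tax) references only l_tax, so the
# inferred projected schema contains that single field.
proj = Call("make_struct", [FieldRef("l_tax")])
print(referenced_fields(proj))  # → ['l_tax']
```

With this inference in place, the scan node knows it only needs to load `l_tax` even though the caller never supplied a projected_schema explicitly.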

The filter problem wasn't actually a problem; it was an issue with printing. We don't print the pushdown filter when printing the scan node, so I had assumed it wasn't specified, but it was in fact there.

asfimport commented 2 years ago

Weston Pace / @westonpace: Issue resolved by pull request 12466 https://github.com/apache/arrow/pull/12466