Closed: Veiasai closed this issue 2 months ago
One more suggestion: actually, could we return a dynamic filter push-down flag?
pub enum TableProviderFilterPushDown {
/// The expression cannot be used by the provider.
Unsupported,
/// The expression can be used to reduce the data retrieved,
/// but the provider cannot guarantee it will omit all tuples that
/// may be filtered. In this case, DataFusion will apply an additional
/// `Filter` operation after the scan to ensure all rows are filtered correctly.
Inexact,
/// The provider **guarantees** that it will omit **all** tuples that are
/// filtered by the filter expression. This is the fastest option, if available
/// as DataFusion will not apply additional filtering.
Exact,
}
When the expression only references partition columns, we should return Exact.
Thanks for taking the time to write a test @Veiasai ! I'll take a look at this shortly
hey, any updates?
@rtyler I have a local fix for this issue. I am not sure what the Delta protocol dictates, but in some of our test tables the partitioning columns appear in a different order in the JSON schema than in the partition columns array.
_arrow_schema
uses an iterator with a chain and two filters on the schema, while the rest of the code (e.g. DeltaScanBuilder::build) filters the partition columns out and then appends them explicitly in the order dictated by partition_columns.
This is the essence of my fix:
fn _arrow_schema(snapshot: &Snapshot, wrap_partitions: bool) -> DeltaResult<ArrowSchemaRef> {
let meta = snapshot.metadata();
let schema = meta.schema()?;
let fields = schema
.fields()
.filter(|f| !meta.partition_columns.contains(&f.name().to_string()))
.map(|f| f.try_into())
.chain(
// keep consistent order of partitioning columns
meta.partition_columns.iter().map(|partition_col| {
let f = schema.field(partition_col).unwrap();
let field = Field::try_from(f)?;
// ...
LMK if this is enough as a pointer or if I should send a PR with this.
@rtyler I've sent a PR just in case https://github.com/delta-io/delta-rs/pull/2614
Would be glad to add some tests if you point me at the correct suite or an example, I was looking for a test with more than one partitioning column and didn't find anything.
Environment
Delta-rs version: 0.17.3
Binding: Rust
Environment: Linux
Bug
What happened: The filter expression didn't return the expected rows. My table is relatively big, so I tried to construct a minimal test to reproduce it; see the code below. Besides, from what I see in the log, my guess is: the scan reports Inexact filter pushdown, so DataFusion applies the same filter again, but the physical plan gets the wrong column index.
What you expected to happen:
How to reproduce it:
I wrote a unit test to check it, but it seems like I don't have permission to push it?
More details: