Open jtanx opened 1 week ago
I think there is more to this, I'm seeing some unexpected results even for physical fields:
Matching on physical String field
let predicate = Expression::binary(
BinaryOperator::Equal,
Expression::Column(ColumnName::new(["id"])),
Expression::literal("A"),
);
(cd kernel/examples/read-table-single-threaded; cargo run -- ../../tests/data/app-txn-checkpoint)
returns all the records without any filtering. I guess this could be explained by the comment on with_predicate
saying /// NOTE: The filtering is best-effort and can produce false positives (rows that should should have been filtered out but were kept).
Matching on physical Integer field
let predicate = Expression::binary(
BinaryOperator::GreaterThan,
Expression::Column(ColumnName::new(["value"])),
6,
);
(cd kernel/examples/read-table-single-threaded; cargo run -- ../../tests/data/app-txn-checkpoint)
filters out records with value
being 1, 2 and 3 but still shows 4, 5, 6, 7 and greater. Could be explained by the same comment as above.
Matching on physical Long field
let predicate = Expression::binary(
BinaryOperator::GreaterThan,
Expression::Column(ColumnName::new(["number"])),
6 as i64,
);
(cd kernel/examples/read-table-single-threaded; cargo run -- ../../../acceptance/tests/dat/out/reader_tests/generated/multi_partitioned/delta/)
this seems to work fine
Matching on logical (partition) String field
let predicate = Expression::binary(
BinaryOperator::GreaterThan,
Expression::Column(ColumnName::new(["letter"])),
Expression::literal("a"),
);
(cd kernel/examples/read-table-single-threaded; cargo run -- ../../../kernel/tests/data/basic_partitioned/)
returns 0 records even though there is a partition with letter=a
with valid records. I can't find a justification for this.
Lastly, using a predicate with non-existing column name does not throw and just returns no records. That doesn't look like a right behavior to me.
A bit of further digging confirms that the predicate is only used for physical fields so your initial guess was right @jtanx
As for the other issues I listed above, they can probably be explained by examining how predicates work inparquet::arrow::async_reader::ParquetRecordBatchStreamBuilder
which is used by the default engine
Not sure if I'm just doing something wrong but modifying the read-table-single-threaded example to apply a predicate, e.g.
This only works if the column name I pass in is the physical name, not the logical name