delta-incubator / delta-kernel-rs

A native Delta implementation for integration with any query engine
Apache License 2.0
143 stars 39 forks source link

Predicate filtering does not apply column mapping #424

Open jtanx opened 1 week ago

jtanx commented 1 week ago

Not sure if I'm just doing something wrong but modifying the read-table-single-threaded example to apply a predicate, e.g.

let predicate = Expression::binary(BinaryOperator::GreaterThan, Expression::column("column_name"), 123);
let scan= snapshot
        .into_scan_builder()
        .with_schema_opt(read_schema_opt)
        .with_predicate(Arc::new(predicate))
        .build()?;

This only works if the column name I pass in is the physical name, not the logical name

r3stl355 commented 6 days ago

I think there is more to this, I'm seeing some unexpected results even for physical fields:

  1. Matching on physical String field

    let predicate = Expression::binary(
        BinaryOperator::Equal,
        Expression::Column(ColumnName::new(["id"])),
        Expression::literal("A"),
    );
    (cd kernel/examples/read-table-single-threaded; cargo run -- ../../tests/data/app-txn-checkpoint)

    returns all the records without any filtering. I guess this could be explained by the comment on with_predicate saying /// NOTE: The filtering is best-effort and can produce false positives (rows that should should have been filtered out but were kept).

  2. Matching on physical Integer field

    let predicate = Expression::binary(
        BinaryOperator::GreaterThan,
        Expression::Column(ColumnName::new(["value"])),
        6,
    );
    (cd kernel/examples/read-table-single-threaded; cargo run -- ../../tests/data/app-txn-checkpoint)

    filters out records with value being 1, 2 and 3 but still shows 4, 5, 6, 7 and greater. Could be explained by the same comment as above.

  3. Matching on physical Long field

    let predicate = Expression::binary(
        BinaryOperator::GreaterThan,
        Expression::Column(ColumnName::new(["number"])),
        6 as i64,
    );
    (cd kernel/examples/read-table-single-threaded; cargo run -- ../../../acceptance/tests/dat/out/reader_tests/generated/multi_partitioned/delta/)

    this seems to work fine

  4. Matching on logical (partition) String field

    let predicate = Expression::binary(
        BinaryOperator::GreaterThan,
        Expression::Column(ColumnName::new(["letter"])),
        Expression::literal("a"),
    );
    (cd kernel/examples/read-table-single-threaded; cargo run -- ../../../kernel/tests/data/basic_partitioned/)

    returns 0 records even though there is a partition with letter=a with valid records. I can't find a justification for this.

  5. Lastly, using a predicate with non-existing column name does not throw and just returns no records. That doesn't look like a right behavior to me.

r3stl355 commented 6 days ago

A bit of further digging confirms that the predicate is only used for physical fields so your initial guess was right @jtanx

As for the other issues I listed above, they can probably be explained by examining how predicates work inparquet::arrow::async_reader::ParquetRecordBatchStreamBuilder which is used by the default engine