apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Predicate pushdown through arrow::dataset::ScanBuilder::Filter() not available on list fields #41651

Open rouault opened 5 months ago

rouault commented 5 months ago

Describe the enhancement requested

This request is a continuation of the enhancement made in https://github.com/apache/arrow/pull/39065, which added support for nested fields where the nesting type is a struct.

Here I would like to apply a predicate pushdown on the x subfield of a list<element: struct<x: double not null, y: double not null>>

required group field_id=-1 schema {
  optional binary field_id=-1 id (String);
  optional group field_id=-1 geometry (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 element {
        required double field_id=-1 x;
        required double field_id=-1 y;
      }
    }
  }
}

When I try to pass the following expression to arrow::dataset::ScanBuilder::Filter(),

// Nested path to the x subfield of the list's struct element
auto fieldRefX = arrow::FieldRef(arrow::FieldRef("geometry", "element"), "x");
expression = cp::less_equal(cp::field_ref(fieldRefX), cp::literal(m_sFilterEnvelope.MaxX));

I get the following error: nested paths only supported for structs

(I tried to remove that check, but I then get the following error: Function 'struct_field' has no kernel matching input types (list<element: struct<x: double not null, y: double not null>>))

Beyond the technical difficulties of implementing this, I guess there is a potential ambiguity in what such filtering means. Would a row be selected if all corresponding entries in the list match the predicate, or if just one does? For my use case (spatial filtering applied directly on GeoArrow struct/separated encoded geometry columns, for non-Point geometry types, in GeoParquet files), the latter is what I'm looking for.

CC @jorisvandenbossche @paleolimbot

Component(s)

C++, Parquet

jorisvandenbossche commented 5 months ago

> I guess there's a potential ambiguity of what such filtering means. Would that mean that a row is selected if all corresponding entries in the list match the predicate, or if just one would?

Yes, I think this is the crux of the issue. We would first need an additional scalar kernel that applies the predicate to the list elements and combines the results with a reduction (like any/all in the case of boolean predicates), so that the resulting kernel is still a scalar kernel for the field (i.e. it preserves the shape and can be used as a filter predicate).
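
To make the any/all distinction concrete, here is a minimal sketch in plain C++ that does not use the Arrow API. The names Point, Row, Reduction and FilterMask are hypothetical and exist only for this illustration; no such kernel is implied to exist in Arrow. It applies a predicate to each list element and reduces per row, so the output has one boolean per input row and could serve as a filter mask. Null handling is ignored here.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical stand-in for one struct<x, y> element of the geometry list.
struct Point {
  double x;
  double y;
};

// One row of the list column: zero or more points.
using Row = std::vector<Point>;

// How the per-element results are combined into one value per row.
enum class Reduction { kAny, kAll };

// Apply `pred` to the elements of each row, then reduce within the row.
// The result has exactly one boolean per input row, so the shape of the
// column is preserved. Note that for an empty list, std::any_of yields
// false and std::all_of yields true.
std::vector<bool> FilterMask(const std::vector<Row>& rows,
                             const std::function<bool(const Point&)>& pred,
                             Reduction reduction) {
  std::vector<bool> mask;
  mask.reserve(rows.size());
  for (const Row& row : rows) {
    const bool keep = reduction == Reduction::kAny
                          ? std::any_of(row.begin(), row.end(), pred)
                          : std::all_of(row.begin(), row.end(), pred);
    mask.push_back(keep);
  }
  return mask;
}
```

With a predicate such as x <= MaxX, the kAny variant corresponds to the "just one entry matches" reading requested in the issue, while kAll corresponds to the "all entries match" reading. The two also differ on empty lists, which is one more semantic choice a real kernel would have to settle.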