apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Java] Push-down filtering in Java #14782

Open noahfrn opened 1 year ago

noahfrn commented 1 year ago

Push-down filtering (Java) questions

Hey all,

I'm exploring adding push-down filtering to the Java Datasets API, as it currently only supports push-down projection. Just had a couple of questions before I start this work:

Thanks!

Component(s)

C++, Java

lidavidm commented 1 year ago

CC @lwhite1 @davisusanibar

An alternative might be to accept Substrait plans instead, binding to Acero as a whole instead of just Dataset. This would give you the full power of the query engine and avoid having to create bindings to individual C++ components. But of course this is more complex and Substrait is a bit of a moving target.

Another alternative is to track the discussion about a text format for expressions. See https://lists.apache.org/thread/7vch27t3gfz1hmv7d8w69n50gfc1nswf and #14287.

noahfrn commented 1 year ago

Thanks for your help @lidavidm - that expression parsing PR looks like exactly what I wanted!

ianmcook commented 1 year ago

There is a discussion here about passing Substrait expressions to the Dataset Project and Filter methods: https://github.com/apache/arrow/issues/33985#issuecomment-1435319522

If this gets implemented in C++, can it be exposed to Java through the JNI bindings?

lidavidm commented 1 year ago

I think this is just about having a convenient user-facing API to the existing Dataset functionality. A Substrait API to Acero in Java would be a separate project.

ianmcook commented 1 year ago

Ok, thanks. To clarify: my question is not about Substrait plans. It's about Substrait expressions, which we can now represent independently of plans. I opened a separate issue to request this feature: https://github.com/apache/arrow/issues/34252

lidavidm commented 1 year ago

I think that without a convenient API to build Substrait expressions in Java, it still wouldn't quite meet the goals here, right?

noahfrn commented 1 year ago

Agreed, @lidavidm. Something like https://github.com/substrait-io/substrait-java/issues/128 would provide the necessary functionality here.

ianmcook commented 1 year ago

https://github.com/apache/arrow/issues/34252 is now complete, and the Arrow Java Datasets API now supports pushdown projection and filtering using Substrait expressions.

More details here: https://github.com/apache/arrow/blob/main/docs/source/java/dataset.rst#projection-produce-new-columns-and-filters

Example here: https://github.com/apache/arrow/blob/main/docs/source/java/substrait.rst#executing-projections-and-filters-using-extended-expressions

However, this capability is not really ready for practical applications yet, because we do not yet have any user-friendly tools for creating Substrait expressions in Java. I hope we can achieve that in https://github.com/substrait-io/substrait-java/issues/128.
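
For readers unfamiliar with the term, here is a minimal, library-free sketch of what "push-down" filtering means in general. It is not the Arrow Datasets API: the class `PushDownSketch`, both method names, and the use of `java.util.function.Predicate` in place of a Substrait expression are hypothetical stand-ins chosen only for illustration. Each inner list plays the role of one dataset fragment (for example, a file).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only: none of these names belong to the Arrow API.
class PushDownSketch {

    // Without push-down: the scan hands every row to the caller, who filters afterwards.
    static List<Integer> scanThenFilter(List<List<Integer>> fragments, Predicate<Integer> filter) {
        List<Integer> materialized = new ArrayList<>();
        for (List<Integer> fragment : fragments) {
            materialized.addAll(fragment); // every row crosses the scan boundary
        }
        List<Integer> result = new ArrayList<>();
        for (Integer row : materialized) {
            if (filter.test(row)) {
                result.add(row);
            }
        }
        return result;
    }

    // With push-down: the scan receives the filter and only emits matching rows.
    static List<Integer> scanWithPushDown(List<List<Integer>> fragments, Predicate<Integer> filter) {
        List<Integer> result = new ArrayList<>();
        for (List<Integer> fragment : fragments) {
            for (Integer row : fragment) {
                if (filter.test(row)) {
                    result.add(row); // non-matching rows never leave the scan
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> fragments = List.of(List.of(1, 5, 9), List.of(12, 3, 20));
        Predicate<Integer> greaterThanEight = value -> value > 8;
        // Both strategies return the same rows; they differ in where the filter runs.
        System.out.println(scanThenFilter(fragments, greaterThanEight));
        System.out.println(scanWithPushDown(fragments, greaterThanEight));
    }
}
```

Both methods produce the same rows. The point of the second form is that the filter travels with the scan request, which is the role the serialized Substrait expression plays when it is handed to the native scanner.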