apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.55k stars 3.54k forks source link

[C++] Improve Substrait consumer example #33956

Open ianmcook opened 1 year ago

ianmcook commented 1 year ago

Describe the enhancement requested

The example at cpp/examples/arrow/engine_substrait_consumption.cc has some shortcomings that make it less useful than it could be for demonstrating how Acero can consume and execute Substrait plans:

Component(s)

C++

westonpace commented 1 year ago

The extension references and anchors are silly

That was partly by design to emphasize that these can be any arbitrary integers. I find it makes it much more obvious how they are related. For example, if we used 0 instead of 7 then it would not be obvious that extension_uri_reference is a reference to extension_uri_anchor.

include <arrow/engine/api.h> needs to be changed to #include <arrow/engine/substrait/api.h>

That seems a little surprising to me. What error did you get using <arrow/engine/api.h>?

ianmcook commented 1 year ago

The extension references and anchors are silly

That was partly by design to emphasize that these can be any arbitrary integers. I find it makes it much more obvious how they are related. For example, if we used 0 instead of 7 then it would not be obvious that extension_uri_reference is a reference to extension_uri_anchor.

Ah, got it, fair enough.

include <arrow/engine/api.h> needs to be changed to #include <arrow/engine/substrait/api.h>

That seems a little surprising to me. What error did you get using <arrow/engine/api.h>?

That file does not exist when I build and install the Arrow library (using a recent dev commit) :

% tree /usr/local/include/arrow/engine
/usr/local/include/arrow/engine
└── substrait
    ├── api.h
    ├── extension_set.h
    ├── extension_types.h
    ├── options.h
    ├── pch.h
    ├── relation.h
    ├── serde.h
    ├── test_plan_builder.h
    ├── type_fwd.h
    ├── util.h
    └── visibility.h
ianmcook commented 1 year ago

The user needs to find/create a Parquet file to run it

To solve this, maybe we could just add a comment saying something like this:

To create a Parquet file example.parquet to use with this example, run this Python code:

import pyarrow as pa
import pyarrow.parquet as pq
pq.write_table(pa.Table.from_pydict({'i': [1,2,3,4], 'b': [True,False,True,False]}), 'example.parquet')
vibhatha commented 1 year ago

take

westonpace commented 1 year ago

That file does not exist when I build and install the Arrow library (using a recent dev commit) :

Oops. https://github.com/apache/arrow/pull/33964

vibhatha commented 1 year ago

Looking into the example after many iterations since the initial PR, I think the most intuitive and helpful example could be to use a data source with a NamedTableProvider. We already have many test cases wrapped around that.

This means existing code could completely change. What do you think?

cc @westonpace @ianmcook

ianmcook commented 1 year ago

@vibhatha sounds OK to me, but I defer to Weston

westonpace commented 1 year ago

Agreed. Not only is it simpler (don't need to rely on user to create a parquet file) but it demonstrates even more features :)

vibhatha commented 1 year ago

I will modify accordingly. Thanks for the feedback.

kabicm commented 7 months ago

Is there any update on this? What should be the final example showing how we can consume substrait plans with Acero in C++?