apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Website] Announce Acero Engine #31983

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Given consensus on Acero as the name for the C++ streaming execution engine, it may be time to write a blog post announcing the engine, describing how it's currently available in the ecosystem, and what's happening next with it.

Reporter: Will Jones / @wjones127

Related issues:

Note: This issue was originally created as ARROW-16632. Please see the migration documentation for further details.

asfimport commented 2 years ago

Will Jones / @wjones127: Potential outline:

  1. Motivation
    1. Why streaming?
    2. Introduce the name, where codebase lives
    3. Where does this fit in versus other Arrow-based engines (e.g. DuckDB, DataFusion, Velox)? (No SQL or optimizer, low-level and modular design)
  2. Acero's current state

    1. In the R arrow package, used as the backend for Arrow dplyr queries / Datasets
    2. Results can be retrieved as a Record Batch Reader, which can be passed through Arrow's C data interface. This is used, for example, to create pipelines that do some processing in Acero and some in DuckDB.
    3. In PyArrow, used as backend for datasets(?)
  3. Acero's near future
    1. Substrait support
    2. OpenTelemetry integration
asfimport commented 2 years ago

Weston Pace / @westonpace: This is a good outline.

2 Acero's current state

Acero's public API is probably reaching semi-stable across a good portion of its surface area (e.g. everything we use for TPC-H), but there are still a few changes planned. Acero's internal API (for exec node developers) is still somewhat experimental; there are significant changes coming to the scheduler, among other things.

2.3 In PyArrow, used as a backend for datasets

I'm not sure what the question mark is here for, but this is a correct statement and would be good to mention. I think PyArrow is also adding more and more dplyr-like functionality built on top of datasets.

asfimport commented 2 years ago

Will Jones / @wjones127:

I'm not sure what the question mark is here for, but this is a correct statement and would be good to mention. I think PyArrow is also adding more and more dplyr-like functionality built on top of datasets.

I was just unsure what the state of Python + Acero is. My understanding right now is:

  1. There are some individual functions like join that use Acero under the hood.
  2. Once Ibis is a Substrait producer and Acero a consumer, Ibis will be the (dplyr-equivalent) Python API to write queries with deferred execution.

Does that sound right?

asfimport commented 2 years ago

Ian Cook / @ianmcook: @amol- is probably in the best position to reply to the above comment about Python + Acero.

asfimport commented 2 years ago

Alessandro Molina / @amol-: Analytics functions in PyArrow all use Acero under the hood (filtering, joining, grouping/aggregations). At the moment Acero is deliberately never mentioned in the Python and R documentation because we don't have a clear story about what we actually want to expose to end users. In both Python and R, Acero is a private API that users can't access directly, and so it is undocumented.

The idea is that the interface to Acero in R will be dplyr, while in Python the intention is to build a DataFrame-like API and defer the lazy-evaluation use case to Ibis. In both cases we won't expose Acero as a public API, and the idea is to communicate with Acero only through Substrait in the long term (that's already being implemented in R, and will be done for Python in the near future).

To wrap up, I think at the moment we could talk about Acero as the engine powering dplyr in R and the analytics capabilities in Python. And we should talk about Acero itself only from the C++ point of view, as the R and Python bindings will probably never expose it to the public.

PS: If we want to start promoting Acero to the world, I think we should first work on improving the documentation a bit. A blog post that redirects people to docs they find hard to read and apply might actually be counterproductive, as it could give Acero a reputation for being badly documented. At the moment the only mention of it is https://arrow.apache.org/docs/cpp/streaming_execution.html and it's not very easy to follow (not many explanations, just blocks of code). In comparison, the compute chapter in the Python docs ( https://arrow.apache.org/docs/dev/python/compute.html ) is much more talkative and explains things as it goes.

asfimport commented 2 years ago

Will Jones / @wjones127: Perhaps as a way to advertise the extensibility of the engine, we may wish to highlight ARROW-16083 as a case study.