apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Dataset] Define appropriate abstractions for "fragments" that can handle compute #29771

Open · asfimport opened this issue 3 years ago

asfimport commented 3 years ago

This issue has come up in Flight (ARROW-10524) and Skyhook (ARROW-13607). In both cases there is a desire to scan data from remote data sources, and in both cases the remote sources may be capable of essentially running their own query engine. I went ahead and created a JIRA to capture some of the discussion.

So maybe this is a question of "how does the datasets API handle distributed queries?", which is in turn a subquestion of "what is the future of the datasets API given richer query frontends?"

If we treat the datasets API as a simple query engine frontend limited to scan->filter->project->collect|head|count graphs, then filtering can be pushed down (and the returned rows still come with a guarantee that they satisfy the filter), while projection probably can't be pushed down if there are multiple data sources. Head can be pushed down, but count cannot without some effort.
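To make that scan->filter->project->collect|head|count shape concrete, here is a minimal sketch using the Python bindings to the datasets API (the dataset path and the column names `x` and `y` are placeholders, not anything from this issue):

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Hypothetical local Parquet dataset; any Dataset implementation works.
dataset = ds.dataset("/tmp/my_dataset", format="parquet")

# scan -> filter -> project: the predicate and the column selection are
# handed to each fragment, which may push them down (e.g. via Parquet row
# group statistics); anything that slips through is filtered after the
# read, so the returned rows are still guaranteed to satisfy the filter.
table = dataset.to_table(filter=pc.field("x") > 0, columns=["x", "y"])  # collect
first = dataset.head(5)                                                 # head
count = dataset.count_rows(filter=pc.field("x") > 0)                    # count
```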

If we're thinking of the datasets API as a scan node for a more general query engine, then I think things get complex rather quickly. I'm not sure whether the above rules still apply. For example, a join might combine data from two different sources, and a filter that compares columns on both sides of the join could not be pushed down. I'm sure these problems have been figured out by more general-purpose distributed query engines (which presumably slice the query plan into smaller plans for each individual node).
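As an illustrative sketch (again with hypothetical paths and column names, using pyarrow rather than the C++ API), the cross-source predicate below can only be evaluated after the join brings rows from both sources together; only the single-source filter on the left side is a pushdown candidate:

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Two independent sources (placeholder paths standing in for, say, a
# Flight service and a local Parquet directory).
left = ds.dataset("/tmp/source_a", format="parquet")
right = ds.dataset("/tmp/source_b", format="parquet")

# A predicate that references only one source is still a pushdown
# candidate for that source's scan.
left_tbl = left.to_table(filter=pc.field("key") >= 0)
right_tbl = right.to_table()

# The join, and any predicate comparing columns from both sides, can only
# be evaluated once the two sources meet; neither scan alone can apply it,
# so it cannot be pushed down to either source.
joined = left_tbl.join(right_tbl, keys="key", join_type="inner")
mask = pc.greater(joined["a"], joined["b"])  # "a" from left, "b" from right
result = joined.filter(mask)
```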

Reporter: Weston Pace / @westonpace

Related issues:

Note: This issue was originally created as ARROW-14186. Please see the migration documentation for further details.

asfimport commented 3 years ago

Weston Pace / @westonpace: @JayjeetAtGithub @lidavidm @cpcloud