apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Support file parallelism in AsyncScanner #28181

Closed (asfimport closed this issue 3 years ago)

asfimport commented 3 years ago

Whether we pull from files in parallel is controlled by how we merge the batch streams in AsyncScanner::ScanBatchesUnorderedAsync. Currently we rely on MakeConcatenatedGenerator, which is incorrect because it never reads from more than one file at a time. That workaround is needed because MakeMergedGenerator pulls from its source (an EnumeratingGenerator) in an async-reentrant fashion. MakeMergedGenerator should not do this. If some kind of readahead is truly necessary there, use MakeReadaheadGenerator.
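
To make the difference between concatenating and merging concrete, here is a minimal synchronous sketch. It is not Arrow's API: `FileSource`, `Merge`, and `max_subscriptions` are hypothetical names invented for illustration, and there are no futures or threads involved. It only models how many per-file streams are open at once, assuming each file yields a fixed number of batches.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-in for one file's batch stream (not an Arrow type).
struct FileSource {
  int file_index;
  int remaining;  // batches still to be produced
  int produced;   // batches produced so far

  FileSource(int index, int num_batches)
      : file_index(index), remaining(num_batches), produced(0) {}

  // Writes the next batch label into *out; returns false once exhausted.
  bool Next(std::string* out) {
    if (remaining == 0) return false;
    *out = "f" + std::to_string(file_index) + "b" + std::to_string(produced++);
    --remaining;
    return true;
  }
};

// Hypothetical merge: keeps at most `max_subscriptions` files open and pulls
// from them round-robin. With max_subscriptions == 1 each file is drained
// before the next one is opened, i.e. the streams are simply concatenated.
std::vector<std::string> Merge(std::vector<FileSource> files,
                               std::size_t max_subscriptions) {
  if (max_subscriptions == 0) max_subscriptions = 1;
  std::vector<std::string> out;
  std::deque<FileSource> active;
  std::size_t next_file = 0;
  while (next_file < files.size() || !active.empty()) {
    // Subscribe to more files until the window is full.
    while (next_file < files.size() && active.size() < max_subscriptions) {
      active.push_back(files[next_file++]);
    }
    // Pull one batch from the front source, then rotate it to the back.
    FileSource source = active.front();
    active.pop_front();
    std::string batch;
    if (source.Next(&batch)) {
      out.push_back(batch);
      active.push_back(source);
    }
    // An exhausted source is dropped, freeing a slot for the next file.
  }
  return out;
}
```

With two files of two batches each, `Merge(files, 1)` yields the batches file by file, while `Merge(files, 2)` interleaves them, which is the behaviour an unordered scan can exploit to keep several files in flight. A real async version would additionally have to decide who is allowed to pull from the outer source concurrently, which is the reentrancy concern raised above.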

Reporter: Weston Pace / @westonpace
Assignee: Weston Pace / @westonpace

Note: This issue was originally created as ARROW-12386. Please see the migration documentation for further details.

asfimport commented 3 years ago

David Li / @lidavidm: Issue resolved by pull request 10076 https://github.com/apache/arrow/pull/10076