apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R] Add chunk_size to Table$create() #25384

Open asfimport opened 4 years ago

asfimport commented 4 years ago

While working on ARROW-3308, I noticed that write_feather has a chunk_size argument, which by default writes batches of 64k rows into the file. In principle, a chunking strategy like this would remove the need to bump up to large_utf8 when ingesting a large character vector, because you'd end up with many chunks that each fit into a regular utf8 type. However, as the function currently works, the data.frame is first converted to a Table in which every ChunkedArray contains a single chunk, and that is where the large_utf8 type gets set. If Table$create() could be instructed to make multiple chunks, this would be resolved.
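The policy being requested can be sketched language-agnostically: instead of ingesting the whole vector as one chunk, split it into fixed-size row batches so each chunk's cumulative string bytes can stay within utf8's 32-bit offsets. A minimal Python sketch (`chunk_to_batches` is a hypothetical helper for illustration, not an Arrow API):

```python
# Sketch of the chunking policy requested in this issue: split an input
# vector into batches of at most chunk_size rows before ingesting, so
# each resulting chunk can use 32-bit (utf8) offsets rather than forcing
# large_utf8. The 64k default mirrors write_feather's chunk_size; the
# function name itself is hypothetical, not an Arrow API.

def chunk_to_batches(values, chunk_size=64 * 1024):
    """Yield consecutive slices of `values` with at most chunk_size rows."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]

# With chunking, each chunk's string payload stays small even when the
# whole column would exceed the 2 GiB limit of 32-bit offsets.
rows = [f"row-{i}" for i in range(10)]
batches = list(chunk_to_batches(rows, chunk_size=4))
assert [len(b) for b in batches] == [4, 4, 2]
assert [r for b in batches for r in b] == rows  # no rows lost or reordered
```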

Reporter: Neal Richardson / @nealrichardson

Related issues:

Note: This issue was originally created as ARROW-9293. Please see the migration documentation for further details.

asfimport commented 4 years ago

Wes McKinney / @wesm: Makes sense to me

asfimport commented 3 years ago

Romain Francois / @romainfrancois: Assuming this comes after https://github.com/apache/arrow/pull/8650, it boils down to vec_to_arrow() accepting some sort of chunking policy, which in turn means that the converter API needs something similar (this is essentially https://issues.apache.org/jira/browse/ARROW-5628).

The API the converter currently goes through is:

    Status Extend(SEXP x, int64_t size) override;

which means "ingest x, which has this many elements". We need some way to express "ingest this range of elements from x". The Chunker class, at least in its current form, does not help.
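The gap described above can be sketched as the difference between a whole-vector entry point and a range-based one that a chunking policy can drive. A Python sketch, with hypothetical names standing in for the C++ converter internals (none of these are real Arrow APIs):

```python
# Hypothetical sketch of the converter-API change discussed above.
# Extend(x, size) ingests all of x at once; a range-based variant lets a
# chunking policy ingest the vector chunk by chunk. Class and method
# names are illustrative stand-ins, not Arrow's actual internals.

class ChunkingConverter:
    def __init__(self):
        self.chunks = []

    def extend_range(self, x, offset, length):
        """Ingest only x[offset:offset+length] as a single chunk."""
        self.chunks.append(list(x[offset:offset + length]))

    def ingest_chunked(self, x, chunk_size):
        """Chunking policy: repeatedly ingest chunk_size-row ranges."""
        for offset in range(0, len(x), chunk_size):
            self.extend_range(x, offset, min(chunk_size, len(x) - offset))
        return self.chunks

conv = ChunkingConverter()
chunks = conv.ingest_chunked(list(range(10)), chunk_size=4)
assert [len(c) for c in chunks] == [4, 4, 2]
```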

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: @romainfrancois Do you still plan to work on this?

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: cc @thisisnic

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.