apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Extending STL API to support row-wise conversion #22751

Open asfimport opened 5 years ago

asfimport commented 5 years ago

The documentation currently recommends array builders for converting row-wise data to Arrow tables. However, array builders have a low-level interface designed to support various use cases in the library. They require additional boilerplate due to type erasure, although some of this boilerplate could be avoided at compile time if the schema is already known and fixed (also discussed in ARROW-4067).

Elsewhere in the library, the STL API provides a nice abstraction over builders: it infers the data type and the builders from the values provided, which reduces the boilerplate significantly. It automatically converts tuples, currently with a limited set of native types: numeric types, string and vector (plus nullable variations of these if ARROW-6326 is merged). It also allows passing references in tuple values (implemented recently in ARROW-6284).

As a more concrete example, this code converts the data_row rows provided in the documentation examples:


arrow::Status VectorToColumnarTableSTL(const std::vector<struct data_row>& rows,
                                       std::shared_ptr<arrow::Table>* table) {
    auto rng = rows | ranges::views::transform([](const data_row& row) {
                   return std::tuple<int, double, const std::vector<double>&>(
                       row.id, row.cost, row.cost_components);
               });
    return arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rng,
                                           {"id", "cost", "cost_components"},
                                           table);
}

So, it allows more concise code for consumers of the API compared to using builders directly.
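To illustrate the kind of compile-time deduction this relies on, here is a minimal, self-contained sketch that does not use Arrow at all. The names `Columns`, `AppendRow` and `RowsToColumns` are hypothetical and invented for this illustration. It assumes C++17 and plain `std::vector` columns in place of Arrow builders. The point is only that the column storage can be derived from the tuple element types, without any runtime type dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical helper type, NOT part of Arrow: one std::vector per tuple
// element, with references and cv-qualifiers stripped from the element type.
template <typename... Ts>
using Columns = std::tuple<std::vector<std::decay_t<Ts>>...>;

// Append element I of `row` to column I of `cols`, for every index I.
template <typename Cols, typename Row, std::size_t... Is>
void AppendRow(Cols& cols, const Row& row, std::index_sequence<Is...>) {
  (std::get<Is>(cols).push_back(std::get<Is>(row)), ...);
}

// Transpose a vector of row tuples into a tuple of column vectors.
// The column types are fixed at compile time by the tuple signature.
template <typename... Ts>
Columns<Ts...> RowsToColumns(const std::vector<std::tuple<Ts...>>& rows) {
  static_assert(sizeof...(Ts) > 0, "at least one column is required");
  Columns<Ts...> cols;
  for (const auto& row : rows) {
    AppendRow(cols, row, std::index_sequence_for<Ts...>{});
  }
  // Every column received exactly one value per row.
  assert(std::get<0>(cols).size() == rows.size());
  return cols;
}
```

Calling `RowsToColumns` on a `std::vector<std::tuple<int, double>>` yields a `std::tuple<std::vector<int>, std::vector<double>>`. A real implementation would append to Arrow builders instead of vectors, but the index-sequence expansion over the tuple would look similar.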

The library has no direct support for other types (binary, struct, union, etc., or converting iterable objects other than vectors to lists). Users are given a way to provide specializations for their own data structures. One limitation of implicit inference is that in some cases it is hard (or even impossible) to infer the exact type to use. For example, should a std::string_view value be inferred as string, binary, large binary or list? This ambiguity can be avoided by giving the user a way to state explicitly which type a column should be stored as. For example, a user could return a so-called BinaryCell class to produce binary values.
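A rough sketch of how such an explicit wrapper could resolve the ambiguity is shown below. Everything here is hypothetical: `ColumnKind`, `ColumnKindOf` and `KindOfValue` are invented for this illustration, and the `BinaryCell` layout is only a guess at what the class mentioned above might look like. None of these are existing Arrow APIs. The sketch assumes C++17.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical set of logical column types, for illustration only.
enum class ColumnKind { kInt64, kDouble, kString, kBinary };

// Hypothetical wrapper: carries the same bytes as a std::string_view, but its
// distinct type lets the user request binary storage explicitly.
struct BinaryCell {
  std::string_view bytes;
};

// The primary template is declared but not defined, so value types without a
// specialization are rejected at compile time.
template <typename T>
struct ColumnKindOf;

template <>
struct ColumnKindOf<std::int64_t> {
  static constexpr ColumnKind value = ColumnKind::kInt64;
};

template <>
struct ColumnKindOf<double> {
  static constexpr ColumnKind value = ColumnKind::kDouble;
};

// A bare string_view defaults to a string column in this sketch.
template <>
struct ColumnKindOf<std::string_view> {
  static constexpr ColumnKind value = ColumnKind::kString;
};

// Wrapping the same data in BinaryCell selects a binary column instead.
template <>
struct ColumnKindOf<BinaryCell> {
  static constexpr ColumnKind value = ColumnKind::kBinary;
};

// Deduce the column kind from a value's static type.
template <typename T>
constexpr ColumnKind KindOfValue(const T&) {
  return ColumnKindOf<T>::value;
}
```

With this shape, a row lambda returning `std::string_view` and one returning `BinaryCell` map to different column kinds even though the underlying bytes are identical, and the choice is made entirely at compile time.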

Proposed changes:

Reporter: Omer Ozarslan / @ozars

Note: This issue was originally created as ARROW-6377. Please see the migration documentation for further details.

asfimport commented 5 years ago

Omer Ozarslan / @ozars: On a side note, this might also perform better thanks to the use of compile-time knowledge, but that ultimately comes down to benchmarking.