Closed cocoa-xu closed 4 months ago
I think what you might be looking for is ArrowArrayFinishElement() (https://arrow.apache.org/nanoarrow/latest/reference/c.html#_CPPv423ArrowArrayFinishElementP10ArrowArray).
struct ArrowSchema schema{};
struct ArrowArray array{};
struct ArrowError error{};
// Make the schema
NANOARROW_RETURN_NOT_OK(ArrowSchemaInit(&schema));
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(&schema, NANOARROW_TYPE_LIST));
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema.children[0], NANOARROW_TYPE_INT32));
// Build the array
NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromSchema(&array, &schema, &error));
NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(&array));
// First element
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 1));
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 2));
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 3));
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(&array));
// Second element
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 4));
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 5));
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 6));
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(&array));
// Finish the outer array
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(&array, &error));
Using the ArrowArrayAppendXX() functions can be nice because they protect you from various types of errors (e.g., they will error if you try to append something outside the integer range of the type you're working with). Another option (that requires a little more knowledge of the specification) is to build by buffer:
struct ArrowSchema schema{};
struct ArrowArray array{};
struct ArrowError error{};
// Make the schema
NANOARROW_RETURN_NOT_OK(ArrowSchemaInit(&schema));
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(&schema, NANOARROW_TYPE_LIST));
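NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema.children[0], NANOARROW_TYPE_INT32));
// Build the values buffer (a buffer must be initialized before appending to it)
struct ArrowBuffer values;
ArrowBufferInit(&values);
for (int i = 1; i <= 6; i++) {
  NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(&values, i));
}
// Build the offsets buffer: element i spans values[offsets[i], offsets[i + 1])
struct ArrowBuffer offsets;
ArrowBufferInit(&offsets);
NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(&offsets, 0));
NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(&offsets, 3));
NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(&offsets, 6));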
// Build the array by buffer
NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromSchema(&array, &schema, &error));
NANOARROW_RETURN_NOT_OK(ArrowArraySetBuffer(&array, 1, &offsets));
NANOARROW_RETURN_NOT_OK(ArrowArraySetBuffer(array.children[0], 1, &values));
array.length = 2;
array.children[0]->length = 6;
array.null_count = 0;
array.children[0]->null_count = 0;
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(&array, &error));
The nice part about the "build by buffer" method is that you can build the buffers independently (e.g., you can wrap an std::vector as an ArrowBuffer). If you're in C++ or something else with templates, this can often take care of the same types of things that the ArrowArrayAppendXX() functions do.
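To make the "build the buffers independently" idea concrete, here is a minimal sketch in plain C++ that does not use nanoarrow at all. `ListBuffers` and `AppendList` are hypothetical names invented for this illustration. The sketch assumes a signed integer child type and only shows how the offsets and values of a list array relate, plus a range check in the spirit of what the append functions provide; it is not how nanoarrow implements them.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical helper (not part of nanoarrow): the two flat buffers of a list<T>.
template <typename T>
struct ListBuffers {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "this sketch assumes a signed integer child type");
  std::vector<int32_t> offsets{0};  // offsets always start with a leading 0
  std::vector<T> values;            // all child values, concatenated
};

// Append one list element, range-checking every value against T.
// On failure the partially appended values are rolled back.
template <typename T>
void AppendList(ListBuffers<T>& out, const std::vector<int64_t>& element) {
  const std::size_t start = out.values.size();
  for (int64_t v : element) {
    if (v < std::numeric_limits<T>::min() || v > std::numeric_limits<T>::max()) {
      out.values.resize(start);
      throw std::out_of_range("value does not fit in the child type");
    }
    out.values.push_back(static_cast<T>(v));
  }
  // The end of this element is the current length of the values buffer.
  out.offsets.push_back(static_cast<int32_t>(out.values.size()));
}
```

Calling `AppendList` with `{1, 2, 3}` and then `{4, 5, 6}` on a `ListBuffers<int32_t>` leaves the offsets at 0, 3, 6 and the values at 1 through 6, which matches the buffers built by hand in the example above.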
I hope all of that helps!
Massive thanks @paleolimbot! I didn't really know about ArrowArrayFinishElement before. This example definitely helps a lot! I'll probably stick with ArrowArrayAppendXX() for now and see whether it's worth switching to the "build by buffer" method for my use case.
Hi @paleolimbot, many thanks for your previous help, and sorry for pinging you again. May I ask one more question about constructing a nested list? I've been trying to achieve this for the past few days but still have no clue how to get it right.
Let's say now each row is a list<list<int32>>, and the goal is to construct the following query results:
| Row ID | data |
|---|---|
| 0 | [[1,2,3], [4,5,6]] |
| 1 | [[2,3,4], [5,6,7]] |
Following the approach using ArrowArrayAppendXX() functions, I wrote the following function:
My understanding is that we can first make a schema for the outer and middle arrays:
// Make the schema for the outer and middle array
NANOARROW_RETURN_NOT_OK(ArrowSchemaInit(&schema));
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(&schema, NANOARROW_TYPE_LIST));
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema.children[0], NANOARROW_TYPE_LIST));
Then we initialise the middle array using ArrowArrayInitFromSchema,
// Build the outer and middle array
NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromSchema(array, schema, arrow_error));
NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array));
and pass schema->children[0] and array->children[0] to make_nested_list with level+1 for row 0 and row 1 respectively. In theory, it should construct the inner arrays in array->children[0].
// First row
int row_id = 0;
NANOARROW_RETURN_NOT_OK(make_nested_list(schema->children[0], array->children[0], arrow_error, level + 1, row_id));
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(array));
// Second row
row_id = 1;
NANOARROW_RETURN_NOT_OK(make_nested_list(schema->children[0], array->children[0], arrow_error, level + 1, row_id));
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(array));
But I got an error message saying "Error parsing schema->format: Expected a null-terminated string but found NULL" when constructing the outer and middle array (i.e., when level == 1) from ArrowArrayInitFromSchema(array, schema, arrow_error).
Could you please shed some light on this and let me know which functions I should use? (Or am I fundamentally wrong about how one should construct a nested Arrow array?)
I can work up a minimal example tomorrow, but the error message you described is occurring because there is an ArrowSchema* on which ArrowSchemaSetType() was never called (probably the innermost list).
> Following the approach using ArrowArrayAppendXX() functions, I wrote the following function:
I'll try to illustrate this in an actual building example tomorrow, but the first thing I would do is to split up the function that builds the ArrowSchema and the function that populates the ArrowArray (which may not be the source of the error but is probably closer to what you would do in actual code anyway).
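As a rough illustration of what the finished list<list<int32>> array has to contain, the sketch below builds the three flat buffers for the two rows in the question using plain C++ only. `NestedListBuffers` and `AppendRow` are hypothetical names made up for this example and are not part of nanoarrow. The sketch only mirrors the order of the "append values, finish the inner element, finish the outer element" steps.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a list<list<int32>> array: three flat buffers.
struct NestedListBuffers {
  std::vector<int32_t> outer_offsets{0};  // one entry per row, after the leading 0
  std::vector<int32_t> inner_offsets{0};  // one entry per inner list, after the leading 0
  std::vector<int32_t> values;            // all int32 values, concatenated
};

// Append one row of the outer list.
void AppendRow(NestedListBuffers& out,
               const std::vector<std::vector<int32_t>>& row) {
  for (const auto& inner : row) {
    out.values.insert(out.values.end(), inner.begin(), inner.end());
    // "Finish" the inner list: record the current length of the values buffer.
    out.inner_offsets.push_back(static_cast<int32_t>(out.values.size()));
  }
  // "Finish" the row: record how many inner lists exist so far.
  out.outer_offsets.push_back(
      static_cast<int32_t>(out.inner_offsets.size() - 1));
}
```

With the append API, these steps would presumably correspond to ArrowArrayAppendInt() on array.children[0]->children[0], ArrowArrayFinishElement() on array.children[0] after each inner list, and ArrowArrayFinishElement() on the outer array after each row. The schema would likewise need a third ArrowSchemaSetType() call on schema.children[0]->children[0] for the int32 values, which is consistent with the error message described above.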
Thank you very much for the kind help!
Hi, I was wondering what's the proper/most recommended way to construct a list? Let's say I'd like to construct a list of int32 (i.e., each row is a list of int32), something like
Here is what I'm doing for now:
And now I'm not sure how I can properly append a list of int32 (i.e., int32_values) to values. Should I use low-level functions to construct the list (values) myself, or am I missing something in my code? Any suggestions/examples would be highly appreciated. :)