apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

What's the proper/most recommended way to construct a list? #482

Closed cocoa-xu closed 4 months ago

cocoa-xu commented 4 months ago

Hi, I was wondering what's the proper/most recommended way to construct a list? Let's say I'd like to construct a list of int32 (i.e., each row is a list of int32), something like

| Row ID | data    |
|--------|---------|
| 0      | [1,2,3] |
| 1      | [4,5,6] |

Here is what I'm doing for now:

// create the first row of int32, [1,2,3]
struct ArrowSchema int32_schema{};
struct ArrowArray int32_values{};
struct ArrowError arrow_error{};

ArrowSchemaInit(&int32_schema);
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(&int32_schema, NANOARROW_TYPE_INT32));
NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromSchema(&int32_values, &int32_schema, &arrow_error));
NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(&int32_values));
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(&int32_values, 1));
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(&int32_values, 2));
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(&int32_values, 3));
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(&int32_values, &arrow_error));

// create list of int32
struct ArrowSchema schema{};
struct ArrowArray values{};

ArrowSchemaInit(&schema);
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(&schema, NANOARROW_TYPE_LIST));
NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(&values, NANOARROW_TYPE_LIST));

And now I'm not sure how I can properly append a list of int32 (i.e., int32_values) to values. Should I use low-level functions like

to construct the list (values) myself, or am I missing something in my code? Any suggestions/examples would be highly appreciated. :)

paleolimbot commented 4 months ago

I think what you might be looking for is ArrowArrayFinishElement() ( https://arrow.apache.org/nanoarrow/latest/reference/c.html#_CPPv423ArrowArrayFinishElementP10ArrowArray ).

struct ArrowSchema schema{};
struct ArrowArray array{};
struct ArrowError error{};

// Make the schema
ArrowSchemaInit(&schema);  // returns void, so it is not wrapped in the macro
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(&schema, NANOARROW_TYPE_LIST));
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema.children[0], NANOARROW_TYPE_INT32));

// Build the array
NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromSchema(&array, &schema, &error));
NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(&array));

// First element
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 1));
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 2));
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 3));
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(&array));

// Second element
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 4));
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 5));
NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array.children[0], 6));
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(&array));

// Finish the outer array
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(&array, &error));
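To make the append/finish pattern above easier to picture, here is a simplified model of the bookkeeping for a list array, written with plain std::vector. It is only a sketch: ListInt32Model, AppendChild, FinishElement, and BuildTwoRows are hypothetical names and not part of nanoarrow, and the model assumes that finishing an element amounts to recording the child's current length as the next offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a list<int32> builder; not a nanoarrow type.
struct ListInt32Model {
  std::vector<int32_t> child;       // plays the role of array.children[0]
  std::vector<int32_t> offsets{0};  // the first offset is always 0
  int64_t length = 0;               // number of finished list elements

  // Analogous to appending a value to the child array.
  void AppendChild(int32_t value) { child.push_back(value); }

  // Analogous to finishing an element: the child's current length becomes
  // the end offset of the element that was just completed.
  void FinishElement() {
    assert(child.size() >= static_cast<std::size_t>(offsets.back()));
    offsets.push_back(static_cast<int32_t>(child.size()));
    ++length;
  }
};

// Builds the two rows from the question: [1,2,3] and [4,5,6].
inline ListInt32Model BuildTwoRows() {
  ListInt32Model model;
  for (int32_t v : {1, 2, 3}) model.AppendChild(v);
  model.FinishElement();
  for (int32_t v : {4, 5, 6}) model.AppendChild(v);
  model.FinishElement();
  return model;
}
```

In this model, the two rows end up as one child buffer holding 1 through 6 and an offsets buffer of 0, 3, 6, which matches the Arrow list layout.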

Using the ArrowArrayAppendXX() functions can be nice because they protect you from various types of errors (e.g., they will error if you try to append something outside the integer range of the type you're working with). Another option (that requires a little more knowledge of the specification) is to build by buffer:

struct ArrowSchema schema{};
struct ArrowArray array{};
struct ArrowError error{};

// Make the schema
ArrowSchemaInit(&schema);  // returns void, so it is not wrapped in the macro
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(&schema, NANOARROW_TYPE_LIST));
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema.children[0], NANOARROW_TYPE_INT32));

// Child values: 1..6, stored contiguously
struct ArrowBuffer values;
ArrowBufferInit(&values);
for (int i = 1; i <= 6; i++) {
  NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(&values, i));
}

// List offsets: row 0 is values[0, 3), row 1 is values[3, 6)
struct ArrowBuffer offsets;
ArrowBufferInit(&offsets);
NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(&offsets, 0));
NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(&offsets, 3));
NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(&offsets, 6));

// Build the array by buffer
NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromSchema(&array, &schema, &error));
NANOARROW_RETURN_NOT_OK(ArrowArraySetBuffer(&array, 1, &offsets));
NANOARROW_RETURN_NOT_OK(ArrowArraySetBuffer(array.children[0], 1, &values));
array.length = 2;
array.children[0]->length = 6;
array.null_count = 0;
array.children[0]->null_count = 0;

NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(&array, &error));

The nice part about the "build by buffer" method is that you can build the buffers independently (e.g., you can wrap a std::vector as an ArrowBuffer). If you're in C++ or something else with templates, this can often take care of the same types of things that the ArrowArrayAppendXX() functions do.
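As a rough illustration of how templates can stand in for the range checks, the sketch below flattens rows into an offsets vector and a values vector while rejecting values that do not fit the element type. ListBuffers and FlattenRows are hypothetical names, not nanoarrow API, and the sketch stops before handing the vectors to nanoarrow; it assumes signed integer element types only.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <limits>
#include <type_traits>
#include <vector>

// Hypothetical container for the two buffers of a list array.
template <typename T>
struct ListBuffers {
  std::vector<int32_t> offsets;  // n_rows + 1 entries, starting at 0
  std::vector<T> values;         // all child values, concatenated
};

// Flattens rows of int64_t into list buffers with element type T.
// Returns 0 on success, or EINVAL if a value does not fit in T.
template <typename T>
int FlattenRows(const std::vector<std::vector<int64_t>>& rows,
                ListBuffers<T>* out) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "this sketch only handles signed integer element types");
  assert(out != nullptr);
  out->offsets.assign(1, 0);
  out->values.clear();
  for (const auto& row : rows) {
    for (int64_t v : row) {
      if (v < std::numeric_limits<T>::min() ||
          v > std::numeric_limits<T>::max()) {
        return EINVAL;  // out of range for the element type
      }
      out->values.push_back(static_cast<T>(v));
    }
    // The end offset of this row is the number of values written so far.
    out->offsets.push_back(static_cast<int32_t>(out->values.size()));
  }
  return 0;
}
```

For the rows [1,2,3] and [4,5,6] with T = int32_t, this yields offsets 0, 3, 6 and values 1 through 6, the same contents that the buffers in the example above are filled with by hand.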

I hope all of that helps!

cocoa-xu commented 4 months ago

Massive thanks @paleolimbot! I didn't really know about ArrowArrayFinishElement() before. This example definitely helps a lot! I'll probably stick with ArrowArrayAppendXX() for now and see whether it's worth switching to the "build by buffer" method for my use case.

cocoa-xu commented 4 months ago

Hi @paleolimbot, many thanks for your previous help, and sorry for the ping again. May I ask one more question about constructing a nested list? I've been trying to achieve this for the past few days but still have no clue how to get it right.

Let's say now each row is a list<list<int32>>, and the goal is to construct the following query results:

| Row ID | data               |
|--------|--------------------|
| 0      | [[1,2,3], [4,5,6]] |
| 1      | [[2,3,4], [5,6,7]] |

Following the approach using ArrowArrayAppendXX() functions, I wrote the following function:

Minimal code

```cpp
// | Row ID | data               |
// |--------|--------------------|
// | 0      | [[1,2,3], [4,5,6]] |
// | 1      | [[2,3,4], [5,6,7]] |

#define NESTING_LEVEL 2

int make_nested_list(struct ArrowSchema* schema, struct ArrowArray* array,
                     struct ArrowError* arrow_error, int level, int row_id) {
  printf("make_nested_list:level: %d\n", level);
  if (level == NESTING_LEVEL) {
    // level == 2
    // [
    //   [         <- schema
    //     int32   <- schema->children[0]
    //   ]
    // ]
    //
    // Make the schema
    ArrowSchemaInit(schema);
    NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema, NANOARROW_TYPE_LIST));
    NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_INT32));

    // Build the array
    NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromSchema(array, schema, arrow_error));
    NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array));

    // First element
    // [1,2,3] if row_id == 0
    // [2,3,4] if row_id == 1
    NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array->children[0], 1 + row_id));
    NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array->children[0], 2 + row_id));
    NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array->children[0], 3 + row_id));
    NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(array));

    // Second element
    // [4,5,6] if row_id == 0
    // [5,6,7] if row_id == 1
    NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array->children[0], 4 + row_id));
    NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array->children[0], 5 + row_id));
    NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array->children[0], 6 + row_id));
    NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(array));

    // Finish the outer array
    // [[1,2,3], [4,5,6]] if row_id == 0
    // [[2,3,4], [5,6,7]] if row_id == 1
    NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array, arrow_error));
  } else {
    // level == 1
    // [         <- schema
    //   [       <- schema->children[0]
    //     int32
    //   ]
    // ]
    // Make the schema for the outer and middle array
    ArrowSchemaInit(schema);
    NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema, NANOARROW_TYPE_LIST));
    NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_LIST));

    // Build the outer and middle array
    NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromSchema(array, schema, arrow_error));
    NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array));

    // First row
    int row_id = 0;
    NANOARROW_RETURN_NOT_OK(make_nested_list(schema->children[0], array->children[0], arrow_error, level + 1, row_id));
    NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(array));

    // Second row
    row_id = 1;
    NANOARROW_RETURN_NOT_OK(make_nested_list(schema->children[0], array->children[0], arrow_error, level + 1, row_id));
    NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(array));

    // Finish the outer array
    // [                      <- outer array
    //   [[1,2,3], [4,5,6]],  <- first row in the query result
    //   [[2,3,4], [5,6,7]]   <- second row in the query result
    // ]
    NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array, arrow_error));
  }
  return 0;
}

struct ArrowSchema schema{};
struct ArrowArray array{};
struct ArrowError error{};
int level = 1;
make_nested_list(&schema, &array, &error, level, 0);
```

My understanding is that, we can first make a schema for the outer and middle array:

// Make the schema for the outer and middle array
ArrowSchemaInit(schema);
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema, NANOARROW_TYPE_LIST));
NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_LIST));

Then we initialise the middle array using ArrowArrayInitFromSchema,

// Build the outer and middle array
NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromSchema(array, schema, arrow_error));
NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array));

and pass schema->children[0] and array->children[0] to make_nested_list with level + 1 for row 0 and row 1 respectively. In theory, this should construct the inner arrays in array->children[0].

// First row
int row_id = 0;
NANOARROW_RETURN_NOT_OK(make_nested_list(schema->children[0], array->children[0], arrow_error, level + 1, row_id));
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(array));

// Second row
row_id = 1;
NANOARROW_RETURN_NOT_OK(make_nested_list(schema->children[0], array->children[0], arrow_error, level + 1, row_id));
NANOARROW_RETURN_NOT_OK(ArrowArrayFinishElement(array));

But I got the error message "Error parsing schema->format: Expected a null-terminated string but found NULL" when constructing the outer and middle array (i.e., when level == 1). It comes from the call ArrowArrayInitFromSchema(array, schema, arrow_error).

Could you please shed some light on this and let me know which functions I should use? (Or am I fundamentally wrong about how a nested Arrow array should be constructed?)

paleolimbot commented 4 months ago

I can work up a minimal example tomorrow, but the error message you described is occurring because there is an ArrowSchema* on which ArrowSchemaSetType() was never called (probably the innermost list).
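One way to track down which node is missing its type is to walk the schema tree and report the first node whose format is still NULL. The helper below is a hypothetical debugging sketch, not a nanoarrow function. It uses only the struct layout from the Arrow C Data Interface, which is declared here only when no header has already provided it (so in real code it would go after including nanoarrow.h).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Struct layout from the Arrow C Data Interface. Skipped when a header
// such as nanoarrow.h has already defined it.
#ifndef ARROW_C_DATA_INTERFACE
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};
#endif

// Hypothetical helper: returns the path of the first node whose format
// string is NULL, or an empty string if every node has a format.
inline std::string FindUnsetFormat(const struct ArrowSchema* schema,
                                   const std::string& path = "schema") {
  assert(schema != nullptr);
  if (schema->format == nullptr) return path;
  for (int64_t i = 0; i < schema->n_children; i++) {
    std::string child_path = path + "->children[" + std::to_string(i) + "]";
    std::string found = FindUnsetFormat(schema->children[i], child_path);
    if (!found.empty()) return found;
  }
  return "";
}
```

For a list-of-list schema where only the outer two levels were given a type, this would be expected to report schema->children[0]->children[0], which is the node that still needs ArrowSchemaSetType().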

> Following the approach using ArrowArrayAppendXX() functions, I wrote the following function:

I'll try to illustrate this with an example that actually builds tomorrow, but the first thing I would do is split the function that builds the ArrowSchema from the function that populates the ArrowArray (the combined function may not be the source of the error, but the split is probably closer to what you would do in actual code anyway).
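Until a nanoarrow example is available, it may help to see what the finished list<list<int32>> array has to contain. The sketch below uses plain std::vector and hypothetical names (NestedListBuffers, FlattenNested), none of which come from nanoarrow. It shows that both rows share one set of inner offsets and one values buffer, which is one reason to populate a single array tree rather than re-initialising the child for each row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical container for the buffers of a list<list<int32>> array.
struct NestedListBuffers {
  std::vector<int32_t> outer_offsets;  // row i spans inner lists [o[i], o[i+1])
  std::vector<int32_t> inner_offsets;  // inner list j spans values [o[j], o[j+1])
  std::vector<int32_t> values;         // all int32 values, concatenated
};

// Flattens rows of lists of int32 into the three buffers above.
inline NestedListBuffers FlattenNested(
    const std::vector<std::vector<std::vector<int32_t>>>& rows) {
  NestedListBuffers out;
  out.outer_offsets.push_back(0);
  out.inner_offsets.push_back(0);
  const std::size_t max_offset =
      static_cast<std::size_t>(std::numeric_limits<int32_t>::max());
  for (const auto& row : rows) {
    for (const auto& inner : row) {
      out.values.insert(out.values.end(), inner.begin(), inner.end());
      assert(out.values.size() <= max_offset);
      // "Finish" the inner list: record where its values end.
      out.inner_offsets.push_back(static_cast<int32_t>(out.values.size()));
    }
    // "Finish" the row: record how many inner lists exist so far.
    out.outer_offsets.push_back(
        static_cast<int32_t>(out.inner_offsets.size() - 1));
  }
  return out;
}
```

For the two rows in the question, this gives outer offsets 0, 2, 4, inner offsets 0, 3, 6, 9, 12, and twelve values in a single buffer.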

cocoa-xu commented 4 months ago

Thank you very much for the kind help!