apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0
173 stars 38 forks source link

[C] Support for unions #73

Closed lidavidm closed 1 year ago

lidavidm commented 1 year ago

While working through the ADBC SQLite driver, I noticed dense union support isn't available. I've also found that writing out the code for maps and lists by hand is a little tedious; would helper functions be useful? (For example, ArrowSchemaInitList(struct ArrowSchema*, enum ArrowType list_type, enum ArrowType element_type, char* list_field_name), where list_field_name == NULL means to insert the default field name. Or maybe even omit that and require you to build it by hand if you need to override that.) It would be especially helpful for maps, though we'd have to be careful to make sure this keeps composing well if you have something like dense_union<..., 4: map<int32, list<int32>>> (which is in the ADBC spec at one point)

lidavidm commented 1 year ago

This could also be split into stages (e.g. just barely enough to initialize a schema by hand at first, then helper functions, appender support, etc.)

paleolimbot commented 1 year ago

Definitely! Unions aren't supported in the R package so I'm not familiar with the details. Something that would help me with that is the output of some Python code that constructs some real-world union types and/or arrays.

lidavidm commented 1 year ago

Ok, I'll see if I can send a PR your way soon. Thanks!

(I don't think they come up very often to be honest, except in the case of databases trying to handle ambiguously typed data, or trying to shove multiply-typed data into a single column. IIRC DuckDB represents it as "just a struct" with some conventions which I think makes it a bit easier to handle without special cases...)

paleolimbot commented 1 year ago

A PR would be great! If you don't get there, an example of a real-world union you'd like to create would do and I can figure something out. This is about to come up in GeoArrow because we need it to represent multiple geometry types.

One of the things I haven't understood in the union is the purpose of the type_ids in the schema: wouldn't either the child names or the child indices suffice?

lidavidm commented 1 year ago

Child names aren't necessarily unique.

Child indices would generally work. I don't know the original justification off the top of my head, but type IDs as separate from indices would let you subset or rearrange a union without rewriting the type IDs array. In practice I don't think I've ever run into a case where type IDs != indices.

lidavidm commented 1 year ago

https://issues.apache.org/jira/browse/ARROW-257 is the actual reason. It was to resolve an early Java/C++ incompatibility: Java used the implementation-defined type ID as the union index, C++ used the child index (as you suggest), and the indirection would allow the two to be compatible. Also you could define a union with more cases than physically present.

lidavidm commented 1 year ago

It seems a little odd, it'd probably be worth rewriting the type IDs buffer unless you were doing more writes than reads? But since the lookup vector is presumably small, it might not matter in practice (and so rewriting the type IDs might not actually be worth it).

paleolimbot commented 1 year ago

Ok, that makes more sense.

For producing, maybe just add a ArrowSchemaInitUnion(schema, n_children) or something that generates the 0,1,2,3,...,n_children string and calls ArrowSchemaAllocateChildren()? For Arrays, maybe ArrowArrayFinishElementUnion(array, char which_child) (called after ArrowArrayAppendSomething(array->children[which_child], ...))?

For consuming, we could add a int8_t* union_type_ids field to the ArrowArrayView that is allocated for union types and is a home for the parsed values? (Or whatever cached lookup table is needed...I'm not sure my understanding of non-standard type ids is all there). And maybe a helper such that ArrowArrayViewGetUnionChildId(array_view, i) will get the correct index into array_view->children[]/schema->children[]?

A similar helper for lists, as you suggest, would be nice for the schemas. Off the top of my head, maybe ArrowSchemaInit(schema, NANOARROW_TYPE_(LARGE_)LIST) should just automatically do ArrowSchemaAllocateChildren(schema, 1) and ArrowSchemaSetName(schema->children[0], "item")? That would mean callers would have to remember to ArrowSchemaInit(schema->children[0]) but that seems more flexible than squashing the child type down to an ID lest it be something like a decimal or fixed-sized binary that can't be initialized with just an ID.

Perhaps ArrowSchemaInit(schema, NANOARROW_TYPE_MAP) could similarly allocate the proper child structure but make the caller do ArrowSchemaInit(schema->children[0]->children[0], key_type) and ArrowSchemaInit(schema->children[0]->children[1], value_type)? Still a little verbose, but more composable than just ArrowSchemaInitMap(schema, key_type_id, value_type_id).

(Also happy to do the ID-based constructors...the 4 people in the world that are creating recursively nested arrays in C probably won't mind the extra few lines it takes to do this without the convenience functions).

lidavidm commented 1 year ago

I like your idea more than my original idea, it composes better. I suppose for unions where type IDs != child indices, we can add a separate function (taking int8_t* or something?)