apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0
151 stars 34 forks source link

refactor(python): Reorganize strategies for building arrays #444

Closed paleolimbot closed 2 months ago

paleolimbot commented 2 months ago

This PR reorganizes the implementation for na.c_array(<something>). The motivation here is that there are some Arrow types that aren't supported when creating Arrow arrays (e.g., decimal), and some container types that are not supported (e.g., data frame protocol). I don't think those should be supported right now but I would like there to be an obvious route to adding support. The ArrayBuilder class and the dispatch code that chooses the appropriate class/method is probably not perfect but does provide a more reasonable path to adding support for a new type (add a method to the ArrayFromIterableBuilder class) or new container (add a new ArrayBuilder subclass).

In order to avoid a circular import, this would all have had to be added to c_lib.py. Instead of doing a lazy import, I separated c_lib.py into c_schema.py, c_buffer.py, c_array.py, and c_array_stream.py. The tests for these functions had already been separated and I quite like the organization (reasonably obvious which file you might find the c_array() function in, for example).

paleolimbot commented 2 months ago

Thank you for the review!

Is the code from c_lib.py purely a relocation?

The code within c_array.py was also rearranged (in addition to being moved from c_lib.py). Essentially, the class ArrayBuilder is new (although the implementations wrapped by those classes are not new).

jorisvandenbossche commented 2 months ago

Instead of doing a lazy import, I separated c_lib.py into c_schema.py, c_buffer.py, c_array.py, and c_array_stream.py

FWIW it might have been easier to do this splitting in a separate PR first with just splitting/reorganising code, which would make the diff of the new stuff here easier to review (at this point it is probably harder to still do that, so let it be ;))

jorisvandenbossche commented 2 months ago

The file re-org certainly makes sense! I have to say that I find the logic in the ArrayBuilder (sub)classes a bit harder to follow (e.g. compared to the current _c_array_from_iterable function)

paleolimbot commented 2 months ago

I have to say that I find the logic in the ArrayBuilder (sub)classes a bit harder to follow (e.g. compared to the current _c_array_from_iterable function)

I agree that the original logic was clearer; however, the way it was implemented would never have scaled to support (say) list<int32>. I am not sure that the ArrayBuilder is the final answer to this but it is a step closer towards something that will scale to arbitrary recursive types/string arrays that need to be split into chunks because they contain more than 2 GB of text.