Closed paleolimbot closed 2 months ago
Thank you for the review!
Is the code from c_lib.py purely a relocation?
The code within c_array.py
was also rearranged (in addition to being moved from c_lib.py
). Essentially, the class ArrayBuilder
is new (although the implementations wrapped by those classes are not new).
Instead of doing a lazy import, I separated
c_lib.py
intoc_schema.py
,c_buffer.py
,c_array.py
, andc_array_stream.py
FWIW it might have been easier to do this splitting in a separate PR first with just splitting/reorganising code, which would make the diff of the new stuff here easier to review (at this point it is probably harder to still do that, so let it be ;))
The file re-org certainly makes sense! I have to say that I find the logic in the ArrayBuilder (sub)classes a bit harder to follow (e.g. compared to the current _c_array_from_iterable
function)
I have to say that I find the logic in the ArrayBuilder (sub)classes a bit harder to follow (e.g. compared to the current _c_array_from_iterable function)
I agree that the original logic was clearer; however, the way it was implemented would never have scaled to support (say) list<int32>
. I am not sure that the ArrayBuilder
is the final answer to this but it is a step closer towards something that will scale to arbitrary recursive types/string arrays that need to be split into chunks because they contain more than 2 GB of text.
This PR reorganizes the implementation for
na.c_array(<something>)
. The motivation here is that there are some Arrow types that aren't supported when creating Arrow arrays (e.g., decimal), and some container types that are not supported (e.g., data frame protocol). I don't think those should be supported right now but I would like there to be an obvious route to adding support. TheArrayBuilder
class and the dispatch code that chooses the appropriate class/method is probably not perfect but does provide a more reasonable path to adding support for a new type (add a method to theArrayFromIterableBuilder
class) or new container (add a newArrayBuilder
subclass).In order to avoid a circular import, this would all have had to be added to
c_lib.py
. Instead of doing a lazy import, I separatedc_lib.py
intoc_schema.py
,c_buffer.py
,c_array.py
, andc_array_stream.py
. The tests for these functions had already been separated and I quite like the organization (reasonably obvious which file you might find thec_array()
function in, for example).