apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0
149 stars 34 forks source link

refactor(python): Extract utility functions into _utils.pyx #529

Closed paleolimbot closed 2 weeks ago

paleolimbot commented 2 weeks ago

This PR moves a number of utility functions that were relatively self-contained into a new module, _utils.pyx.

TODO:

paleolimbot commented 2 weeks ago

Thank you for the review!

Assuming the code changes in _lib.pyx and _utils.pyx are just copying from one place to another

Yes! (with extra documentation added and few typos fixed)

you could also include the utils file in _lib.pyx to keep it as a single module in practice

Is keeping the single module helpful? (I'd assumed that since pyarrow was trying to split up its lib.pyx that it might be good to proactively split up this one before any more features were added; however, I might have misunderstood pyarrow's motivation).

jorisvandenbossche commented 2 weeks ago

Is keeping the single module helpful? (I'd assumed that since pyarrow was trying to split up its lib.pyx that it might be good to proactively split up this one before any more features were added; however, I might have misunderstood pyarrow's motivation).

Compilation time is a motivation for pyarrow AFAIK, which might not really be a concern (yet?) for nanoarrow.

I was wondering a bit to what extent making this into two separate modules would increase the library size of both including nanoarrow c parts, but not entirely sure how this works

paleolimbot commented 2 weeks ago

Compilation time is a motivation for pyarrow AFAIK, which might not really be a concern (yet?) for nanoarrow.

It's starting to get a tiny bit uncomfortable (but this is all relative...it's still only a few seconds!). Part of that is also the fault of the bootstrap (which results in everything getting recompiled always, maybe because it updates the file modified time).

I was wondering a bit to what extent making this into two separate modules would increase the library size of both including nanoarrow c parts, but not entirely sure how this works

That's a great point. We could be smarter about building a shared nanoarrow.so and linking to it instead of giving each module its own copy. I ran a check and it does seem like this PR adds about 100KB to the binary size. I think that is OK for now (and we can revisit some strategies for making that smaller if after separating the buffer, schema, array, and array stream portions it becomes unwieldy).

Before this PR (total: 1158K)

(.venv) dewey@dewey-mac python % ls -lh src/nanoarrow/*.so
-rwxr-xr-x  1 dewey  staff   242K Jun 19 10:00 src/nanoarrow/_ipc_lib.cpython-312-darwin.so
-rwxr-xr-x  1 dewey  staff   916K Jun 19 10:00 src/nanoarrow/_lib.cpython-312-darwin.so

After this PR (1264K)

(.venv) dewey@dewey-mac python % ls -lh src/nanoarrow/*.so
-rwxr-xr-x  1 dewey  staff   242K Jun 19 10:03 src/nanoarrow/_ipc_lib.cpython-312-darwin.so
-rwxr-xr-x  1 dewey  staff   857K Jun 19 10:02 src/nanoarrow/_lib.cpython-312-darwin.so
-rwxr-xr-x  1 dewey  staff   165K Jun 19 10:02 src/nanoarrow/_utils.cpython-312-darwin.so
(.venv) dewey@dewey-mac python % 
jorisvandenbossche commented 2 weeks ago

I ran a check and it does seem like this PR adds about 100KB to the binary size. I think that is OK for now (and we can revisit some strategies for making that smaller if after separating the buffer, schema, array, and array stream portions it becomes unwieldy).

Sounds good