apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0
169 stars 35 forks source link

feat: String/Binary View Support #596

Closed WillAyd closed 1 week ago

WillAyd commented 1 month ago

closes https://github.com/apache/arrow-nanoarrow/issues/583

WillAyd commented 1 month ago

Hmm looks like the Python process whereby we generate the pxd file does not play nice with the union containing nested struct declarations

paleolimbot commented 1 month ago

I'm on vacation at the moment but I believe you just have to give the anonymous structures names (i.e. declare them as top level types). This is how you would have to do it in cython anyway i believe!

WillAyd commented 1 month ago

Sounds good. Enjoy the time away!

WillAyd commented 4 weeks ago

@paleolimbot started to look at this in more earnest to get a full round trip, instead of just a read. The current read implemenation also only assumes one variadic buffer.

I see the Arrow specification says that binary view types have an extra member which is a buffer containing the lengths of the variadic buffers:

https://arrow.apache.org/docs/format/CDataInterface.html#binary-view-arrays

Do you think its worth adding that to the ArrowArray struct, or should we create a new struct, potentially with a layout like:

struct ArrowBinaryViewArray {
  struct ArrowArray array_data;
  // Variadic buffer lengths for Binary View arrays
  int64_t* variadic_buffer_lengths;
};

So that it can be "safely" cast to/from ArrowArray pointers?

WillAyd commented 4 weeks ago

With that approach the one thing I still wonder is where we should store the length of variadic_buffer_lengths. I didn't see anything for that in the specification, so maybe using a NULL sentinel to denote the end of the array is the requirement?

paleolimbot commented 3 weeks ago

started to look at this in more earnest to get a full round trip, instead of just a read

Awesome!

Do you think its worth adding that to the ArrowArray struct, or should we create a new struct, potentially with a layout like:

I think the ArrowArrayView is the right place for this! From Joris' PR:

https://github.com/apache/arrow-nanoarrow/blob/b3a9894c57e86e763aee07fbde239a2293b95cb8/src/nanoarrow/nanoarrow_types.h#L794-L796

I see the Arrow specification says that binary view types have an extra member

I think this refers to an extra buffer (not an extra struct member):

https://github.com/apache/arrow-nanoarrow/pull/367/files#r1459338465

WillAyd commented 3 weeks ago

Ah OK interesting. Do you know if this is confirmed to be working upstream in Arrow? The reason I ask is that when running the test suite for what's already in this PR, when I hit the C++ Export method:

https://github.com/apache/arrow/blob/c2123b8b90ab952f854912459bb33ebaf0d99611/cpp/src/arrow/c/bridge.cc#L661

The n_buffers being assigned is just 3. I was (likely mistakenly) expecting that to be 4 to account for the validity buffer, inline/prefix buffer, variadic buffer(s), and a buffer that holds the sizes to all variadic buffers

The Arrow docs also read a little vague to me:

int64_t ArrowArray.n_buffers

Mandatory. The number of physical buffers backing this array. The number of buffers is a function of the data type, as described in the Columnar format specification, except for the the binary or utf-8 view type, which has one additional buffer compared to the Columnar format specification (see Binary view arrays).

Buffers of children arrays are not included.

I'm not sure if that is saying that the n_buffers argument should be 4 for the view types, or that it should be 3 and the consumer is expected to know that when working with view types, you actually have n_buffers + 1 buffers attached to the ArrowArray

WillAyd commented 3 weeks ago

Bah ignore what I just said - bad debugging. Digging deeper!

codecov-commenter commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 88.04348% with 11 lines in your changes missing coverage. Please review.

Project coverage is 86.06%. Comparing base (2c42268) to head (8c1e825). Report is 22 commits behind head on main.

Files with missing lines Patch % Lines
src/nanoarrow/testing/testing.cc 14.28% 6 Missing :warning:
src/nanoarrow/common/inline_types.h 0.00% 4 Missing :warning:
src/nanoarrow/common/inline_array.h 98.64% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #596 +/- ## ========================================== + Coverage 85.02% 86.06% +1.04% ========================================== Files 92 94 +2 Lines 11750 13697 +1947 ========================================== + Hits 9990 11788 +1798 - Misses 1760 1909 +149 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

WillAyd commented 2 weeks ago

A few more notes! I am gearing up for a release and it would be very cool to include this if we can...what do you see as blockers for getting this PR across the finish line and can I help with any of them?

Great! I should have time to see this through.

There are no major blockers from my end. Let me know any feedback you have and I will take a look

paleolimbot commented 2 weeks ago

Let me know any feedback you have and I will take a look

Great! I'll give this a thorough checkout + run through + review tomorrow 🙂