apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0
169 stars 35 forks source link

feat: Support for Binaryview and StringView types #367

Closed jorisvandenbossche closed 1 week ago

jorisvandenbossche commented 8 months ago

Very draft PR, just putting here publicly what I experimented.

This is currently a minimal addition that just allows to inspect an array / type:

In [1]: builder = pa.lib.StringViewBuilder()
   ...: builder.append("test")
   ...: builder.append("very long string that is not inlined")
   ...: builder.append(None)
   ...: builder.append("test")
   ...: arr = builder.finish()

In [2]: import nanoarrow as na

In [3]: na.c_schema(arr.type)
Out[3]: 
<nanoarrow.c_lib.CSchema string_view>
- format: 'vu'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:

In [4]: na.c_schema_view(arr.type)
Out[4]: 
<nanoarrow.c_lib.CSchemaView>
- type: 'string_view'
- storage_type: 'string_view'

In [5]: na.c_array(arr)
Out[5]: 
<nanoarrow.c_lib.CArray string_view>
- length: 4
- offset: 0
- null_count: 1
- buffers: (140250311598080, 140250311598144, 140250311598208, 94079215855408)
- dictionary: NULL
- children[0]:

In [4]: na.c_array_view(arr)
Out[4]: 
<nanoarrow.c_lib.CArrayView>
- storage_type: 'string_view'
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - <bool validity[1 b] 11010000>
  - <string_view data[0 b] b''>
- dictionary: NULL
- children[0]:
codecov-commenter commented 8 months ago

Codecov Report

Attention: 26 lines in your changes are missing coverage. Please review.

Comparison is base (b3c952a) 88.38% compared to head (b3a9894) 88.02%.

Files Patch % Lines
src/nanoarrow/schema.c 0.00% 19 Missing :warning:
src/nanoarrow/nanoarrow_types.h 0.00% 4 Missing :warning:
src/nanoarrow/array.c 25.00% 3 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #367 +/- ## ========================================== - Coverage 88.38% 88.02% -0.36% ========================================== Files 75 75 Lines 12677 12731 +54 ========================================== + Hits 11205 11207 +2 - Misses 1472 1524 +52 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jorisvandenbossche commented 8 months ago

@paleolimbot the reason that the Python binding for the ArrayView currently checks the buffer types dynamically to determine the number of buffers, is that so this works on an uninitialized ArrayView (just based on the layout, without an Array already being created)? See here:

https://github.com/apache/arrow-nanoarrow/blob/b3c952a3e21c2b47df85dbede3444f852614a3e2/python/src/nanoarrow/_lib.pyx#L685-L690

i.e. why not just return self._ptr.array.n_buffers here?

Because if we don't consider the variadic buffers as part of the ArrayView's buffers, then accessing like the above in Python doesn't work for those

paleolimbot commented 8 months ago

why not just return self._ptr.array.n_buffers here?

I've tried to remove all references to ArrayView.array from nanoarrow (C, Python, R, and otherwise), although I haven't removed the member from the ArrayView yet because I think some previous code relies on it. Basically, the ArrowArrayView doesn't care how the memory it's pointing to got there (i.e., there's no requirement that an ArrowArray ever existed). This is mostly helpful for reading IPC: the decoder first fills in an ArrowArrayView and only allocates the ArrowArray if requested.

Because if we don't consider the variadic buffers as part of the ArrayView's buffers, then accessing like the above in Python doesn't work for those

That's true. We could rewrite n_buffers() to += _ptr.n_variadic_buffers and rewrite buffer() to create a buffer view from the variaidic buffers if i is large enough (or error, since accessing variadic buffers directly isn't strictly needed to decode the content).

rouault commented 8 months ago

Sorry for hickjacking this thread but I've become aware of the introductions of those new arrow data types because GDAL compilation broke against arrow 15.0 (due to a switch() not handling the new cases). Will be addressed in https://github.com/OSGeo/gdal/pull/9116 in a minimalistic way by erroring out on those types. Do you know if those types will be actually found in serialized formats, namely Parquet and Feather ? And I hope that I won't have to add support for them in OGRLayer::WriteArrowBatch()... Their value proposition compared to regular string or binary is unclear to me. The only thing I see is that they might be a way to reduce memory usage by pointing to the same offset in case of duplicated strings? but wasn't that the purpose of dictionaries? (actually the same question would hold for the RUN_END_ENCODED stuff added in libarrow 12). As a relatively new comer to the Arrow ecosystem, I should point that the proliferation of basic data types is going to be a serious obstacle to adoption by new implementations. Perhaps there should be some "data type negociation" mechanism where a consumer could tell to the producer something like "I just understand Int32, Float64, String. Do your best to present me only those data types by possibly morphing content to that"

paleolimbot commented 8 months ago

Do you know if those types will be actually found in serialized formats, namely Parquet and Feather ?

I don't believe it's possible to get a stringview or binaryview from Parquet; however, in theory you could get one from Feather or Arrow IPC today with Arrow 15. I think that until pyarrow implements a basic level of support it's unlikely to actually end up being used. Some day the C++ Parquet scanner might be able to return those types but there are very few people working on that part of the code base and I think it's unlikely to be implemented any time soon.

Their value proposition compared to regular string or binary is unclear to me.

I think the gist of it is that regular string and binary are slow to sort, which is why Meta's Velox, DuckDB, and (very recently) Polars have adopted it as their primary representation. Arrow added it primarily for interchange with those systems (e.g., a user-defined function based on the C Data interface) although I agree that it came at the expense unnecessary complexity for 99.9% of Arrow users.

I should point that the proliferation of basic data types is going to be a serious obstacle to adoption by new implementations.

I agree. In theory this is what nanoarrow is designed to help with (when the types are supported...so not yet). The focus of the version about to be released is testing and stability...0.5.0 is more likely to include features that could be used in a fallback sort of way (i.e., maybe ArrowArrayViewGetLogicalXXX() that would handle the dictionary, run-end-encoded, view, or normal cases at the expense of an extra switch()).

Perhaps there should be some "data type negociation" mechanism

In Arrow C++, this is most likely to be supported via an option in the readers. For Parquet In Python, the __arrow_c_schema__() and __arrow_c_stream__(requested_schema=None) methods can in theory handle this (i.e., you query __arrow_c_schema__() and if it contains types you don't understand, you pass a different schema to __arrow_c_stream__()'s requested_schema. That said, pyarrow doesn't implement requested_schema yet, but it also doesn't really implement the view types either.

And I hope that I won't have to add support for them in OGRLayer::WriteArrowBatch()

For now I think you will be hard-pressed to find a producer that actually produces REE or View types (ListView is also on the horizon if it's not already implemented in Arrow C++). It's on nanoarrow's roadmap to support all of them (but Python bindings are on the roadmap first, which is no small feat!)...perhaps it can help!

jorisvandenbossche commented 8 months ago

Basically, the ArrowArrayView doesn't care how the memory it's pointing to got there (i.e., there's no requirement that an ArrowArray ever existed). This is mostly helpful for reading IPC: the decoder first fills in an ArrowArrayView and only allocates the ArrowArray if requested.

OK, understood! I see the C doc comment actually mentions this: "use it to represent a hypothetical ArrowArray that does not exist yet, or use it to validate the buffers of a future ArrowArray.".

eitsupi commented 8 months ago

FYI, note that Python Polars can create Arrow files with the Utf8View type written to them by using the experimental option;

>>> import polars as pl
>>> import pyarrow.feather as feather
>>> pl.DataFrame({"a": ["foo"]}).write_ipc("test.arrow", future=True)
>>> feather.read_table("test.arrow")
pyarrow.Table
a: string_view
----
a: [["foo"]]

Since Rust Polars started using the Utf8View type, downstream Rust projects will use it.

paleolimbot commented 8 months ago

Nice! We're currently focused on a few other short-term nanoarrow things (e.g., Python bindings) but Joris did all the hard work here and we should be able to get this merged in the next few months with support in R and Python.

eitsupi commented 5 months ago

Hey, R polars package's Series/DataFrame to/from nanoarrow_array_stream conversion have been rewritten to only using the C Stream interface (pola-rs/r-polars#1075, pola-rs/r-polars#1076, pola-rs/r-polars#1078). Even though nanoarrow 0.4.0 does not support StringView, It seems to work fine even when StringView is included.

polars::pl$Series(values = "foo") |>
  nanoarrow::as_nanoarrow_array_stream(future = TRUE) |>
  polars::as_polars_series()
#> polars Series: shape: (1,)
#> Series: '' [str]
#> [
#>  "foo"
#> ]

Created on 2024-05-06 with reprex v2.1.0

paleolimbot commented 4 months ago

@eitsupi Sorry I missed this comment!

It seems to work fine even when StringView is included.

nanoarrow (for R or Python) can definitely transport any type (even invalid ones!), even though conversion to/from R won't work. I think you might also run into trouble printing these (although I believe I have some tests for this...you'd probably be able to see the format and buffer addresses).

JayjeetAtGithub commented 1 month ago

Hi @jorisvandenbossche @paleolimbot We are trying to support interop between arrow::StringViewArray / arrow::BinaryViewArray and cudf::column in libcudf. But since, libcudf uses nanoarrow which does not have string view or binary view types yet, we are blocked. If nanoarrow could support these two types (along with the mappings into ArrowArray / ArrowDeviceArray / ArrowSchema) implementation for these types, we could implement the interop to cudf::column.

paleolimbot commented 1 month ago

I'll take a stab at this in the next few days to see what is required beyond this PR. You probably just need a "build by buffer" and "consume by buffer" level of support (as opposed to building by element or consuming by element, which is harder).

kylebarron commented 1 month ago

Their value proposition compared to regular string or binary is unclear to me

See also for more reading:

paleolimbot commented 1 week ago

Closing since this was implemented in #596 🚀