apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Add ExtensionType implementation for 8-bit boolean values #17682

Closed asfimport closed 4 weeks ago

asfimport commented 6 years ago

Some libraries (e.g. NumPy) represent boolean values as an array of int8 or uint8 values holding 1s and 0s. It can be a challenge to receive such memory without copying.
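To illustrate the byte-per-value layout described above, here is a minimal sketch using NumPy only. The variable names are hypothetical, and it shows only the NumPy side (one byte per boolean, reinterpretable as uint8 without a copy), not any Arrow API.

```python
import numpy as np

# A NumPy boolean array; the variable names here are illustrative only.
flags = np.array([True, False, True, True], dtype=np.bool_)

# NumPy stores one byte per boolean value, not one bit.
print(flags.itemsize)  # bytes used per element
print(flags.nbytes)    # total bytes used by the four values

# Reinterpreting the same buffer as uint8 creates a view, not a copy.
as_bytes = flags.view(np.uint8)
print(as_bytes.tolist())
print(np.shares_memory(flags, as_bytes))
```

A consumer that understands this layout could take the buffer as-is, whereas a bit-packed boolean consumer would first have to repack it.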

Now that we have ExtensionType capabilities, we could define an extension type to distinguish UInt8/Int8-annotated-as-boolean data, so that such data can flow through applications.

A discussion about introducing a new logical type didn't go anywhere, so having a custom container that can be used for these specialized applications is one way to unblock the use case. If we develop some internal use of such data in C++, we would need to be mindful to sanitize it to bit-packed boolean before sending it to another Arrow application.

Reporter: Wes McKinney / @wesm

PRs and other links:

Note: This issue was originally created as ARROW-1674. Please see the migration documentation for further details.

asfimport commented 6 years ago

Uwe Korn / @xhochy: So this is only a hint that the data was initially 8-bit, but we won't support 8-bit booleans? (My preferred answer would be "yes" here, to keep the implementation of the Arrow spec as simple as possible.)

asfimport commented 6 years ago

Wes McKinney / @wesm: The goal is to have enough metadata to support zero-copy transport of memory to or from other runtimes. As a primary representation for computation, we would use the 1-bit variety. Right now there is no way to describe an 8-bit boolean in the metadata, and some applications that are only transporting memory (e.g. to/from Plasma) will not want to convert to bit-packed form.
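The conversion that transport-only applications would rather avoid can be sketched in pure Python. This is an illustration under stated assumptions, not Arrow's implementation: `pack_bool8` is a hypothetical helper, any nonzero byte is treated as true, and bits are packed least-significant-bit first, which is the ordering Arrow uses for bit-packed booleans.

```python
def pack_bool8(values: bytes) -> bytes:
    """Pack byte-per-value booleans into an LSB-first bitmap.

    Hypothetical helper for illustration; not part of any Arrow API.
    """
    packed = bytearray((len(values) + 7) // 8)  # one bit per value, rounded up
    for i, v in enumerate(values):
        if v:  # assumption: any nonzero byte counts as true
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


# Nine 8-bit booleans occupy nine bytes as stored by the producer.
bool8 = bytes([1, 0, 1, 1, 0, 0, 0, 0, 1])
bitmap = pack_bool8(bool8)

print(len(bool8), len(bitmap))  # size before and after packing
print(bitmap.hex())
```

Every value has to be read and rewritten, so the packed form cannot share memory with the original buffer. That is why zero-copy transport needs a way to describe the 8-bit form in metadata.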

asfimport commented 6 years ago

Philipp Moritz / @pcmoritz: I'm giving this a shot now; one question here is if we want a separate type on the C++ side or one type with a boolean flag. I'm leaning towards a separate type BOOL8 right now.

asfimport commented 5 years ago

Wes McKinney / @wesm: I think we should probably define a metadata annotation for uint8/int8 to indicate that the data is semantically boolean. This will enable numpy.bool_ to be round-tripped more gracefully. It doesn't necessarily need to be a formal part of the Arrow format.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Is there still an actual need for this?

asfimport commented 3 years ago

Weston Pace / @westonpace: Yes, it is still needed for zero-copy compatibility with NumPy, which can be useful in a few situations.