apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Python] Metadata from C data interface is not valid utf8 #20107

Open asfimport opened 2 years ago

asfimport commented 2 years ago

While trying to round-trip an extension type through schema.metadata (see ARROW-13855 for details), I got bytes that are not valid UTF-8, which IMO goes against the documented description of this field:

A binary string describing the type’s metadata [1]

Specifically, a field

field = pyarrow.field("aa", UuidType())

contains the following:

key len: 20
key: "ARROW:extension:name"
value len: 23
value: "arrow.py_extension_type"
key len: 24
key: "ARROW:extension:metadata"
value len: 28

with the value's data for this key being:

[128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117, 105, 100, 84, 121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46]

This is not valid UTF-8 (see e.g. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=02b67658b3cddf8dc095bc9750fa7032).
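The same check can be reproduced in Python with only the standard library. This is a minimal sketch: the byte list is copied from the report above, while `is_valid_utf8` is a hypothetical helper introduced for illustration. It also disassembles the bytes with `pickletools` (without unpickling them), which suggests the value is a pickle stream rather than text.

```python
import pickletools

# Value bytes for "ARROW:extension:metadata", copied from the report above.
value = bytes([128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10,
               85, 117, 105, 100, 84, 121, 112, 101, 10, 113, 0, 41,
               82, 113, 1, 46])


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if `data` decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(len(value))            # 28, matching the reported "value len"
print(is_valid_utf8(value))  # False: 0x80 cannot start a UTF-8 sequence

# Disassemble the bytes as a pickle stream. genops only parses opcodes;
# it does not import modules or execute anything.
opcodes = [(op.name, arg) for op, arg, _pos in pickletools.genops(value)]
for name, arg in opcodes:
    print(name, arg)
```

The first opcode is `PROTO` with argument 3, followed by a `GLOBAL` that references `test_sql UuidType`, so the leading byte 128 is the pickle protocol marker rather than a corrupted character.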

Maybe I am reading the values incorrectly? (null point?)

[1] https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata

Reporter: Jorge Leitão / @jorgecarleitao

Note: This issue was originally created as ARROW-15613. Please see the migration documentation for further details.

asfimport commented 2 years ago

Jorge Leitão / @jorgecarleitao: cc @pitrou

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: There is actually a discussion about relaxing the UTF-8 requirement for IPC metadata values (see the message recently posted by @jorisvandenbossche, "Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams").

In short: yes, Arrow C++ and PyArrow can put arbitrary binary data in metadata values.

Also cc @lidavidm @emkornfield
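For a consumer, the practical consequence is to keep metadata values as raw bytes and decode them only when the key is known to hold text. The sketch below assumes the layout described in the C data interface documentation (a native-endian int32 pair count, then for each pair an int32 key length, the key bytes, an int32 value length, and the value bytes). `encode_metadata` and `parse_metadata` are hypothetical helpers written for this illustration, not PyArrow functions.

```python
import struct

# Pickled value from the report above (not valid UTF-8).
PICKLED = bytes([128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10,
                 85, 117, 105, 100, 84, 121, 112, 101, 10, 113, 0, 41,
                 82, 113, 1, 46])


def encode_metadata(pairs):
    """Hypothetical helper: pack (key, value) byte pairs into one buffer."""
    out = [struct.pack("=i", len(pairs))]
    for key, value in pairs:
        out += [struct.pack("=i", len(key)), key,
                struct.pack("=i", len(value)), value]
    return b"".join(out)


def parse_metadata(buf):
    """Hypothetical helper: unpack the buffer into (key, value) byte pairs."""
    offset = 0

    def read_int32():
        nonlocal offset
        (n,) = struct.unpack_from("=i", buf, offset)
        offset += 4
        return n

    def read_bytes(n):
        nonlocal offset
        chunk = buf[offset:offset + n]
        offset += n
        return chunk

    pairs = []
    for _ in range(read_int32()):
        key = read_bytes(read_int32())
        value = read_bytes(read_int32())
        pairs.append((key, value))  # values stay as bytes, never str
    return pairs


pairs = [
    (b"ARROW:extension:name", b"arrow.py_extension_type"),
    (b"ARROW:extension:metadata", PICKLED),
]
decoded = parse_metadata(encode_metadata(pairs))
for key, value in decoded:
    print(len(key), key, len(value))
```

Returning `bytes` for both keys and values means the parser never fails on binary payloads; callers that expect text can call `.decode("utf-8")` on specific values and handle the error themselves.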

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: (Side note: this might be just for quick testing, but if you actually want to use the extension type on the rust side as well, you should probably define the extension type in Python as a subclass of pyarrow.ExtensionType, and not pyarrow.PyExtensionType, since the latter uses a pickle dump of the class as the serialized metadata, which you won't be able to use in Rust, I suppose)