apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Cannot create RecordBatch with nested struct containing extension type #33059

Open asfimport opened 2 years ago

asfimport commented 2 years ago

I'm running into the following issue:


pyarrow.lib.ArrowNotImplementedError: Unsupported cast to extension<vast.address<AddressType>> from fixed_size_binary[16]

Use case: I want to create a record batch that contains this type:


pa.struct([("address", AddressType()), ("length", pa.uint8())])

Here, AddressType is an extension type that models an IP address (pa.binary(16)).

Please find attached a self-contained example that illustrates the issue.

 

Environment: macOS 12.5.1 on an Apple M1 Ultra. Reporter: Matthias Vallentin / @mavam


Note: This issue was originally created as ARROW-17839. Please see the migration documentation for further details.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: @mavam Thanks for the report and the nice example!

So what is missing here is the automatic cast from the storage type to the extension type. This is currently being tackled in https://github.com/apache/arrow/pull/14106 (ARROW-14500).

Testing your example with the branch of https://github.com/apache/arrow/pull/14106, it works for me, with one caveat: you need to register the extension types (e.g. pa.register_extension_type(subnet_type)) to preserve them in an IPC roundtrip.

asfimport commented 2 years ago

Matthias Vallentin / @mavam: Thanks for the pointer, @jorisvandenbossche. Glad to see that a fix is underway.

Would you mind pointing me to instructions on how to do the test that you performed? I am using Poetry and couldn't get the branch to compile. In theory, I thought this should do the trick:


[tool.poetry.dependencies]
#pyarrow = "^9.0"
pyarrow = { git = "https://github.com/milesgranger/arrow.git", branch = "ARROW-15545_cast-of-extension-types", subdirectory = "python" }

But this fails to compile due to missing dependencies. (I managed to work around OpenSSL by providing the right env var, but now I'm stuck with Flight not being found.) I was hoping that there is some sort of dev guide that shows how to get going.

asfimport commented 2 years ago

Matthias Vallentin / @mavam: Since I didn't manage (yet) to try out the branch under development, I have one other issue, per the attached example. In this case I'm getting:


TypeError: Incompatible storage type dictionary<values=string, indices=int8, ordered=0> for extension type extension<vast.enumeration<EnumType>>

I'm not sure whether that's considered a "cast" internally or whether I'm simply not creating an ExtensionArray properly from a dictionary. Any guidance would be much appreciated.

enum.py

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche:

Would you mind pointing me to instructions on how to do the test that you performed? I am using Poetry and couldn't get the branch to compile. In theory, I thought this should do the trick:

Since it is not yet merged, you would have to install pyarrow from source, and thus also build Arrow C++ from source, which won't be possible with just Poetry (see https://arrow.apache.org/docs/dev/developers/guide/step_by_step/building.html and https://arrow.apache.org/docs/dev/developers/python.html). We do have nightly packages, but for those you will have to wait until the PR is merged.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche:

Since I didn't manage (yet) to try out the branch under development, I have one other issue, per the attached example. In this case I'm getting:

There is a small difference between the types. Your extension type is defined with a dictionary of uint8 indices, while the dictionary array you have has int8 indices (uint8 vs int8). At the moment these need to be exactly the same, but the referenced PR should also allow casting from dictionary to extension<dictionary>.

asfimport commented 2 years ago

Matthias Vallentin / @mavam: Thanks for the guidance, Joris!

Regarding the dictionary, nowhere in the code do I use int8. Where do I implicitly "commit" to int8 without knowing it?

EDIT: I think I found the issue. Going through pa.array(..., type=dictionary_type) does not create indices of the type given by dictionary_type. I had to go through pa.DictionaryArray.from_arrays with explicitly typed arrays. (The detailed fix is here: https://github.com/tenzir/vast/pull/2606/files)

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche:

Going through pa.array(..., type=dictionary_type) is not creating indices of the type as given by dictionary_type

That would be a bug IMO, but I can't directly reproduce this. In the below example, I specify int8, and it does return a dictionary type using int8:


>>> pa.array(['a', 'b', 'a'], pa.dictionary(pa.int8(), pa.string())).type
DictionaryType(dictionary<values=string, indices=int8, ordered=0>)

asfimport commented 2 years ago

Matthias Vallentin / @mavam: Can you try any type other than int8? For example, uint8 fails for me:


>>> pa.array(['a', 'b', 'a'], pa.dictionary(pa.uint8(), pa.string())).type
DictionaryType(dictionary<values=string, indices=int8, ordered=0>)

asfimport commented 1 year ago

Matthias Vallentin / @mavam: @jorisvandenbossche mind taking a look at this? This still fails for current pyarrow.

assignUser commented 1 year ago

Not a blocker, please re-add if you disagree @jorisvandenbossche