apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.23k stars 3.47k forks source link

[R][Python] Extension types cannot be registered in both R and Python #20265

Open asfimport opened 2 years ago

asfimport commented 2 years ago

When registering extension types as is now possible in the R bindings, it looks as though we cannot register an extension type in R and Python at the same time:


# apache/arrow@master
library(arrow, warn.conflicts = FALSE)
library(reticulate)

# this is a virtualenv with pyarrow installed against the same commit
use_virtualenv(
  "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev",
  required = TRUE
)

pa <- import("pyarrow")
pa[["__version__"]]
#> [1] "9.0.0.dev131+g8a36f0f6c"

py_run_string("
import pyarrow as pa

class TestExtensionType(pa.ExtensionType):

    def __init__(self):
        super().__init__(pa.int32(), 'arrow.test_type')

    def __arrow_ext_serialize__(self):
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(TestExtensionType())
")

arrow::register_extension_type(
  arrow::new_extension_type(int32(), "arrow.test_type")
)
#> Error: Key error: A type extension with name arrow.test_type already defined

I also get a segfault if I try to surface a Python type into R (probably because the R bindings mistakenly assume that if type.id() == Type::EXTENSION then it is safe to cast to our own ExtensionType C++ subclass that implements R-specific things.

This came about because the 'geoarrow' Python and 'geoarrow' R packages both register a number of extension type definitions.

Reporter: Dewey Dunnington / @paleolimbot

Note: This issue was originally created as ARROW-16688. Please see the migration documentation for further details.

asfimport commented 2 years ago

Dewey Dunnington / @paleolimbot: cc @jorisvandenbossche

asfimport commented 2 years ago

Antoine Pitrou / @pitrou:

probably because the R bindings mistakenly assume that if type.id() == Type::EXTENSION then it is safe to cast to our own ExtensionType C++ subclass that implements R-specific things

That would be the first thing to fix IMHO.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Also it would be nice if you could test with the published R and Python binary packages instead (and/or nightly builds thereof).

asfimport commented 2 years ago

Dewey Dunnington / @paleolimbot: Agreed that the R bindings need to fix that! The approach I'm thinking of will create a second copy of the extension registry at the R level so that the Arrow -> R conversion defined by the extension type as registered in R will kick in no matter where the extension type was registered. The rules for type equality and printing should be the same in all places and I think that's reasonable to ask of extension type developers hoping to provide support in multiple languages.

Somewhere in Python must make that assumption as well since registering an extension type will break reading a file:


# apache/arrow@master
library(arrow, warn.conflicts = FALSE)
library(reticulate)

# this is a virtualenv with pyarrow installed against the same commit
use_virtualenv(
  "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev",
  required = TRUE
)

pa <- import("pyarrow")
feather <- import("pyarrow.feather")

test_type <- arrow::new_extension_type(int32(), "arrow.test_type")
arrow::register_extension_type(test_type)

tf <- tempfile()
write_feather(record_batch(x = new_extension_array(1:5, test_type)), tf)

read_feather(tf, as_data_frame = FALSE)
#> Table
#> 5 rows x 1 columns
#> $x <ExtensionType <>>

# fails if extension type is registered
feather$read_feather(tf)
#> Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: 'x'
#> 
#> Detailed traceback:
#>   File "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev/lib/python3.9/site-packages/pyarrow/feather.py", line 230, in read_feather
#>     return (read_table(
#>   File "pyarrow/array.pxi", line 823, in pyarrow.lib._PandasConvertible.to_pandas
#>   File "pyarrow/table.pxi", line 3897, in pyarrow.lib.Table._to_pandas
#>   File "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 803, in table_to_blockmanager
#>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
#>   File "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1155, in _table_to_blocks
#>     return [_reconstruct_block(item, columns, extension_columns)
#>   File "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1155, in <listcomp>
#>     return [_reconstruct_block(item, columns, extension_columns)
#>   File "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 759, in _reconstruct_block
#>     pandas_dtype = extension_columns[name]
arrow::unregister_extension_type("arrow.test_type")
feather$read_feather(tf)
#>   x
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5

Also it would be nice if you could test with the published R and Python binary packages instead (and/or nightly builds thereof).

I also tested this following the install directions for conda (https://arrow.apache.org/install/#c-and-python-conda-packages). It's not the most common way for native R users to interact with arrow, but it is probably the most common method by which GDAL with Arrow will be available. The only extra step I used was conda install -c conda-forge r-reticulate`` andreticulate::use_python("<the output of which python in the conda env>")`.

The interaction that breaks GDAL is that the type is stripped if the extension type is not registered (the most common case). We can PR to fix this in GDAL by checking the storage type if it is an extension array or by registering the extension type there. I could also stop writing files with extension type columns (although then we don't get query engine magic without a custom dataset class).

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: I don't think your development setup is related to this. We know that in C++ an extension type can only registered once (at the moment), so if we are using a shared C++ library for the R and Python package, and if those two bindings are used in the same process, then registering an extension type with the same name twice (from each binding) logically will result in that error.

So we will need to think about how to deal with the issue of an extension type getting registered from multiple languages, whether we want to allow this and how to do this (I think for the kernel function registry, there are currently also some discussions about having multiple registries, ARROW-16677 / https://github.com/apache/arrow/pull/13214)

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche:

The interaction that breaks GDAL is that the type is stripped if the extension type is not registered (the most common case). We can PR to fix this in GDAL by checking the storage type if it is an extension array or by registering the extension type there. I could also stop writing files with extension type columns (although then we don't get query engine magic without a custom dataset class).

For this specific aspect (which is not directly related to registering the same extension type from two bindings), I opened an issue upstream: https://github.com/OSGeo/gdal/issues/5834