Open asfimport opened 2 years ago
Dewey Dunnington / @paleolimbot: cc @jorisvandenbossche
Antoine Pitrou / @pitrou:
probably because the R bindings mistakenly assume that if type.id() == Type::EXTENSION then it is safe to cast to our own ExtensionType C++ subclass that implements R-specific things
That would be the first thing to fix IMHO.
Antoine Pitrou / @pitrou: Also it would be nice if you could test with the published R and Python binary packages instead (and/or nightly builds thereof).
Dewey Dunnington / @paleolimbot: Agreed that the R bindings need to fix that! The approach I'm thinking of will create a second copy of the extension registry at the R level so that the Arrow -> R conversion defined by the extension type as registered in R will kick in no matter where the extension type was registered. The rules for type equality and printing should be the same in all places and I think that's reasonable to ask of extension type developers hoping to provide support in multiple languages.
Somewhere in Python must make that assumption as well since registering an extension type will break reading a file:
# apache/arrow@master
library(arrow, warn.conflicts = FALSE)
library(reticulate)
# this is a virtualenv with pyarrow installed against the same commit
use_virtualenv(
"/Users/deweydunnington/Desktop/rscratch/pyarrow-dev",
required = TRUE
)
pa <- import("pyarrow")
feather <- import("pyarrow.feather")
test_type <- arrow::new_extension_type(int32(), "arrow.test_type")
arrow::register_extension_type(test_type)
tf <- tempfile()
write_feather(record_batch(x = new_extension_array(1:5, test_type)), tf)
read_feather(tf, as_data_frame = FALSE)
#> Table
#> 5 rows x 1 columns
#> $x <ExtensionType <>>
# fails if extension type is registered
feather$read_feather(tf)
#> Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: 'x'
#>
#> Detailed traceback:
#> File "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev/lib/python3.9/site-packages/pyarrow/feather.py", line 230, in read_feather
#> return (read_table(
#> File "pyarrow/array.pxi", line 823, in pyarrow.lib._PandasConvertible.to_pandas
#> File "pyarrow/table.pxi", line 3897, in pyarrow.lib.Table._to_pandas
#> File "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 803, in table_to_blockmanager
#> blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
#> File "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1155, in _table_to_blocks
#> return [_reconstruct_block(item, columns, extension_columns)
#> File "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1155, in <listcomp>
#> return [_reconstruct_block(item, columns, extension_columns)
#> File "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 759, in _reconstruct_block
#> pandas_dtype = extension_columns[name]
arrow::unregister_extension_type("arrow.test_type")
feather$read_feather(tf)
#> x
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
Also it would be nice if you could test with the published R and Python binary packages instead (and/or nightly builds thereof).
I also tested this following the install directions for conda (https://arrow.apache.org/install/#c-and-python-conda-packages). It's not the most common way for native R users to interact with arrow, but it is probably the most common method by which GDAL with Arrow will be available. The only extra step I used was conda install -c conda-forge r-reticulate`` and
reticulate::use_python("<the output of which python
in the conda env>")`.
The interaction that breaks GDAL is that the type is stripped if the extension type is not registered (the most common case). We can PR to fix this in GDAL by checking the storage type if it is an extension array or by registering the extension type there. I could also stop writing files with extension type columns (although then we don't get query engine magic without a custom dataset class).
Joris Van den Bossche / @jorisvandenbossche: I don't think your development setup is related to this. We know that in C++ an extension type can only registered once (at the moment), so if we are using a shared C++ library for the R and Python package, and if those two bindings are used in the same process, then registering an extension type with the same name twice (from each binding) logically will result in that error.
So we will need to think about how to deal with the issue of an extension type getting registered from multiple languages, whether we want to allow this and how to do this (I think for the kernel function registry, there are currently also some discussions about having multiple registries, ARROW-16677 / https://github.com/apache/arrow/pull/13214)
Joris Van den Bossche / @jorisvandenbossche:
The interaction that breaks GDAL is that the type is stripped if the extension type is not registered (the most common case). We can PR to fix this in GDAL by checking the storage type if it is an extension array or by registering the extension type there. I could also stop writing files with extension type columns (although then we don't get query engine magic without a custom dataset class).
For this specific aspect (which is not directly related to registering the same extension type from two bindings), I opened an issue upstream: https://github.com/OSGeo/gdal/issues/5834
When registering extension types as is now possible in the R bindings, it looks as though we cannot register an extension type in R and Python at the same time:
I also get a segfault if I try to surface a Python type into R (probably because the R bindings mistakenly assume that if
type.id() == Type::EXTENSION
then it is safe to cast to our ownExtensionType
C++ subclass that implements R-specific things.This came about because the 'geoarrow' Python and 'geoarrow' R packages both register a number of extension type definitions.
geoarrow's R registration: https://github.com/paleolimbot/geoarrow/blob/master/R/pkg-arrow.R#L208-L223
I can also force an interaction if I build GDAL against the same Arrow that the arrow R package is linked against and attempt to load a Feather file saved with an extension type using the sf package. I will attempt to recreate that interaction as well in both R and Python.
I don't know enough about linking to know to what extent this is linked to my own development setup/build of the R package, although I think there are at least some environments where a shared library is picked up first by the R config script (fedora36, for example). It does look like my own R package build is dynamically linking to libarrow.dylib.
Reporter: Dewey Dunnington / @paleolimbot
Note: This issue was originally created as ARROW-16688. Please see the migration documentation for further details.