apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R] ExtensionType support in R #30947

Closed asfimport closed 2 years ago

asfimport commented 2 years ago

In Python there is support for extension types that consists of a registration step that defines functions to handle metadata serialization and deserialization. In R, any extension name or metadata at the top level is currently obliterated on import. To implement geometry reading and writing to Parquet, IPC, and/or Feather, we will need to at the very least have the extension name and metadata preserved (in R), and at best provide a registration step to customize the behaviour of the resulting Array/DataType.
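The registration idea described above can be sketched as a small toy model. This is not pyarrow's or the R package's API: `register_extension`, `import_field`, and the `example.geometry` type are hypothetical names used only for illustration. The only pieces taken from this issue are the two metadata keys, `ARROW:extension:name` and `ARROW:extension:metadata`. The sketch assumes that an importer which does not recognise an extension name should keep the metadata rather than drop it.

```python
import json

# Metadata keys that mark a field as carrying an extension type
# (the same keys used in the reprex in this issue).
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical registry: extension name -> function that rebuilds a type
# description from the serialized metadata string.
_registry = {}


def register_extension(name, deserialize):
    """Registration step: remember how to decode metadata for `name`."""
    _registry[name] = deserialize


def import_field(storage_type, metadata):
    """Toy import returning (type_description, remaining_metadata).

    Registered extension names are decoded with their deserializer.
    Unregistered names are left in the metadata untouched, so a later
    export can round-trip them instead of losing them.
    """
    name = metadata.get(EXT_NAME_KEY)
    if name in _registry:
        serialized = metadata.get(EXT_META_KEY, "")
        ext_type = _registry[name](storage_type, serialized)
        rest = {
            key: value
            for key, value in metadata.items()
            if key not in (EXT_NAME_KEY, EXT_META_KEY)
        }
        return ext_type, rest
    return {"storage": storage_type}, dict(metadata)


def deserialize_geometry(storage_type, serialized):
    """Hypothetical deserializer for an illustrative geometry type."""
    params = json.loads(serialized) if serialized else {}
    return {
        "extension": "example.geometry",
        "storage": storage_type,
        "crs": params.get("crs"),
    }


register_extension("example.geometry", deserialize_geometry)

# A registered extension is decoded; unrelated metadata is passed through.
known, rest = import_field(
    "binary",
    {
        EXT_NAME_KEY: "example.geometry",
        EXT_META_KEY: json.dumps({"crs": "EPSG:4326"}),
        "something else": "more bananas",
    },
)

# An unregistered extension falls back to storage, with metadata preserved.
unknown, kept = import_field(
    "int32",
    {EXT_NAME_KEY: "extension name!", EXT_META_KEY: "bananas"},
)
```

In this model, the "very least" outcome from the paragraph above corresponds to the fallback branch (name and metadata survive the import), and the "at best" outcome corresponds to the registered branch, where a package supplies its own deserializer.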

Reprex for R:


# remotes::install_github("paleolimbot/narrow")
library(narrow)

carray <- as_narrow_array(1:5)

# Attach an extension name, extension metadata, and an unrelated key
# to the top-level schema metadata
carray$schema$metadata[["ARROW:extension:name"]] <- "extension name!"
carray$schema$metadata[["ARROW:extension:metadata"]] <- "bananas"
carray$schema$metadata[["something else"]] <- "more bananas"

# Round-trip through an arrow::Array
array <- from_narrow_array(carray, arrow::Array)
carray2 <- as_narrow_array(array)

# All three metadata entries are gone after the round trip
carray2$schema$metadata[["ARROW:extension:name"]]
#> NULL
carray2$schema$metadata[["ARROW:extension:metadata"]]
#> NULL
carray2$schema$metadata[["something else"]]
#> NULL

There is some discussion of that as a solution to ARROW-14378, including an example of how pandas implements the 'interval' extension type (example contributed by @jorisvandenbossche).

For the Interval example, there are some different parts living in different places:

Reporter: Dewey Dunnington / @paleolimbot Assignee: Dewey Dunnington / @paleolimbot

PRs and other links:

Note: This issue was originally created as ARROW-15471. Please see the migration documentation for further details.

asfimport commented 2 years ago

Dewey Dunnington / @paleolimbot: A related issue that came up in geoarrow is that it isn't possible to restore field-level metadata (which we can use to make sure things like the coordinate reference system stay with the column when it goes through the compute engine). It looks like this is specifically ignored here:

https://github.com/apache/arrow/blob/master/r/R/field.R#L60

asfimport commented 2 years ago

Weston Pace / @westonpace: Is the field-level metadata being lost at the moment? Extension type metadata should persist through the compute engine (somewhat tangential discussion here: https://issues.apache.org/jira/browse/ARROW-15297)

asfimport commented 2 years ago

Dewey Dunnington / @paleolimbot: The compute engine end is all good! I was trying to put field-level metadata through the C API where Arrow didn't expect it (apparently metadata only persists if you wrap an Array in a RecordBatch). This might be different if trying to import an ExtensionArray (which we haven't implemented yet).

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: I don't know if it is exactly relevant here, but a few notes:

asfimport commented 2 years ago

Dewey Dunnington / @paleolimbot: Thanks...super relevant! I'm definitely not familiar with the details here.

In the process of playing with this, I found that R does preserve the field-level metadata when handled as part of a RecordBatch just as you noted. The comment I made above was a line where I noticed that a user can't specify metadata for a field (but importing via the C API works fine).

Implementing the ExtensionArray/ExtensionType and the registration mechanism is what I hope to do with this ticket!

asfimport commented 2 years ago

Jonathan Keane / @jonkeane: Issue resolved by pull request 12467 https://github.com/apache/arrow/pull/12467