Closed asfimport closed 2 years ago
Dewey Dunnington / @paleolimbot: A related issue that came up in geoarrow is that it isn't possible to restore field-level metadata (which we can use to make sure things like the coordinate reference system stay with the column when it goes through the compute engine). It looks like this is specifically ignored here:
Weston Pace / @westonpace: Is the field level metadata being lost at the moment? Extension type metadata should persist through the compute engine (somewhat tangential discussion here: https://issues.apache.org/jira/browse/ARROW-15297)
Dewey Dunnington / @paleolimbot: The compute engine end is all good! I was trying to put field-level metadata through the C API where Arrow didn't expect it (apparently metadata only persists if you wrap an Array in a RecordBatch). This might be different if trying to import an ExtensionArray (which we haven't implemented yet).
Joris Van den Bossche / @jorisvandenbossche: I don't know if it is exactly relevant here, but a few notes:
Field class. So an Array object itself cannot hold this metadata in C++. Thus, if you recreate an array from importing it from the C Data Interface, it is "expected" that the metadata is gone. And that's also the reason that if the array is part of a RecordBatch (which has a schema, with fields, potentially with metadata), that this metadata is preserved (in the schema of the RecordBatch).
Dewey Dunnington / @paleolimbot: Thanks...super relevant! I'm definitely not familiar with the details here.
In the process of playing with this, I found that R does preserve the field-level metadata when handled as part of a RecordBatch just as you noted. The comment I made above was a line where I noticed that a user can't specify metadata for a field (but importing via the C API works fine).
Implementing the ExtensionArray/ExtensionType and the registration mechanism is what I hope to do with this ticket!
Jonathan Keane / @jonkeane: Issue resolved by pull request 12467 https://github.com/apache/arrow/pull/12467
In Python there is support for extension types that consists of a registration step that defines functions to handle metadata serialization and deserialization. In R, any extension name or metadata at the top level is currently obliterated on import. To implement geometry reading and writing to Parquet, IPC, and/or Feather, we will need to at the very least have the extension name and metadata preserved (in R), and at best provide a registration step to customize the behaviour of the resulting Array/DataType.
Reprex for R:
There is some discussion of that as a solution to ARROW-14378, including an example of how pandas implements the 'interval' extension type (example contributed by @jorisvandenbossche).
For the Interval example, there are some different parts living in different places:
Reporter: Dewey Dunnington / @paleolimbot
Assignee: Dewey Dunnington / @paleolimbot
PRs and other links:
Note: This issue was originally created as ARROW-15471. Please see the migration documentation for further details.