apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Behaviour of R-specific key/value metadata in the query engine #32017

Open asfimport opened 2 years ago

asfimport commented 2 years ago

In ARROW-16607 there are some changes to metadata handling in arrow_dplyr_query. With extension type support, more column types (like sf::sfc) can be supported, and as support for column types grows, so does the chance that our current policy of restoring metadata by default will cause errors that are difficult to work around. The latest one I have run across is this one:


library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
# required for write_dataset(nc) to work
# remotes::install_github("paleolimbot/geoarrow")
library(geoarrow)
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
tf <- tempfile()
write_dataset(nc, tf)

open_dataset(tf) %>% 
  select(NAME, FIPS) %>% 
  collect()
#> Error in st_geometry.sf(x): attr(obj, "sf_column") does not point to a geometry column.
#> Did you rename it, without setting st_geometry(obj) <- "newname"?

This causes an error because the restored class makes assumptions about the contents of the data frame that we can't necessarily know about (or would have to hard-code for every data frame subclass).

I can see why arrow::write_parquet() and arrow::read_parquet() (and feather, ipc_stream) might want to do this to faithfully roundtrip a data frame, and because the write/read roundtrip (usually) involves the same columns and the same rows, it's probably safe to restore metadata by default.

The query engine does a lot of transformations that can break assumptions like the one I've shown above (where sf expects a certain column to exist and errors otherwise in a way that the user can't work around). Rather than hard-code the assumptions of every data.frame and vector subclass, I wonder if ignoring the R metadata for query engine output would be a better strategy. If it's not the default, it would be nice to provide an escape hatch for users or developers that find themselves in this position with no workaround.

With the addition of the vctrs extension type, there is a route to preserve attributes through the query engine (although it's a bit verbose). We could make it easier to do (e.g., by interpreting I() or rlang::box() in some way).


library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(int_col = 1:5)
attr(df$int_col, "some_attr") <- "some_value"

tf <- tempfile()

# attributes dropped when column is renamed
write_dataset(df, tf)

open_dataset(tf) %>% 
  select(other_int_col = int_col) %>% 
  collect() %>% 
  pull()
#> [1] 1 2 3 4 5

# attributes preserved when column is renamed
table <- arrow_table(int_col = vctrs_extension_array(df$int_col))
write_dataset(table, tf)

open_dataset(tf) %>% 
  select(other_int_col = int_col) %>% 
  collect() %>% 
  pull()
#> [1] 1 2 3 4 5
#> attr(,"some_attr")
#> [1] "some_value"

Reporter: Dewey Dunnington / @paleolimbot

Note: This issue was originally created as ARROW-16670. Please see the migration documentation for further details.

asfimport commented 2 years ago

Weston Pace / @westonpace:

> I wonder if ignoring the R metadata for query engine output would be a better strategy. If it's not the default, it would be nice to provide an escape hatch for users or developers that find themselves in this position with no workaround.

This would be my assumption. The query engine has no idea what metadata is. It does not really make any attempt to preserve it.

Sometimes users are doing something like rewriting a file with a different chunk size or repartitioning a dataset. In this case it can sometimes make sense to persist the original metadata. However, I think the best solution for that is to reattach the metadata after it has gone through the query engine. The write/sink nodes should have options to attach custom metadata. We can expand on these as needed.
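
For illustration only, here is a rough sketch of the "reattach after the query engine" idea done by hand on the R side, reusing the int_col example from the issue description. It is not an arrow feature and does not use any write/sink node option; saved_attrs and result are just names made up for this sketch, and it assumes the caller knows which output column the attributes belong to.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(int_col = 1:5)
attr(df$int_col, "some_attr") <- "some_value"

# Save the custom attributes before the data goes through the query engine
# (saved_attrs is a hypothetical helper variable, not part of arrow)
saved_attrs <- attributes(df$int_col)

tf <- tempfile()
write_dataset(df, tf)

result <- open_dataset(tf) %>%
  select(other_int_col = int_col) %>%
  collect()

# Reattach the saved attributes by hand once the query has run
attributes(result$other_int_col) <- saved_attrs

attr(result$other_int_col, "some_attr")
```

This keeps the decision about which metadata still applies with the caller, who knows what the query did to the columns, rather than with the query engine.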