apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R] writing/reading a data.frame with column class 'list' changes column class #33784

Open OfekShilon opened 1 year ago

OfekShilon commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

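Writing a data.frame that has a column of class 'list' and reading it back changes the column's class (and, in addition, adds a ptype attribute, as already detailed in #15248).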

# One way to create a column with 'list' class:
library(tibble)
tb <- tibble(list_column = list(c(a = 1, b = 2)))
df <- as.data.frame(tb)
class(df$list_column)
# [1] "list"

# Write + read back:
tmpf <- tempfile()
arrow::write_feather(df, tmpf)
df2 <- arrow::read_feather(tmpf)
class(df2$list_column)
# [1] "arrow_list"    "vctrs_list_of" "vctrs_vctr"    "list"         

unlink(tmpf)

Component(s)

R

OfekShilon commented 1 year ago

A repro with no use of tibble:

df <- data.frame(a=I(list(list(1), list(8))))
class(df$a[[1]])
# [1] "list"

tmpf <- tempfile()
arrow::write_feather(df, tmpf)
df2 <- arrow::read_feather(tmpf)
class(df2$a[[1]])
#[1] "arrow_list"    "vctrs_list_of" "vctrs_vctr"    "list"     

unlink(tmpf)

OfekShilon commented 1 year ago

When the column class changes to c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"), rbind operations (and possibly others) break:

library(tibble)
tb <- tibble(list_column = list(c(a = 1, b = 2)))
df <- as.data.frame(tb)
class(df$list_column)   # [1] "list"

# Write + read back:
tmpf <- tempfile()
arrow::write_feather(df, tmpf)
df2 <- arrow::read_feather(tmpf)
class(df2$list_column)  # [1] "arrow_list"    "vctrs_list_of" "vctrs_vctr"    "list"    

rbind(df,df)    # works
rbind(df2,df2)
# Error in `stop_vctrs()`:
# ! `levels.arrow_list()` not supported.
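
A possible workaround for the rbind failure, sketched under assumptions: strip the vctrs classes and the ptype attribute so the column is a plain base R list again. `as_plain_list_col` is a hypothetical helper name, not part of arrow or vctrs, and the example simulates the column read back from Feather with `structure()` instead of calling arrow, so it only illustrates the idea.

```r
# Hypothetical helper (not part of arrow or vctrs): drop the extra classes
# and the ptype attribute so that only a plain list remains.
as_plain_list_col <- function(x) {
  x <- unclass(x)            # removes the class attribute
  attr(x, "ptype") <- NULL   # removes the ptype attribute
  x
}

# Simulated stand-in for df2$list_column after a write/read round trip
# (assumption: same class vector and ptype attribute as reported above).
col <- structure(
  list(c(1, 2)),
  ptype = numeric(0),
  class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list")
)

plain <- as_plain_list_col(col)
class(plain)

# Intended use on the real data (untested here):
# df2$list_column <- as_plain_list_col(df2$list_column)
```

This only touches the outer column. If the elements are themselves of class arrow_list, as in the tibble-free repro, the helper would have to be applied recursively.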

OfekShilon commented 1 year ago

In addition, when elements of class "list" are saved, their names are lost:

> df <- data.frame(col1=I(list(list(a=1), list(b=8))))
> df$col1[1]
[[1]]
[[1]]$a            <-----
[1] 1

> tmpf <- tempfile()
> arrow::write_feather(df, tmpf)
> df2 <- arrow::read_feather(tmpf)
> df2$col1[1]
[[1]]
<list<double>[1]>
[[1]]              <-----
[1] 1

> unlink(tmpf)

thisisnic commented 1 year ago

Thanks for reporting this, @OfekShilon. Some of the behaviour reported in the last comment above can be seen in #15033 too.

paleolimbot commented 1 year ago

Echoing Nic's thanks for opening this. Our support for list columns is far from perfect. In particular, we drop names without warning, which we should fix.

Supporting names internally in Arrow is hard because Arrow doesn't have an internal concept of named things, so we would have to invent one. We will probably get there, probably via an extension type, but in the meantime you will have to do some conversion to/from arrow yourself as a workaround.

The two workarounds I can think of off the top of my head are (1) serialize list objects on the way in and unserialize them on the way out:

library(tibble)

tb <- tibble(list_column = list(c(a = 1, b = 2)))
str(tb$list_column)
#> List of 1
#>  $ : Named num [1:2] 1 2
#>   ..- attr(*, "names")= chr [1:2] "a" "b"

serialize_list_col_to_binary <- function(x) {
  lapply(x, serialize, NULL)
}

unserialize_list_col_from_binary <- function(x) {
  lapply(x, unserialize)
}

# Write + read back:
tmpf <- tempfile()
tb$list_column <- serialize_list_col_to_binary(tb$list_column)
arrow::write_feather(tb, tmpf)
df2 <- arrow::read_feather(tmpf)
df2$list_column <- unserialize_list_col_from_binary(df2$list_column)

str(df2$list_column)
#> List of 1
#>  $ : Named num [1:2] 1 2
#>   ..- attr(*, "names")= chr [1:2] "a" "b"

...or (2) do some of your own modifications to make the list element types fit better in Arrow. In your case, your list elements could be data.frames:

library(tibble)

tb <- tibble(list_column = list(c(a = 1, b = 2)))
str(tb$list_column)
#> List of 1
#>  $ : Named num [1:2] 1 2
#>   ..- attr(*, "names")= chr [1:2] "a" "b"

list_col_to_arrow_friendly <- function(x) {
  lapply(x, function(x) {
    if (is.null(x)) NULL else as.data.frame(as.list(x))
  })
}

tb$list_column <- list_col_to_arrow_friendly(tb$list_column)
str(tb$list_column)
#> List of 1
#>  $ :'data.frame':    1 obs. of  2 variables:
#>   ..$ a: num 1
#>   ..$ b: num 2

# Write + read back:
tmpf <- tempfile()
arrow::write_feather(tb, tmpf)
df2 <- arrow::read_feather(tmpf)
str(df2$list_column)
#> list<
#>   tbl_df<
#>     a: double
#>     b: double
#>   >
#> > [1:1] 
#> $ : tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
#>  ..$ a: num 1
#>  ..$ b: num 2
#> @ ptype: tibble [0 × 2] (S3: tbl_df/tbl/data.frame)
#>  ..$ a: num(0) 
#>  ..$ b: num(0)

Created on 2023-02-22 with reprex v2.0.2

OfekShilon commented 1 year ago

@paleolimbot Thanks for the suggestions. However:

  1. The unserialize code seems broken:

    ...
    df2 <- arrow::read_feather(tmpf)
    df2$list_column <- unserialize_list_col_from_binary(df2$list_column)
    # Error in FUN(X[[i]], ...) : 'connection' must be a connection
    unserialize(df2$list_column)
    # Error in unserialize(df2$list_column) : 'connection' must be a connection
  2. In the list_col_to_arrow_friendly approach, after a save/load roundtrip I'm getting an arrow_list of tibbles, which is rather far from the list of lists I set out with.

  3. In both approaches, as noted above, the class of the columns returned by a save/load round trip (c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list")) causes rbind to break. This might be caused by the vctrs package. Any insight into it?
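
For points 1 and 2, here is a minimal sketch of more defensive conversions on the way out. Base R's unserialize() raises "'connection' must be a connection" when its argument is neither a raw vector nor a connection, so one possible cause (an assumption, not confirmed in this thread) is that the elements no longer come back as raw vectors. Both function names below are hypothetical, and the round trip is simulated in base R instead of going through arrow.

```r
# Hypothetical variant of unserialize_list_col_from_binary():
# coerce each element to raw before unserializing.
unserialize_list_col_defensive <- function(x) {
  lapply(x, function(el) {
    if (is.null(el)) return(NULL)
    unserialize(as.raw(el))
  })
}

# Hypothetical inverse of list_col_to_arrow_friendly():
# turn each one-row data.frame/tibble element back into a named list.
arrow_friendly_to_list <- function(x) {
  lapply(x, function(el) if (is.null(el)) NULL else as.list(el))
}

original <- list(c(a = 1, b = 2))
binary <- lapply(original, serialize, NULL)

# Simulate elements that come back as integer vectors instead of raw
# (an assumption about what went wrong, used only for illustration).
degraded <- lapply(binary, as.integer)
restored <- unserialize_list_col_defensive(degraded)
identical(restored, original)

# Simulate a column of one-row data frames, as in workaround (2).
rows <- list(data.frame(a = 1, b = 2))
as_lists <- arrow_friendly_to_list(rows)
str(as_lists)
```

If the elements come back as some other type that as.raw() cannot convert, this sketch will not help, so it is worth checking class(df2$list_column[[1]]) first.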