Open OfekShilon opened 1 year ago
A repro with no use of tibble:
df <- data.frame(a=I(list(list(1), list(8))))
class(df$a[[1]])
# [1] "list"
tmpf <- tempfile()
arrow::write_feather(df, tmpf)
df2 <- arrow::read_feather(tmpf)
class(df2$a[[1]])
#[1] "arrow_list" "vctrs_list_of" "vctrs_vctr" "list"
unlink(tmpf)
When the column class changes to c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"), rbind operations (and possibly others) break:
library(tibble)
tb <- tibble(list_column = list(c(a = 1, b = 2)))
df <- as.data.frame(tb)
class(df$list_column) # [1] "list"
# Write + read back:
tmpf <- tempfile()
arrow::write_feather(df, tmpf)
df2 <- arrow::read_feather(tmpf)
class(df2$list_column) # [1] "arrow_list" "vctrs_list_of" "vctrs_vctr" "list"
rbind(df,df) # works
rbind(df2,df2)
# Error in `stop_vctrs()`:
# ! `levels.arrow_list()` not supported.
Also, when elements of class "list" are saved, their names are lost:
> df <- data.frame(col1=I(list(list(a=1), list(b=8))))
> df$col1[1]
[[1]]
[[1]]$a <-----
[1] 1
> tmpf <- tempfile()
> arrow::write_feather(df, tmpf)
> df2 <- arrow::read_feather(tmpf)
> df2$col1[1]
[[1]]
<list<double>[1]>
[[1]] <-----
[1] 1
> unlink(tmpf)
Thanks for reporting this @OfekShilon. In terms of the last comment above and reported behaviour therein, some of this can be seen in #15033 too.
Echoing Nic's thanks for opening this...our support for list columns is far from perfect. In particular, we drop names without warning, which we should fix.
Supporting names internally in Arrow is hard because Arrow doesn't have an internal concept of named things so we would have to invent one. We will probably get there - probably via an extension type - but in the meantime you will have to do some conversion to/from arrow yourself as a workaround.
The two workarounds I can think of off the top of my head are (1) serialize list objects on the way in and unserialize them on the way out:
library(tibble)
tb <- tibble(list_column = list(c(a = 1, b = 2)))
str(tb$list_column)
#> List of 1
#> $ : Named num [1:2] 1 2
#> ..- attr(*, "names")= chr [1:2] "a" "b"
serialize_list_col_to_binary <- function(x) {
  lapply(x, serialize, NULL)
}
unserialize_list_col_from_binary <- function(x) {
  lapply(x, unserialize)
}
# Write + read back:
tmpf <- tempfile()
tb$list_column <- serialize_list_col_to_binary(tb$list_column)
arrow::write_feather(tb, tmpf)
df2 <- arrow::read_feather(tmpf)
df2$list_column <- unserialize_list_col_from_binary(df2$list_column)
str(df2$list_column)
#> List of 1
#> $ : Named num [1:2] 1 2
#> ..- attr(*, "names")= chr [1:2] "a" "b"
...or (2) do some of your own modifications to make the list element types fit better in Arrow. In your case, your list elements could be data.frames:
library(tibble)
tb <- tibble(list_column = list(c(a = 1, b = 2)))
str(tb$list_column)
#> List of 1
#> $ : Named num [1:2] 1 2
#> ..- attr(*, "names")= chr [1:2] "a" "b"
list_col_to_arrow_friendly <- function(x) {
  lapply(x, function(x) {
    if (is.null(x)) NULL else as.data.frame(as.list(x))
  })
}
tb$list_column <- list_col_to_arrow_friendly(tb$list_column)
str(tb$list_column)
#> List of 1
#> $ :'data.frame': 1 obs. of 2 variables:
#> ..$ a: num 1
#> ..$ b: num 2
# Write + read back:
tmpf <- tempfile()
arrow::write_feather(tb, tmpf)
df2 <- arrow::read_feather(tmpf)
str(df2$list_column)
#> list<
#> tbl_df<
#> a: double
#> b: double
#> >
#> > [1:1]
#> $ : tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
#> ..$ a: num 1
#> ..$ b: num 2
#> @ ptype: tibble [0 × 2] (S3: tbl_df/tbl/data.frame)
#> ..$ a: num(0)
#> ..$ b: num(0)
Created on 2023-02-22 with reprex v2.0.2
@paleolimbot Thanks for the suggestions. However:
The unserialize code seems broken:
...
df2 <- arrow::read_feather(tmpf)
df2$list_column <- unserialize_list_col_from_binary(df2$list_column)
# Error in FUN(X[[i]], ...) : 'connection' must be a connection
unserialize(df2$list_column)
# Error in unserialize(df2$list_column) : 'connection' must be a connection
In the list_col_to_arrow_friendly approach, after a save/load roundtrip I'm getting an arrow_list of tibbles, which is rather far from the list of lists I set out with.
In both approaches, as noted above, the class of columns returned by a save/load roundtrip (c("arrow_list","vctrs_list_of","vctrs_vctr","list")) causes rbind to break. This might be caused by the vctrs package - any insight into it?
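One possible stopgap for the rbind breakage, sketched here under assumptions rather than confirmed against arrow itself: strip the extra classes and the ptype attribute from list columns after reading, so they are plain lists again. The helpers as_plain_list and strip_list_col_classes are hypothetical names, not part of arrow or vctrs, and the column below only simulates the reported class vector using base R. This does not bring back names that were already dropped on write.

```r
# Hypothetical helper (not part of arrow or vctrs): recursively drop the
# classes and the ptype attribute from a list, keeping element names.
as_plain_list <- function(x) {
  if (!is.list(x) || is.data.frame(x)) {
    return(x)  # leave atomic vectors and data frames untouched
  }
  x <- unclass(x)
  attr(x, "ptype") <- NULL
  lapply(x, as_plain_list)
}

# Hypothetical helper: apply as_plain_list() to every list column of df.
strip_list_col_classes <- function(df) {
  for (nm in names(df)) {
    if (is.list(df[[nm]]) && !is.data.frame(df[[nm]])) {
      df[[nm]] <- as_plain_list(df[[nm]])
    }
  }
  df
}

# Simulate the class vector reported above (assumption: this mimics the
# shape of what read_feather returns; it is not an actual arrow object).
col <- structure(
  list(list(a = 1), list(b = 8)),
  class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
  ptype = list()
)
df <- data.frame(id = 1:2)
df$a <- col

out <- strip_list_col_classes(df)
class(out$a)  # "list"
```

With a real file, the same call would presumably be applied to the result of arrow::read_feather(tmpf) before calling rbind; whether that fully covers nested columns in every arrow version is untested here.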
Describe the bug, including details regarding any error messages, version, and platform.
(and in addition, adds a ptype attribute - as already detailed in #15248)
Component(s)
R