apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R] List column containing data frames with varying numbers of columns #30434

Open asfimport opened 2 years ago

asfimport commented 2 years ago

I'm brand new to arrow, but didn't seem to find anything like this issue in this bug tracker; apologies if this is a known issue. 

Arrow gives me an error when I try to write Parquet or Feather files for a data frame with a list column (df in the MWE) whose elements are data frames with varying numbers of columns:


library(tibble)
library(arrow)

df1 = data.frame(x = c(1, 2, 3), 
                 y = c('a', 'b', 'c'))

df2 = data.frame(x = c(4), 
                 y = c('d'), 
                 z = c('foo'))

comb_df = tibble(id = c(1, 2), 
                 df = c(list(df1), list(df2)))

write_dataset(comb_df, 'mwe', format = 'feather')

This gives me


Error: Unknown: Number of fields in struct (2) incompatible with number of columns in the data frame (3)
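
The message suggests that the struct type has 2 fields (matching df1) while a later element has 3 columns (matching df2). As a rough check before writing, the column names of the nested data frames can be compared. This is a minimal base R sketch; the helper names (dfs, col_sets, same_columns, missing_cols) are made up for illustration and are not part of arrow:

```r
# Same nested data frames as in the MWE
df1 <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
df2 <- data.frame(x = 4, y = "d", z = "foo")
dfs <- list(df1, df2)

# Column names of each nested data frame
col_sets <- lapply(dfs, names)

# TRUE only when every element has the same columns in the same order
same_columns <- length(unique(col_sets)) == 1

# Columns each element lacks relative to the union of all columns
all_cols <- unique(unlist(col_sets))
missing_cols <- lapply(col_sets, function(nm) setdiff(all_cols, nm))

print(same_columns)
print(missing_cols)
```

If same_columns is FALSE, the list column mixes different sets of columns, which is the situation reported here.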

Session info:


─ Session info ────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.1.0 (2021-05-18)
 os       macOS Big Sur 11.6          
 system   x86_64, darwin17.0          
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2021-11-29                  

─ Packages ────────────────────────────────────────────────────────────────────────────
 package     * version date       lib source        
 arrow       * 6.0.1   2021-11-20 [1] CRAN (R 4.1.0)
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
 bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.0)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.0)
 cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.0)
 crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
 fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
 glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
 lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.0)
 magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
 pillar        1.6.3   2021-09-26 [1] CRAN (R 4.1.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
 rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
 rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
 tibble      * 3.1.5   2021-09-30 [1] CRAN (R 4.1.0)
 tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
 utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
 vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
 withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)

[1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

Environment: R 4.1.0, arrow 6.0.1, macOS Big Sur 11.6
Reporter: Dan Hicks

Note: This issue was originally created as ARROW-14909. Please see the migration documentation for further details.

asfimport commented 2 years ago

Dragoș Moldovan-Grünfeld / @dragosmg: Hi Dan Hicks,

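Thanks for submitting this issue. You are correct: list columns whose data frames have varying numbers of columns are not yet supported in arrow. For the time being, there are a couple of possible workarounds: cast every data frame to a common set of columns, or serialize each one to JSON.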


library(tibble)
library(arrow, warn.conflicts = FALSE)

df1 = data.frame(x = c(1, 2, 3), 
                 y = c('a', 'b', 'c'))

df2 = data.frame(x = c(4), 
                 y = c('d'), 
                 z = c('foo'))

comb_df = tibble(id = c(1, 2), 
                 df = c(list(df1), list(df2)))

# make them all have the same column names
all_ptypes <- lapply(comb_df$df, vctrs::vec_ptype)
common_ptype <- vctrs::vec_ptype_common(!!! all_ptypes)
comb_df$df <- lapply(comb_df$df, vctrs::vec_cast, common_ptype)
Table$create(comb_df)
#> Table
#> 2 rows x 2 columns
#> $id <double>
#> $df <list<item: struct<x: double, y: string, z: string>>>

# serialize them to JSON
comb_df = tibble(id = c(1, 2), 
                 df = c(list(df1), list(df2)))

comb_df$df <- vapply(comb_df$df, jsonlite::toJSON, character(1))
Table$create(comb_df)
#> Table
#> 2 rows x 2 columns
#> $id <double>
#> $df <string>
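
With the JSON workaround, the nested data frames come back as plain strings, so they need to be parsed again after reading. A minimal sketch of that round trip, assuming jsonlite is installed; json_col and restored are illustrative names, and column types may not survive exactly (for example, whole-number doubles can come back as integers):

```r
library(jsonlite)

df1 <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
df2 <- data.frame(x = 4, y = "d", z = "foo")

# Serialize each data frame to a single JSON string, as in the workaround
json_col <- vapply(
  list(df1, df2),
  function(d) as.character(toJSON(d)),
  character(1)
)

# After reading the strings back, parse each one into a data frame again
restored <- lapply(json_col, fromJSON)

print(names(restored[[1]]))
print(names(restored[[2]]))
```

Each element of restored is again a data frame with its own set of columns, so the varying column counts are preserved at the cost of storing the data as text.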