I was going to post this on community.rstudio.com but it might be better as a blog post -- about halfway through I worked out a good answer.
When working with list columns, it can be useful to mark entire elements as missing, but I’m struggling to find a consistent and easy-to-use data structure that works well with unnest().
Here’s a small example with a list column of tibbles, where, ideally, the 2nd element is “missing”. I’d like to unnest() column y but keep all of the rows in the original data frame. In real life, the tibbles in y are more complicated, but when present they all have the same number and type of columns.
The first idea I tried was to store missingness in the list column as NULL, but unnest() throws an error in this case.
library(tidyverse)
(data_null <- tibble(x = 1:2, y = list(tibble(z = 1L), NULL)))
#> # A tibble: 2 x 2
#> x y
#> <int> <list>
#> 1 1 <tibble [1 × 1]>
#> 2 2 <NULL>
data_null %>% unnest()
#> Each column must either be a list of vectors or a list of data frames [y]
The second idea was to use a zero-row data frame. I was hopeful this would work because it’s easy to grab a valid example and use the valid_ex[0, ] trick to create the zero-row data frame with the correct number and type of columns. unnest() now runs without error, but we lose the row with the zero-row data frame.
(data_zero_tibble <- tibble(x = 1:2, y = list(tibble(z = 1L), tibble())))
#> # A tibble: 2 x 2
#> x y
#> <int> <list>
#> 1 1 <tibble [1 × 1]>
#> 2 2 <tibble [0 × 0]>
data_zero_tibble %>% unnest()
#> # A tibble: 1 x 2
#> x z
#> <int> <int>
#> 1 1 1
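The snippet above uses a bare tibble(), which has no columns at all, rather than the valid_ex[0, ] trick from the text. Here is a minimal sketch of that trick; valid_ex is a hypothetical stand-in for any valid list element and is not defined elsewhere in this post. The zero-row slice keeps the column names and types, but unnest() still drops the row.

```r
library(tidyverse)

# valid_ex is a hypothetical stand-in for any valid list element.
valid_ex <- tibble(z = 1L)

# Zero-row slice: same column names and types, no rows.
empty_ex <- valid_ex[0, ]

data_zero_slice <- tibble(x = 1:2, y = list(valid_ex, empty_ex))

# The row with x = 2 is still lost, just as with tibble() above.
unnested_slice <- data_zero_slice %>% unnest(y)
```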
Even trying to .preserve column y in unnest() drops the zero-row element's row.
data_zero_tibble %>% unnest(y, .preserve = "y")
#> # A tibble: 1 x 3
#> x y z
#> <int> <list> <int>
#> 1 1 <tibble [1 × 1]> 1
What does work is to make the missing element a one-row tibble that holds an explicit NA.
(data_na_int <- tibble(x = 1:2, y = list(tibble(z = 1L), tibble(z = NA_integer_))))
#> # A tibble: 2 x 2
#> x y
#> <int> <list>
#> 1 1 <tibble [1 × 1]>
#> 2 2 <tibble [1 × 1]>
data_na_int %>% unnest()
#> # A tibble: 2 x 2
#> x z
#> <int> <int>
#> 1 1 1
#> 2 2 NA
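If the list column already contains NULL elements, as data_null does, they can be swapped for a one-row NA tibble before unnesting. This is only a sketch: na_placeholder and data_filled are names made up for illustration, and the placeholder assumes the column is called z.

```r
library(tidyverse)

data_null <- tibble(x = 1:2, y = list(tibble(z = 1L), NULL))

# Hypothetical placeholder: one row holding a single NA.
na_placeholder <- tibble(z = NA_integer_)

# Replace each NULL element with the placeholder; leave the rest untouched.
data_filled <- data_null %>%
  mutate(y = map(y, ~ if (is.null(.x)) na_placeholder else .x))

# Both rows of the original data frame survive the unnest().
unnested_filled <- data_filled %>% unnest(y)
```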
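And neither the type of the missing value nor the name of its column seems to matter: an NA_character_ in a differently named column keeps the row and just adds an all-NA column.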
(data_na_chr <- tibble(x = 1:2, y = list(tibble(z = 1L), tibble(.drop = NA_character_))))
#> # A tibble: 2 x 2
#> x y
#> <int> <list>
#> 1 1 <tibble [1 × 1]>
#> 2 2 <tibble [1 × 1]>
data_na_chr %>% unnest()
#> # A tibble: 2 x 3
#> x z .drop
#> <int> <int> <chr>
#> 1 1 1 <NA>
#> 2 2 NA <NA>
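This might be the best solution, because it's not necessary to know anything about the other list elements in advance. All that is needed is a one-row data frame holding an NA; it does not need the same columns as the other list elements. Compare a zero-row slice of iris, which is dropped, with a one-row NA data frame, which is kept.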
(data_iris_zero <- tibble(x = 1:2, y = list(iris[1:2, ], iris[0,])))
#> # A tibble: 2 x 2
#> x y
#> <int> <list>
#> 1 1 <df[,5] [2 × 5]>
#> 2 2 <df[,5] [0 × 5]>
data_iris_zero %>% unnest()
#> # A tibble: 2 x 6
#> x Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <int> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 5.1 3.5 1.4 0.2 setosa
#> 2 1 4.9 3 1.4 0.2 setosa
(data_iris_na <- tibble(x = 1:2, y = list(iris[1:2, ], data.frame(Sepal.Length = NA))))
#> # A tibble: 2 x 2
#> x y
#> <int> <list>
#> 1 1 <df[,5] [2 × 5]>
#> 2 2 <df[,1] [1 × 1]>
data_iris_na %>% unnest()
#> # A tibble: 3 x 6
#> x Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <int> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 5.1 3.5 1.4 0.2 setosa
#> 2 1 4.9 3 1.4 0.2 setosa
#> 3 2 NA NA NA NA <NA>
Finally, another solution is to keep the zero-row data frame element and
full_join() the unnest()ed data with the original data, minus the list column.
full_join(
  data_iris_zero %>% unnest(),
  data_iris_zero %>% select(-y)
)
#> Joining, by = "x"
#> # A tibble: 3 x 6
#> x Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <int> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 5.1 3.5 1.4 0.2 setosa
#> 2 1 4.9 3 1.4 0.2 setosa
#> 3 2 NA NA NA NA <NA>
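One more option is not part of the original reprex and depends on your setup. tidyr 1.0.0 and later give unnest() a keep_empty argument, which replaces size-0 elements with a single row of missing values. A minimal sketch, assuming tidyr >= 1.0.0 is installed:

```r
library(tidyverse)

data_iris_zero <- tibble(x = 1:2, y = list(iris[1:2, ], iris[0, ]))

# Assumes tidyr >= 1.0.0; keep_empty does not exist in earlier releases.
kept <- data_iris_zero %>% unnest(y, keep_empty = TRUE)
```

With that argument the zero-row element becomes a row of NA values for x = 2, so no join is needed.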
Created on 2019-06-04 by the reprex package (v0.2.1)