GerkeLab / gerkelab-com

Website source for gerkelab.com
http://www.gerkelab.com

Post: Missing elements in list columns #31

Open gadenbuie opened 5 years ago

gadenbuie commented 5 years ago

I was going to post this on community.rstudio.com, but it might work better as a blog post: about halfway through I worked out a good answer.

When working with list columns, it can be useful to mark entire elements as missing, but I’m struggling to find a consistent and easy-to-use data structure that works well with unnest().

Here’s a small example with a list column of tibbles, where, ideally, the 2nd element is “missing”. I’d like to unnest() column y but keep all of the rows in the original data frame. In real life, the tibbles in y are more complicated, but when present they all have the same number and type of columns.

The first idea I tried was to store missingness in the list column as NULL, but unnest() throws an error in this case.

library(tidyverse)
(data_null <- tibble(x = 1:2, y = list(tibble(z = 1L), NULL)))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <tibble [1 × 1]>
#> 2     2 <NULL>
data_null %>% unnest()
#> Each column must either be a list of vectors or a list of data frames [y]

The second idea was to use a zero-row data frame. I was hopeful this would work because it’s easy to grab a valid example and use the valid_ex[0, ] trick to create a zero-row data frame with the correct number and type of columns. With this change unnest() no longer throws an error, but we lose the row that holds the zero-row data frame.

(data_zero_tibble <- tibble(x = 1:2, y = list(tibble(z = 1L), tibble())))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <tibble [1 × 1]>
#> 2     2 <tibble [0 × 0]>
data_zero_tibble %>% unnest()
#> # A tibble: 1 x 2
#>       x     z
#>   <int> <int>
#> 1     1     1

Even trying to .preserve column y in the unnest() call drops the row with the zero-row data frame.

data_zero_tibble %>% unnest(y, .preserve = "y")
#> # A tibble: 1 x 3
#>       x y                    z
#>   <int> <list>           <int>
#> 1     1 <tibble [1 × 1]>     1
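As a minimal sketch of the valid_ex[0, ] trick mentioned above (valid_ex and its columns are made up for illustration): subsetting zero rows keeps the column names and types of the valid example, which a bare tibble() does not.

```r
library(tibble)

# Hypothetical "valid" element; in real data this would be any
# well-formed element taken from the list column.
valid_ex <- tibble(z = 1L, label = "a")

# Zero rows, but the same column names and types as valid_ex.
empty_ex <- valid_ex[0, ]

# A bare tibble() has no columns at all.
bare <- tibble()

str(empty_ex)
str(bare)
```

Either way, the examples in this post suggest that unnest() drops the row, so the typed prototype only helps with column consistency, not with keeping the row.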

What does work is to store an explicit one-row data frame of NA in place of each missing element.

(data_na_int <- tibble(x = 1:2, y = list(tibble(z = 1L), tibble(z = NA_integer_))))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <tibble [1 × 1]>
#> 2     2 <tibble [1 × 1]>
data_na_int %>% unnest()
#> # A tibble: 2 x 2
#>       x     z
#>   <int> <int>
#> 1     1     1
#> 2     2    NA

And neither the type nor the column name of the missing value seems to matter.

(data_na_chr <- tibble(x = 1:2, y = list(tibble(z = 1L), tibble(.drop = NA_character_))))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <tibble [1 × 1]>
#> 2     2 <tibble [1 × 1]>
data_na_chr %>% unnest()
#> # A tibble: 2 x 3
#>       x     z .drop
#>   <int> <int> <chr>
#> 1     1     1 <NA> 
#> 2     2    NA <NA>

This might be the best solution, because it’s not necessary to know anything about the other list elements in advance. All that is needed is a one-row data frame holding an NA. For comparison, here is the same pattern with iris: the zero-row element is dropped, while the one-row NA element is kept.

(data_iris_zero <- tibble(x = 1:2, y = list(iris[1:2, ], iris[0,])))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <df[,5] [2 × 5]>
#> 2     2 <df[,5] [0 × 5]>
data_iris_zero %>% unnest()
#> # A tibble: 2 x 6
#>       x Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1     1          5.1         3.5          1.4         0.2 setosa 
#> 2     1          4.9         3            1.4         0.2 setosa

(data_iris_na <- tibble(x = 1:2, y = list(iris[1:2, ], data.frame(Sepal.Length = NA))))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <df[,5] [2 × 5]>
#> 2     2 <df[,1] [1 × 1]>
data_iris_na %>% unnest()
#> # A tibble: 3 x 6
#>       x Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1     1          5.1         3.5          1.4         0.2 setosa 
#> 2     1          4.9         3            1.4         0.2 setosa 
#> 3     2         NA          NA           NA          NA   <NA>
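The NA placeholder can also be filled in programmatically. The following is only a sketch: na_row() and fill_missing() are hypothetical helpers, not tidyr functions, and unlike the hand-written placeholders above they copy every column of the first non-empty element so the types line up.

```r
library(tidyr)
library(tibble)

# Hypothetical helper: a one-row tibble of NAs with the same column
# names and types as `proto`.
na_row <- function(proto) {
  as_tibble(lapply(proto, function(col) col[NA_integer_]))
}

# Hypothetical helper: replace NULL or zero-row elements of a list
# column with a one-row NA placeholder.
fill_missing <- function(elements) {
  is_empty <- vapply(
    elements,
    function(el) is.null(el) || nrow(el) == 0L,
    logical(1)
  )
  if (all(is_empty)) {
    return(elements)  # nothing to use as a prototype
  }
  proto <- elements[[which(!is_empty)[1]]]
  elements[is_empty] <- list(na_row(proto))
  elements
}

data_null <- tibble(x = 1:2, y = list(tibble(z = 1L), NULL))
data_null$y <- fill_missing(data_null$y)
result <- unnest(data_null, y)
result
```

The sketch assumes every non-empty element is a data frame with the same columns, as in the examples here.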

Finally, another solution is to keep the zero-row data frame element and full_join() the unnest()ed data with the original data, minus the list column.

full_join(
  data_iris_zero %>% unnest(),
  data_iris_zero %>% select(-y)
)
#> Joining, by = "x"
#> # A tibble: 3 x 6
#>       x Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1     1          5.1         3.5          1.4         0.2 setosa 
#> 2     1          4.9         3            1.4         0.2 setosa 
#> 3     2         NA          NA           NA          NA   <NA>
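The join above lets dplyr guess the key. Here is a sketch with the key spelled out, under the assumption (mine, not tested beyond this toy data) that x uniquely identifies each row of the original data frame. Starting a left_join() from the key columns should also keep the original row order.

```r
library(dplyr)
library(tidyr)

data_iris_zero <- tibble(x = 1:2, y = list(iris[1:2, ], iris[0, ]))

# Everything except the list column; x is assumed to be a unique key.
keys <- data_iris_zero %>% select(-y)

# Rows whose element has zero rows are dropped here ...
unnested <- data_iris_zero %>% unnest(y)

# ... and restored here, with NA in all of the unnested columns.
result <- left_join(keys, unnested, by = "x")
result
```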

Created on 2019-06-04 by the reprex package (v0.2.1)
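A note for anyone reading this with a newer tidyr than the one used in the reprex above: from tidyr 1.0.0 on, unnest() has a keep_empty argument that replaces size-0 elements (NULL or a zero-row data frame) with a single row of missing values. A minimal sketch, assuming tidyr >= 1.0.0 is installed:

```r
library(tidyr)
library(tibble)

data_null <- tibble(x = 1:2, y = list(tibble(z = 1L), NULL))
data_zero <- tibble(x = 1:2, y = list(tibble(z = 1L), tibble(z = integer())))

# keep_empty = TRUE keeps the row for x = 2 and fills z with NA.
res_null <- unnest(data_null, y, keep_empty = TRUE)
res_zero <- unnest(data_zero, y, keep_empty = TRUE)

res_null
res_zero
```

With that argument available, the NA placeholder and the join are mostly needed on older tidyr versions.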