colearendt / tidyjson

Tidy your JSON data in R with tidyjson
Other
182 stars 14 forks source link

Parent array dropped when child array is empty #117

Open msinjin opened 4 years ago

msinjin commented 4 years ago

[cross posted on stack overflow, getting no love]

It seems as though tidyjson uses an inner-join-like behaviour on nested arrays, thus dropping records with empty child arrays. Is there a way to get left-join-like behaviour instead, filling with NAs?

For example, these fake data have one record with a populated nested array (middles) and two records where middles is empty:

library(tidyjson)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

From these data I wish to have a dataframe with all the parent records retained and missing child-array information filled with NAs (~ left join) as such:

# A tibble: 3 x 5
    age first last  middle1 middle2
  <dbl> <chr> <chr> <chr>   <chr>  
1    32 Bob   Smith John    Rick   
2    54 Susan Doe   NA      NA
3    18 Ann   Jones NA      NA

However, extracting a nested array with some empty child arrays causes loss of their parent's information (~ inner join):

people %>% 
    spread_all() %>%  
    enter_object("name") %>% gather_array() %>% 
    spread_all() %>% select(-document.id,-array.index) %>% 
    enter_object("middles") %>% gather_array %>% 
    spread_all()  %>% select(-array.index) %>% 
    tbl_df()
# A tibble: 1 x 5
    age first last  middle1 middle2
  <dbl> <chr> <chr> <chr>   <chr>  
1    32 Bob   Smith John    Rick   

Is there a way to avoid this; i.e., to retain all the rows even when child arrays are empty?

colearendt commented 4 years ago

Thanks for reporting, and apologies for missing the SO post 😄

I think this is related to #95 and possibly others. enter_object() is the problem here, as it discards records that do not have that object (And in this case, empty is treated as "does not have").

There is a short-term solution in splitting up the pipeline and doing the left_join() yourself. Not ideal I don't think, but functional for now until we have a bit more time to think on this one 😄

library(dplyr)
library(tidyjson)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

# does not work as we want
people %>% 
  spread_all() %>%  
  enter_object("name") %>% gather_array() %>% 
  spread_all() %>% select(-document.id,-array.index) %>% 
  enter_object("middles") %>% gather_array %>% 
  spread_all()  %>% select(-array.index) %>% 
  tbl_df()
#> # A tibble: 1 x 5
#>     age first last  middle1 middle2
#>   <dbl> <chr> <chr> <chr>   <chr>  
#> 1    32 Bob   Smith John    Rick

# prepare the JSON
prep <- people %>% 
  spread_all() %>% 
  enter_object("name") %>% 
  gather_array() %>% 
  spread_all() %>% 
  select(-array.index)

# extract the sub-object
middles <- prep %>% 
  enter_object("middles") %>%
  gather_array() %>% 
  spread_all() %>% 
  select("document.id", starts_with("middle")) %>%
  as_tibble()

# join together
prep %>% 
  as_tibble() %>%
  left_join(
    middles, by = c("document.id")
    )
#> # A tibble: 3 x 6
#>   document.id   age first last  middle1 middle2
#>         <int> <dbl> <chr> <chr> <chr>   <chr>  
#> 1           1    32 Bob   Smith John    Rick   
#> 2           2    54 Susan Doe   <NA>    <NA>   
#> 3           3    18 Ann   Jones <NA>    <NA>
colearendt commented 4 years ago

And another approach that gets close (going tall first and avoiding enter_object()... but we'd really rather have a "list" column that you can then explore, rather than a character).

library(dplyr)
library(tidyr)
library(tidyjson)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')
people %>%
  spread_all() %>%
  enter_object("name") %>%
  gather_array() %>%
  gather_object() %>%
  append_values_string() %>%
  as_tibble() %>%
  spread(name, string)
#> # A tibble: 3 x 6
#>   document.id   age array.index first last  middles                             
#>         <int> <dbl>       <int> <chr> <chr> <chr>                               
#> 1           1    32           1 Bob   Smith "list(middle1 = \"John\", middle2 =…
#> 2           2    54           1 Susan Doe   NA                                  
#> 3           3    18           1 Ann   Jones NA

Created on 2019-12-15 by the reprex package (v0.2.1)

colearendt commented 4 years ago

I had a few minutes, so I tried adding a simple append_values_list() that makes the latter a bit more plausible... enter_object() is a bit more complex to dig into and will need more thought. Column types are still a little weird here with list-cols.

devtools::install_github("colearendt/tidyjson@add_append_values")
library(dplyr)
library(tidyr)
library(purrr)
library(tidyjson)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

structured <- people %>%
  spread_all() %>%
  enter_object("name") %>%
  gather_array() %>%
  gather_object() %>%
  append_values_list() %>% 
  as_tibble()

structured
#> # A tibble: 9 x 5
#>   document.id   age array.index name    list            
#>         <int> <dbl>       <int> <chr>   <list>          
#> 1           1    32           1 first   <chr [1]>       
#> 2           1    32           1 last    <chr [1]>       
#> 3           1    32           1 middles <named list [2]>
#> 4           2    54           1 first   <chr [1]>       
#> 5           2    54           1 last    <chr [1]>       
#> 6           2    54           1 middles <lgl [1]>       
#> 7           3    18           1 first   <chr [1]>       
#> 8           3    18           1 last    <chr [1]>       
#> 9           3    18           1 middles <lgl [1]>

spread_it_out <- structured %>% spread(name, list)
spread_it_out
#> # A tibble: 3 x 6
#>   document.id   age array.index first     last      middles         
#>         <int> <dbl>       <int> <list>    <list>    <list>          
#> 1           1    32           1 <chr [1]> <chr [1]> <named list [2]>
#> 2           2    54           1 <chr [1]> <chr [1]> <lgl [1]>       
#> 3           3    18           1 <chr [1]> <chr [1]> <lgl [1]>

spread_it_out %>% mutate(first = flatten_chr(first), last = flatten_chr(last))
#> # A tibble: 3 x 6
#>   document.id   age array.index first last  middles         
#>         <int> <dbl>       <int> <chr> <chr> <list>          
#> 1           1    32           1 Bob   Smith <named list [2]>
#> 2           2    54           1 Susan Doe   <lgl [1]>       
#> 3           3    18           1 Ann   Jones <lgl [1]>

Created on 2019-12-15 by the reprex package (v0.2.1)

msinjin commented 4 years ago

Thanks for reporting, and apologies for missing the SO post 😄

All good. I really wasn't sure if it was a bug or a feature!

I think this is related to #95 and possibly others. enter_object() is the problem here, as it discards records that do not have that object (And in this case, empty is treated as "does not have").

Interesting. I had considered issue #95 while researching a solution, so that makes sense.

There is a short-term solution in splitting up the pipeline and doing the left_join() yourself. Not ideal I don't think, but functional for now until we have a bit more time to think on this one 😄

I had come up with a similar left_join() work around, which I included in the SO post. It gets the job done but doubles the execution time which isn't great because I'm dealing with GBs of JSON.

I'll try out your other ideas and see how it goes.

Thanks for the quick answer!

colearendt commented 4 years ago

Another possible pattern I just added to the SO post.


library(tidyjson)
library(dplyr)
library(tidyr)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

as_tbl_json(people) %>% spread_all() %>% enter_object("name") %>% gather_array("nameid") %>%
  gather_object("key") %>% 
  {bind_rows(
    filter(., key != "middles") %>% append_values_string("value"),
    filter(., key == "middles") %>% gather_array("middleid") %>% 
      select(-key) %>%
      gather_object("key") %>%
      append_values_string("value")
  )} %>%
  # drop tbl_json
  as_tibble() %>%
  select(-document.id, -nameid, -middleid) %>%
  # could also use tidyr::pivot_wider
  tidyr::spread(key, value)
#> # A tibble: 3 x 5
#>     age first last  middle1 middle2
#>   <dbl> <chr> <chr> <chr>   <chr>  
#> 1    18 Ann   Jones <NA>    <NA>   
#> 2    32 Bob   Smith John    Rick   
#> 3    54 Susan Doe   <NA>    <NA>

Created on 2020-06-28 by the reprex package (v0.3.0)