Open msinjin opened 4 years ago
Thanks for reporting, and apologies for missing the SO post 😄
I think this is related to #95 and possibly others. enter_object()
is the problem here, as it discards records that do not have that object (And in this case, empty is treated as "does not have").
There is a short-term solution in splitting up the pipeline and doing the left_join()
yourself. Not ideal I don't think, but functional for now until we have a bit more time to think on this one 😄
library(dplyr)
library(tidyjson)
people <- c('{"age": 32, "name": [{"first": "Bob", "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
'{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
'{"age": 18, "name": [{"first": "Ann", "last": "Jones", "middles":[]}]}')
# does not work as we want
people %>%
spread_all() %>%
enter_object("name") %>% gather_array() %>%
spread_all() %>% select(-document.id,-array.index) %>%
enter_object("middles") %>% gather_array %>%
spread_all() %>% select(-array.index) %>%
tbl_df()
#> # A tibble: 1 x 5
#> age first last middle1 middle2
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 32 Bob Smith John Rick
# prepare the JSON
prep <- people %>%
spread_all() %>%
enter_object("name") %>%
gather_array() %>%
spread_all() %>%
select(-array.index)
# extract the sub-object
middles <- prep %>%
enter_object("middles") %>%
gather_array() %>%
spread_all() %>%
select("document.id", starts_with("middle")) %>%
as_tibble()
# join together
prep %>%
as_tibble() %>%
left_join(
middles, by = c("document.id")
)
#> # A tibble: 3 x 6
#> document.id age first last middle1 middle2
#> <int> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 32 Bob Smith John Rick
#> 2 2 54 Susan Doe <NA> <NA>
#> 3 3 18 Ann Jones <NA> <NA>
And another approach that gets close (going tall first and avoiding enter_object()
... but we'd really rather have a "list" column that you can then explore, rather than a character).
library(dplyr)
library(tidyr)
library(tidyjson)
people <- c('{"age": 32, "name": [{"first": "Bob", "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
'{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
'{"age": 18, "name": [{"first": "Ann", "last": "Jones", "middles":[]}]}')
people %>%
spread_all() %>%
enter_object("name") %>%
gather_array() %>%
gather_object() %>%
append_values_string() %>%
as_tibble() %>%
spread(name, string)
#> # A tibble: 3 x 6
#> document.id age array.index first last middles
#> <int> <dbl> <int> <chr> <chr> <chr>
#> 1 1 32 1 Bob Smith "list(middle1 = \"John\", middle2 =…
#> 2 2 54 1 Susan Doe NA
#> 3 3 18 1 Ann Jones NA
Created on 2019-12-15 by the reprex package (v0.2.1)
I had a few minutes, so I tried adding a simple append_values_list()
that makes the latter a bit more plausible... enter_object()
is a bit more complex to dig into and will need more thought. Column types are still a little weird here with list-cols.
devtools::install_github("colearendt/tidyjson@add_append_values")
library(dplyr)
library(tidyr)
library(purrr)
library(tidyjson)
people <- c('{"age": 32, "name": [{"first": "Bob", "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
'{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
'{"age": 18, "name": [{"first": "Ann", "last": "Jones", "middles":[]}]}')
structured <- people %>%
spread_all() %>%
enter_object("name") %>%
gather_array() %>%
gather_object() %>%
append_values_list() %>%
as_tibble()
structured
#> # A tibble: 9 x 5
#> document.id age array.index name list
#> <int> <dbl> <int> <chr> <list>
#> 1 1 32 1 first <chr [1]>
#> 2 1 32 1 last <chr [1]>
#> 3 1 32 1 middles <named list [2]>
#> 4 2 54 1 first <chr [1]>
#> 5 2 54 1 last <chr [1]>
#> 6 2 54 1 middles <lgl [1]>
#> 7 3 18 1 first <chr [1]>
#> 8 3 18 1 last <chr [1]>
#> 9 3 18 1 middles <lgl [1]>
spread_it_out <- structured %>% spread(name, list)
spread_it_out
#> # A tibble: 3 x 6
#> document.id age array.index first last middles
#> <int> <dbl> <int> <list> <list> <list>
#> 1 1 32 1 <chr [1]> <chr [1]> <named list [2]>
#> 2 2 54 1 <chr [1]> <chr [1]> <lgl [1]>
#> 3 3 18 1 <chr [1]> <chr [1]> <lgl [1]>
spread_it_out %>% mutate(first = flatten_chr(first), last = flatten_chr(last))
#> # A tibble: 3 x 6
#> document.id age array.index first last middles
#> <int> <dbl> <int> <chr> <chr> <list>
#> 1 1 32 1 Bob Smith <named list [2]>
#> 2 2 54 1 Susan Doe <lgl [1]>
#> 3 3 18 1 Ann Jones <lgl [1]>
Created on 2019-12-15 by the reprex package (v0.2.1)
Thanks for reporting, and apologies for missing the SO post 😄
All good. I really wasn't sure if it was a bug or a feature!
I think this is related to #95 and possibly others.
enter_object()
is the problem here, as it discards records that do not have that object (And in this case, empty is treated as "does not have").
Interesting. I had considered issue #95 while researching a solution, so that makes sense.
There is a short-term solution in splitting up the pipeline and doing the
left_join()
yourself. Not ideal I don't think, but functional for now until we have a bit more time to think on this one 😄
I had come up with a similar left_join()
work around, which I included in the SO post. It gets the job done but doubles the execution time which isn't great because I'm dealing with GBs of JSON.
I'll try out your other ideas and see how it goes.
Thanks for the quick answer!
Another possible pattern I just added to the SO post.
library(tidyjson)
library(dplyr)
library(tidyr)
people <- c('{"age": 32, "name": [{"first": "Bob", "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
'{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
'{"age": 18, "name": [{"first": "Ann", "last": "Jones", "middles":[]}]}')
as_tbl_json(people) %>% spread_all() %>% enter_object("name") %>% gather_array("nameid") %>%
gather_object("key") %>%
{bind_rows(
filter(., key != "middles") %>% append_values_string("value"),
filter(., key == "middles") %>% gather_array("middleid") %>%
select(-key) %>%
gather_object("key") %>%
append_values_string("value")
)} %>%
# drop tbl_json
as_tibble() %>%
select(-document.id, -nameid, -middleid) %>%
# could also use tidyr::pivot_wider
tidyr::spread(key, value)
#> # A tibble: 3 x 5
#> age first last middle1 middle2
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 18 Ann Jones <NA> <NA>
#> 2 32 Bob Smith John Rick
#> 3 54 Susan Doe <NA> <NA>
Created on 2020-06-28 by the reprex package (v0.3.0)
[cross posted on stack overflow, getting no love]
It seems as though tidyjson uses an inner-join-like behaviour on nested arrays, thus dropping records with empty child arrays. Is there a way to get left-join-like behaviour instead, filling with
NA
s?For example, these fake data have one record with a populated nested array (
middles
) and two records wheremiddles
is empty:From these data I wish to have a dataframe with all the parent records retained and missing child-array information filled with
NA
s (~ left join) as such:However, extracting a nested array with some empty child arrays causes loss of their parent's information (~ inner join):
Is there a way to avoid this; i.e., to retain all the rows even when child arrays are empty?