colearendt / tidyjson

Tidy your JSON data in R with tidyjson

How to treat nested arrays? #48

Open jeremystan opened 8 years ago

jeremystan commented 8 years ago

Nested arrays are difficult to work with. For example,

library(tidyjson)
library(dplyr)  # provides %>% and filter()

x <- '[[1, 2], 1]' %>% gather_array() %>% json_types()
x
#>   document.id array.index   type
#> 1           1           1  array
#> 2           1           2 number

At this point, there is no way to gather the next array unless we filter on type == 'array'.

x %>% gather_array("level2")
#> Error in gather_array(., "level2") : 1 records are not arrays
x %>% filter(type == "array") %>% gather_array("level2")
#>   document.id array.index  type level2
#> 1           1           1 array      1
#> 2           1           1 array      2

append_values_number works, but it returns NA for the array row, and recursive = TRUE does not descend into the second-level array. The types could also be mixed, which complicates things further.
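For reference, here is a minimal sketch of the filter-based workaround that keeps both the nested values and the top-level scalar. It assumes a tidyjson version in which dplyr verbs preserve the JSON attribute (as the filter() call above suggests) and in which as_tibble() drops it. The names nested, scalars, combined and the "value" column are only illustrative.

```r
library(tidyjson)
library(dplyr)

x <- '[[1, 2], 1]' %>% gather_array() %>% json_types()

# Branch 1: rows that hold arrays are gathered one level deeper
nested <- x %>%
  filter(type == "array") %>%
  gather_array("level2") %>%
  append_values_number("value")

# Branch 2: rows that hold plain numbers keep their value directly
scalars <- x %>%
  filter(type == "number") %>%
  append_values_number("value")

# Drop the JSON attribute before combining (assumption: as_tibble() does this);
# level2 is filled with NA for the rows that were never arrays
combined <- bind_rows(as_tibble(nested), as_tibble(scalars))
combined
```

This splits the data by type and stitches it back together, so it gets awkward quickly when there are more than two levels or more than two types.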

colearendt commented 7 years ago

A similar case that may be worth considering here is arrays that have been improperly serialized as a bare scalar when there is only one element, i.e. JSON like:

x <- '[{"id": 1, "list":[1,2,3]}, {"id": 2, "list": 4}]'
x %>% gather_array() %>% 
  spread_values(id=jnumber('id')) %>%
  enter_object('list') %>%
  json_types()

The JSON itself parses fine, but the "list" field is inconsistently typed, and it would still be nice to have a way to work with it. The workaround here is the same: filtering on type == 'array'.

I also posted the workaround in an actual question someone had here

colearendt commented 7 years ago

Honestly, it seems all that is really needed here is a way to bypass the type-checking. The function itself already handles these cases fairly nicely when the type-check is removed. Not sure whether the better behavior is a parameter in the function or an environment variable or global option, something like tidyjson.typesafety.
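To make the global-option idea concrete, here is a rough base R sketch. Nothing in it is tidyjson API: check_all_arrays is a hypothetical stand-in for the internal type check, and tidyjson.typesafety is just the option name floated above.

```r
# Hypothetical helper, not part of tidyjson: mimics a type check that can
# be switched off through a global option.
check_all_arrays <- function(types) {
  # Default to strict checking when the option is unset
  strict <- getOption("tidyjson.typesafety", default = TRUE)
  bad <- types != "array"
  if (strict && any(bad)) {
    stop(sum(bad), " records are not arrays", call. = FALSE)
  }
  # Return which records really are arrays
  invisible(!bad)
}

types <- c("array", "number")

# Strict (default): the non-array record raises an error
strict_result <- tryCatch(
  check_all_arrays(types),
  error = function(e) conditionMessage(e)
)

# Lenient: switch the check off, then restore the previous setting
old <- options(tidyjson.typesafety = FALSE)
lenient_result <- check_all_arrays(types)
options(old)
```

A function parameter would be more explicit per call, while an option like this changes behavior for the whole session, which may be surprising in shared code.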

By commenting out the type-checking lines in the gather_factory:

x <- '[{"id": 1, "list":[1,2,3]}, {"id": 2, "list": 4}]'
x %>%
  gather_array() %>%
  enter_object("list") %>%
  json_types() %>%
  gather_array("array.index2") %>%
  json_types("type2")
#> # A tbl_json: 4 x 5 tibble with a "JSON" attribute
#>   `attr(., "JSON")` document.id array.index   type array.index2  type2
#>               <chr>       <int>       <int> <fctr>        <int> <fctr>
#> 1                 1           1           1  array            1 number
#> 2                 2           1           1  array            2 number
#> 3                 3           1           1  array            3 number
#> 4                 4           1           2 number            1 number

x <- "[[1, 2], 1]" %>% gather_array %>% json_types
x %>% gather_array("array.index2") %>% json_types("type2")
#> # A tbl_json: 3 x 5 tibble with a "JSON" attribute
#>   `attr(., "JSON")` document.id array.index   type array.index2  type2
#>               <chr>       <int>       <int> <fctr>        <int> <fctr>
#> 1                 1           1           1  array            1 number
#> 2                 2           1           1  array            2 number
#> 3                 1           1           2 number            1 number

Although perhaps it would be preferable for the array.index2 to be NA and thereby illustrate that it was not an array? Not sure which behavior is more consistent and desirable.

colearendt commented 7 years ago

The change above is very problematic for objects, for which keys are silently thrown away, so a better proposal is required... maybe a way to not touch bad_types and preserve them as NA?

'{"a":"one","b":"two","c":"three"}' %>% 
  gather_array() %>% 
  append_values_string()
#> # A tbl_json: 3 x 3 tibble with a "JSON" attribute
#>   `attr(., "JSON")` document.id array.index string
#>               <chr>       <int>       <int>  <chr>
#> 1         "\"one\""           1           1    one
#> 2         "\"two\""           1           2    two
#> 3       "\"three\""           1           3  three
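As a way to reason about the "preserve them as NA" idea, here is a small base R sketch that works on already-parsed values rather than on a tbl_json. It is not tidyjson code: gather_array_lenient is a hypothetical helper, unnamed lists stand in for JSON arrays, and named lists stand in for JSON objects.

```r
# Hypothetical helper, not part of tidyjson: gathers one level of a list of
# parsed JSON values. Arrays are expanded to one row per element; anything
# else (scalars, objects) is kept as a single row with an NA index instead
# of being expanded or raising an error.
gather_array_lenient <- function(values) {
  is_array <- function(v) is.list(v) && is.null(names(v))
  rows <- lapply(seq_along(values), function(i) {
    v <- values[[i]]
    if (is_array(v)) {
      n <- length(v)
      data.frame(parent = rep(i, n),
                 array.index = seq_len(n),
                 is.array = rep(TRUE, n))
    } else {
      data.frame(parent = i, array.index = NA_integer_, is.array = FALSE)
    }
  })
  do.call(rbind, rows)
}

# An array, a scalar, and an object (named list) side by side
values <- list(list(1, 2), 1, list(a = "one", b = "two"))
out <- gather_array_lenient(values)
out
```

With this behavior the object row survives untouched with array.index set to NA, so its keys are not silently discarded and the row can still be handled by a later object-specific step.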