colearendt / tidyjson

Tidy your JSON data in R with tidyjson

json_structure requires document.id #86

Closed jeremystan closed 5 years ago

jeremystan commented 8 years ago

This works:

data_frame(id = 1, json = '"a"') %>% as.tbl_json(json.column = "json") %>% json_structure

But this does not:

data_frame(id = 1, json = '["a"]') %>% as.tbl_json(json.column = "json") %>% json_structure

colearendt commented 7 years ago

Also added a test to support the object case:

data_frame(id = 1, json = '{"a":1}') %>% as.tbl_json(json.column = "json") %>% json_structure

Resolved by imputing document.id in json_structure_init() with row_number() when document.id is not present. I thought this was a better solution than disregarding document.id, since the docs advertise it as a return column. It also lets the output identify which record came from which row. However, document.id is left alone if the column already exists on input, which can produce some non-intuitive results:

## these give different output in document.id
'[{"a":1},{"a":2}]' %>% gather_array() %>% json_structure()
'[{"a":1},{"a":2}]' %>% gather_array() %>% select(-document.id) %>% json_structure()
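The difference comes from the imputation only running when the column is missing. A minimal base-R sketch of that rule follows. `impute_document_id()` is a hypothetical helper written for illustration, not tidyjson's actual code, and `seq_len(nrow(df))` stands in for the `row_number()` call mentioned above. The two input frames only mimic the columns that `gather_array()` would produce for this input.

```r
# Hypothetical helper (not part of tidyjson): add a document.id column
# only when the input does not already carry one.
impute_document_id <- function(df) {
  if (!"document.id" %in% names(df)) {
    # Stand-in for dplyr::row_number(): number the rows 1..n
    df[["document.id"]] <- seq_len(nrow(df))
  }
  df
}

# document.id was dropped before the call: ids are imputed per row
without_id <- impute_document_id(data.frame(array.index = 1:2))

# document.id is already present: it is left untouched, so both
# array elements keep the id of the single source document
with_id <- impute_document_id(
  data.frame(document.id = c(1L, 1L), array.index = 1:2)
)
```

Under these assumptions, `without_id$document.id` numbers each row separately, while `with_id$document.id` keeps the original value for both rows, which matches the mismatch between the two calls above.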

I'm wondering whether the implementation of json_structure() could be improved, whether its scope should be more narrowly defined (i.e. does it need to leave the tbl_json as-is, or should it return an object focused on structure?), or whether I am just struggling to understand its use case.

One note: I'm not sure whether it is intentional that any columns already present on the tbl_json are included in the output of json_structure(). Is that because json_structure_init() does not use transmute()? These values are only included on the parent object, though (e.g. see the id column in the examples above, or array.index below).

## array.index field is preserved
'[{"a":1},{"a":2}]' %>% gather_array() %>% json_structure()