colearendt / tidyjson

Tidy your JSON data in R with tidyjson
Other
182 stars 14 forks source link

Using dplyr::group_by() with tbl_json objects changes the type of `..JSON` column #135

Open ramiromagno opened 3 years ago

ramiromagno commented 3 years ago

Here ..JSON is of type character:

> tidyjson::worldbank %>% tidyjson::spread_all()
# A tibble: 500 x 9
   ..JSON       document.id boardapprovaldate  closingdate    countryshortname project_name                   regionname      totalamt `_id.$oid`    
   <chr>              <int> <chr>              <chr>          <chr>            <chr>                          <chr>              <dbl> <chr>         
 1 "{\"_id\":{…           1 2013-11-12T00:00:… 2018-07-07T00… Ethiopia         Ethiopia General Education Qu… Africa            1.30e8 52b213b38594d…
 2 "{\"_id\":{…           2 2013-11-04T00:00:… NA             Tunisia          TN: DTF Social Protection Ref… Middle East an…   0.     52b213b38594d…
 3 "{\"_id\":{…           3 2013-11-01T00:00:… NA             Tuvalu           Tuvalu Aviation Investment Pr… East Asia and …   6.06e6 52b213b38594d…
 4 "{\"_id\":{…           4 2013-10-31T00:00:… NA             Yemen, Republic… Gov't and Civil Society Organ… Middle East an…   0.     52b213b38594d…
 5 "{\"_id\":{…           5 2013-10-31T00:00:… 2019-04-30T00… Lesotho          Second Private Sector Competi… Africa            1.31e7 52b213b38594d…
 6 "{\"_id\":{…           6 2013-10-31T00:00:… NA             Kenya            Additional Financing for Cash… Africa            1.00e7 52b213b38594d…
 7 "{\"_id\":{…           7 2013-10-29T00:00:… 2019-06-30T00… India            National Highways Interconnec… South Asia        5.00e8 52b213b38594d…
 8 "{\"_id\":{…           8 2013-10-29T00:00:… NA             China            China Renewable Energy Scale-… East Asia and …   0.     52b213b38594d…
 9 "{\"_id\":{…           9 2013-10-29T00:00:… 2018-12-31T00… India            Rajasthan Road Sector Moderni… South Asia        1.60e8 52b213b38594d…
10 "{\"_id\":{…          10 2013-10-29T00:00:… 2014-12-31T00… Morocco          MA Accountability and Transpa… Middle East an…   2.00e8 52b213b38594d…
# … with 490 more rows

Now is of type list (scroll to the right):

> tidyjson::worldbank %>% tidyjson::spread_all() %>% dplyr::group_by(countryshortname)
# A tibble: 500 x 9
# Groups:   countryshortname [118]
   document.id boardapprovaldate  closingdate    countryshortname  project_name                     regionname      totalamt `_id.$oid`     ..JSON   
         <int> <chr>              <chr>          <chr>             <chr>                            <chr>              <dbl> <chr>          <list>   
 1           1 2013-11-12T00:00:… 2018-07-07T00… Ethiopia          Ethiopia General Education Qual… Africa            1.30e8 52b213b38594d… <named l…
 2           2 2013-11-04T00:00:… NA             Tunisia           TN: DTF Social Protection Refor… Middle East an…   0.     52b213b38594d… <named l…
 3           3 2013-11-01T00:00:… NA             Tuvalu            Tuvalu Aviation Investment Proj… East Asia and …   6.06e6 52b213b38594d… <named l…
 4           4 2013-10-31T00:00:… NA             Yemen, Republic … Gov't and Civil Society Organiz… Middle East an…   0.     52b213b38594d… <named l…
 5           5 2013-10-31T00:00:… 2019-04-30T00… Lesotho           Second Private Sector Competiti… Africa            1.31e7 52b213b38594d… <named l…
 6           6 2013-10-31T00:00:… NA             Kenya             Additional Financing for Cash T… Africa            1.00e7 52b213b38594d… <named l…
 7           7 2013-10-29T00:00:… 2019-06-30T00… India             National Highways Interconnecti… South Asia        5.00e8 52b213b38594d… <named l…
 8           8 2013-10-29T00:00:… NA             China             China Renewable Energy Scale-Up… East Asia and …   0.     52b213b38594d… <named l…
 9           9 2013-10-29T00:00:… 2018-12-31T00… India             Rajasthan Road Sector Moderniza… South Asia        1.60e8 52b213b38594d… <named l…
10          10 2013-10-29T00:00:… 2014-12-31T00… Morocco           MA Accountability and Transpare… Middle East an…   2.00e8 52b213b38594d… <named l…
# … with 490 more rows
colearendt commented 3 years ago

Thanks for reporting!! That is surprising behavior indeed. I'll take a look and see if this can be improved. To be fair, ..JSON is actually always a list, but it is a "hidden" column, so we just print a character overview of what it looks like in JSON form.

I suspect the problem that is happening here is that the "grouped data frame" is no longer a tbl_json. Do you plan to do anything else with the JSON data after doing the group_by operations? It'd be good to get a sense for the workflow you are shooting for here!

ramiromagno commented 3 years ago

Here's some more context:

#' @importFrom rlang .data
collect_samples <- function(tbl_json) {

  samples_variants <- tbl_json %>%
    tidyjson::enter_object('samples_variants') %>%
    tidyjson::gather_array(column.name = 'sample_id') %>%
    dplyr::select(-'sample_id')

  samples_training <- tbl_json %>%
    tidyjson::enter_object('samples_training') %>%
    tidyjson::gather_array(column.name = 'sample_id') %>%
    dplyr::select(-'sample_id')

  all_samples <-
    tidyjson::bind_rows(samples_variants, samples_training) %>%
    dplyr::group_by(.data$..page, .data$array.index) %>%
    dplyr::mutate(., sample_id = seq_len(dplyr::n()), .after = 'array.index') %>%
    dplyr::arrange('sample_id', .by_group = TRUE) %>%
    dplyr::ungroup() %>%
    tidyjson::as.tbl_json(json.column = '..JSON') # Needed because of https://github.com/colearendt/tidyjson/issues/135.

  return(all_samples)
}

I found that using tidyjson::as.tbl_json(json.column = '..JSON') works as a workaround.

colearendt commented 3 years ago

That makes a lot of sense! Thank you for the context!

Grouped mutates are perfect. One of the concerns we have for supporting grouped tibbles is "summarize" operations, which will necessarily destroy the JSON data and make future tidyjson operations largely meaningless. It'd be great if we could find a nice middle way that supports grouped tibbles for certain operations, or at least makes the confusing state here more clear / understandable. On Feb 6 2021, at 10:28 am, Ramiro Magno notifications@github.com wrote:

Here's some more context:

' @importFrom rlang .datacollect_samples <- function(tbl_json) {

samples_variants <- tbl_json %>% tidyjson::enter_object('samples_variants') %>% tidyjson::gather_array(column.name = 'sample_id') %>% dplyr::select(-'sample_id')

samples_training <- tbl_json %>% tidyjson::enter_object('samples_training') %>% tidyjson::gather_array(column.name = 'sample_id') %>% dplyr::select(-'sample_id')

all_samples <- tidyjson::bind_rows(samples_variants, samples_training) %>% dplyr::group_by(.data$..page, .data$array.index) %>% dplyr::mutate(., sample_id = seq_len(dplyr::n()), .after = 'array.index') %>% dplyr::arrange('sample_id', .by_group = TRUE) %>% dplyr::ungroup() %>% tidyjson::as.tbl_json(json.column = '..JSON') # Needed because of https://github.com/colearendt/tidyjson/issues/135.

return(all_samples) } I found that using tidyjson::as.tbl_json(json.column = '..JSON') works as a workaround.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub (https://github.com/colearendt/tidyjson/issues/135#issuecomment-774494019), or unsubscribe (https://github.com/notifications/unsubscribe-auth/AFQBVVWYAPFBVYSJ25RY6HDS5VNZHANCNFSM4XF2JK5A).

ramiromagno commented 3 years ago

btw, I should probably create a separate issue, but something I see doing myself too often is drop a recently created index column with gather_array, like in the example above:

    tidyjson::gather_array(column.name = 'sample_id') %>%
    dplyr::select(-'sample_id')

you think it would be possible to allow that argument column.name to accept a special value that would automatically drop the index column?

ramiromagno commented 3 years ago

Perhaps?

    tidyjson::gather_array(column.name = NULL)
colearendt commented 3 years ago

Yeah, I like that idea!

Sent from my iPhone

On Feb 6, 2021, at 10:41 AM, Ramiro Magno notifications@github.com wrote:



Perhaps?

tidyjson::gather_array(column.name = NULL)

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/colearendt/tidyjson/issues/135#issuecomment-774496056, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AFQBVVW5LATS4FY65GXZHMDS5VPLBANCNFSM4XF2JK5A.