colearendt / tidyjson

Tidy your JSON data in R with tidyjson

`enter_object` : allow keeping row if object does not exist #134

Open ramiromagno opened 3 years ago

ramiromagno commented 3 years ago

Feature request

From the documentation details about enter_object():

After using enter_object, all further tidyjson calls happen inside the referenced object (all other JSON data outside the object is discarded). If the object doesn't exist for a given row / index, then that row will be discarded.

Could you give the user the option not to discard such rows?

From the source code of enter_object it does not seem difficult to allow this:

    function (.x, ...)
    {
        if (!is.tbl_json(.x)) 
            .x <- as.tbl_json(.x)
        path <- path(...)
        json <- json_get(.x)
        json <- purrr::map(json, path %>% as.list)
        tbl_json(.x, json, drop.null.json = TRUE)
    }

Could it be changed to this?

    function (.x, ..., drop.null.json = TRUE)
    {
        if (!is.tbl_json(.x)) 
            .x <- as.tbl_json(.x)
        path <- path(...)
        json <- json_get(.x)
        json <- purrr::map(json, path %>% as.list)
        tbl_json(.x, json, drop.null.json = drop.null.json)
    }
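
For readers unfamiliar with the mechanism, here is a minimal sketch of why a row disappears, using only purrr on a plain list (the toy data and the variable names are made up for illustration, and the filtering step is an approximation of what drop.null.json = TRUE does, not the package's actual code). Passing a list as the mapper makes purrr pluck that path from each element, and a missing key yields NULL:

```r
library(purrr)

# Two parsed JSON documents (toy data); the second has no "tags" key.
json <- list(
  list(id = 1, tags = list("a", "b")),
  list(id = 2)
)

# A list used as the mapper is treated as a path to pluck from each element.
entered <- map(json, list("tags"))

# A missing path comes back as NULL ...
is_null <- map_lgl(entered, is.null)

# ... and dropping NULL entries is what shortens the result.
kept <- entered[!is_null]
length(kept)
```

With drop.null.json = FALSE the NULL entry would stay in place, so the number of rows would match the input.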

Motivation

Perhaps I am not using tidyjson idiomatically, but I would like to use the code below to extract a JSON array and bind it as a new column. In the example below I have a tbl_json with 3 rows; in the third row the object "associated_pgs_ids" is null. Because tidyjson::enter_object then returns only two rows instead of three, I cannot take advantage of the function get_column_chr(): its result can no longer be bound to the starting tibble.

get_column_chr()

    get_column_chr <- function(tbl_json, json_object, col = json_object, only_col = TRUE) {

      tbl_json %>%
        tidyjson::enter_object({{ json_object }}) %>%
        tidyjson::json_get_column(column_name = {{ col }}) %>%
        dplyr::mutate({{ col }} := purrr::map(.data[[col]], as.character)) %>%
        `if`(only_col, tidyjson::as_tibble(.[col]), .) # as_tibble necessary to drop ..JSON col.
    }

Example code

    library(magrittr)
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    library(tidyjson)
    #> 
    #> Attaching package: 'tidyjson'
    #> The following object is masked from 'package:stats':
    #> 
    #>     filter

    tbl_json1 <-
      structure(
        list(
          ..resource = c(
            "https://www.pgscatalog.org/rest/publication/all?offset=0&limit=20&format=json",
            "https://www.pgscatalog.org/rest/publication/all?offset=0&limit=20&format=json",
            "https://www.pgscatalog.org/rest/publication/all?offset=0&limit=20&format=json"
          ),
          ..timestamp = structure(
            c(1611422943.87465, 1611422943.87465,
              1611422943.87465),
            tzone = "",
            class = c("POSIXct", "POSIXt")
          ),
          ..page = c(1L, 1L, 1L),
          array.index = 6:8,
          id = c("PGP000006",
                 "PGP000007", "PGP000008"),
          pubmed_id = c("30104762", "30309464",
                        "31184202"),
          publication_date = c("2018-08-13", "2018-10-01",
                               "2019-06-11"),
          publication = c("Nat Genet", "J Am Coll Cardiol",
                          "Circ Genom Precis Med"),
          title = c(
            "Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations.",
            "Genomic Risk Prediction of Coronary Artery Disease in 480,000 Adults: Implications for Primary Prevention.",
            "Validation of Genome-Wide Polygenic Risk Scores for Coronary Artery Disease in French Canadians."
          ),
          author_fullname = c("Khera AV", "Inouye M", "Wünnemann F"),
          doi = c(
            "10.1038/s41588-018-0183-z",
            "10.1016/j.jacc.2018.07.079",
            "10.1161/CIRCGEN.119.002481"
          ),
          authors = c(
            "Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, Natarajan P, Lander ES, Lubitz SA, Ellinor PT, Kathiresan S.",
            "Inouye M, Abraham G, Nelson CP, Wood AM, Sweeting MJ, Dudbridge F, Lai FY, Kaptoge S, Brozynska M, Wang T, Ye S, Webb TR, Rutter MK, Tzoulaki I, Patel RS, Loos RJF, Keavney B, Hemingway H, Thompson J, Watkins H, Deloukas P, Di Angelantonio E, Butterworth AS, Danesh J, Samani NJ, UK Biobank CardioMetabolic Consortium CHD Working Group.",
            "Wünnemann F, Sin Lo K, Langford-Avelar A, Busseuil D, Dubé MP, Tardif JC, Lettre G."
          ),
          ..JSON = list(
            list(
              id = "PGP000006",
              title = "Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations.",
              doi = "10.1038/s41588-018-0183-z",
              PMID = 30104762L,
              journal = "Nat Genet",
              firstauthor = "Khera AV",
              date_publication = "2018-08-13",
              authors = "Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, Natarajan P, Lander ES, Lubitz SA, Ellinor PT, Kathiresan S.",
              associated_pgs_ids = list(
                "PGS000013",
                "PGS000014",
                "PGS000015",
                "PGS000016",
                "PGS000017"
              )
            ),
            list(
              id = "PGP000007",
              title = "Genomic Risk Prediction of Coronary Artery Disease in 480,000 Adults: Implications for Primary Prevention.",
              doi = "10.1016/j.jacc.2018.07.079",
              PMID = 30309464L,
              journal = "J Am Coll Cardiol",
              firstauthor = "Inouye M",
              date_publication = "2018-10-01",
              authors = "Inouye M, Abraham G, Nelson CP, Wood AM, Sweeting MJ, Dudbridge F, Lai FY, Kaptoge S, Brozynska M, Wang T, Ye S, Webb TR, Rutter MK, Tzoulaki I, Patel RS, Loos RJF, Keavney B, Hemingway H, Thompson J, Watkins H, Deloukas P, Di Angelantonio E, Butterworth AS, Danesh J, Samani NJ, UK Biobank CardioMetabolic Consortium CHD Working Group.",
              associated_pgs_ids = list("PGS000018")
            ),
            list(
              id = "PGP000008",
              title = "Validation of Genome-Wide Polygenic Risk Scores for Coronary Artery Disease in French Canadians.",
              doi = "10.1161/CIRCGEN.119.002481",
              PMID = 31184202L,
              journal = "Circ Genom Precis Med",
              firstauthor = "Wünnemann F",
              date_publication = "2019-06-11",
              authors = "Wünnemann F, Sin Lo K, Langford-Avelar A, Busseuil D, Dubé MP, Tardif JC, Lettre G.",
              associated_pgs_ids = list()
            )
          )
        ),
        row.names = c(NA, 3L),
        class = c("tbl_json",
                  "tbl_df", "tbl", "data.frame")
      )

    get_column_chr <- function(tbl_json, json_object, col = json_object, only_col = TRUE) {

      tbl_json %>%
        tidyjson::enter_object({{ json_object }}) %>%
        tidyjson::json_get_column(column_name = {{ col }}) %>%
        dplyr::mutate({{ col }} := purrr::map(.data[[col]], as.character)) %>%
        `if`(only_col, tidyjson::as_tibble(.[col]), .) # as_tibble necessary to drop ..JSON col.
    }

    tbl_json1 %>%
      dplyr::bind_cols(., get_column_chr(., 'associated_pgs_ids', 'pgs_id'))
    #> Error: Can't recycle `..1` (size 3) to match `..2` (size 2).

    tbl_json1[1:2, ] %>%
      dplyr::bind_cols(., get_column_chr(., 'associated_pgs_ids', 'pgs_id'))
    #> # A tbl_json: 2 x 14 tibble with a "JSON" attribute
    #>   ..JSON ..resource ..timestamp         ..page array.index id    pubmed_id
    #>   <chr>  <chr>      <dttm>               <int>       <int> <chr> <chr>    
    #> 1 "{\"i… https://w… 2021-01-23 17:29:03      1           6 PGP0… 30104762 
    #> 2 "{\"i… https://w… 2021-01-23 17:29:03      1           7 PGP0… 30309464 
    #> # … with 7 more variables: publication_date <chr>, publication <chr>,
    #> #   title <chr>, author_fullname <chr>, doi <chr>, authors <chr>, pgs_id <list>
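
Until such an option exists, one possible workaround is to skip enter_object for this step and pull the array straight out of the parsed JSON list, which keeps one entry per row. The sketch below is only an illustration under assumptions: it works on a plain list that mimics the ..JSON column above (trimmed to two fields), and the names records and pgs are hypothetical. An empty or missing array becomes character(0) rather than a dropped row.

```r
library(purrr)
library(tibble)

# Trimmed stand-in for the parsed JSON records shown above (assumption:
# only the fields needed for the illustration are kept).
records <- list(
  list(id = "PGP000006", associated_pgs_ids = list("PGS000013", "PGS000014")),
  list(id = "PGP000007", associated_pgs_ids = list("PGS000018")),
  list(id = "PGP000008", associated_pgs_ids = list())
)

# One list-column entry per record; as.character() turns both an empty
# list and a missing (NULL) element into character(0).
pgs <- tibble(
  id = map_chr(records, "id"),
  pgs_id = map(records, ~ as.character(.x[["associated_pgs_ids"]]))
)

nrow(pgs)
```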

ramiromagno commented 3 years ago

This seems to be working on my side:

    enter_object2 <- function (.x, ..., drop.null.json = TRUE) {
      if (!tidyjson::is.tbl_json(.x))
        .x <- tidyjson::as.tbl_json(.x)

      path <- tidyjson:::path(...)
      json <- tidyjson::json_get(.x)
      json <- purrr::map(json, path %>% as.list)
      tidyjson::tbl_json(.x, json, drop.null.json = drop.null.json)
    }

It's only a bit risky because I am now depending on the internal function tidyjson:::path(...), but it's the only one.
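
If the dependency on the internal tidyjson:::path() is a concern, a variant that takes the path as a plain character vector might avoid it. This is an untested sketch, not part of tidyjson: the name enter_object_keep and its path argument are hypothetical, it gives up the unquoted-name interface, and it assumes that tbl_json() accepts NULL entries when drop.null.json = FALSE, as reported above.

```r
library(tidyjson)

# Hypothetical helper: like enter_object2 above, but the path is a plain
# character vector, so no internal tidyjson function is needed.
enter_object_keep <- function(.x, path, drop.null.json = FALSE) {
  if (!tidyjson::is.tbl_json(.x))
    .x <- tidyjson::as.tbl_json(.x)

  json <- tidyjson::json_get(.x)
  # as.list(path) turns c("a", "b") into list("a", "b"), which purrr
  # treats as a path to pluck; missing paths yield NULL.
  json <- purrr::map(json, as.list(path))
  tidyjson::tbl_json(.x, json, drop.null.json = drop.null.json)
}

# Toy input (made up): the second document has no "tags" key.
x <- c('{"id": 1, "tags": ["a"]}', '{"id": 2}')
nrow(enter_object_keep(x, "tags"))
nrow(enter_object_keep(x, "tags", drop.null.json = TRUE))
```

The default is flipped to FALSE here only to make the row-keeping behaviour the obvious one for this helper; the proposal above keeps TRUE as the default for backward compatibility.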

colearendt commented 3 years ago

Awesome!! Thanks for reporting this - this is definitely a confusing part of the package. #121 is where we have tracked this in the past, but you have done much more on the topic than anyone previously!

Would you be interested in opening a PR with your function change and some tests? I am inclined to think drop_null_json would be a better name for the new argument. Alternatively, perhaps drop = TRUE would be a better choice: looking at tidyr::spread, it seems that an otherwise "unexplained" argument called "drop" can be put in context by the help docs.

ramiromagno commented 2 years ago

What about .drop = TRUE?

marklyng commented 5 months ago

Any progress on this issue? Also, any thoughts on implementing something similar to spread_all()?