colearendt / tidyjson

Tidy your JSON data in R with tidyjson

crashing unnest() function #110

Closed by sbilge 5 years ago

sbilge commented 5 years ago

When tidyjson_0.2.3 is used with the unnest() function, the pipeline crashes with the error: Error: nrow(df) not equal to length(json.list)

With version 0.2.1, it was working fine.

Versions: R 3.4.4 tidyjson_0.2.3.9000 dplyr_0.8.0.1 tidyr_0.8.3

colearendt commented 5 years ago

Thanks for the report, @sbilge! I think this is a good case to keep in mind. Do you have a simple reprex that we can use for exploring?

Note that this will happen with many multi-row operations, because they discard the notion of a tbl_json: a single JSON row becomes many rows, so the data frame no longer lines up with the stored JSON. At present, we throw an error in this case, although there are other behaviors that we might explore.

The simplest solution would be to change your pipeline:

# instead of
object %>% unnest()

# try this
object %>% as_tibble() %>% unnest()

as_tibble() should drop the tbl_json class and let you work with the object like a normal tibble (i.e. it forcibly discards the JSON attribute that you no longer need).
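A minimal sketch of that workaround, assuming a hypothetical one-document input whose "pmid" field holds pipe-delimited IDs (the JSON string and the `drugs` name are made up for illustration, not taken from the reporter's file). The tbl_json class is dropped before any step that changes the row count:

```r
library(dplyr)
library(tidyr)
library(tidyjson)

# Hypothetical input: one JSON document with pipe-delimited PubMed IDs.
json <- '{"drug_name": "example", "pmid": "111|222"}'

drugs <- json %>%
  as.tbl_json() %>%
  spread_values(
    drug_name = jstring("drug_name"),
    drug_pmid = jstring("pmid")
  ) %>%
  # Drop the tbl_json class so later steps see an ordinary tibble.
  as_tibble() %>%
  # Split each ID string into a list-column, then make one row per ID.
  mutate(drug_pmid = strsplit(drug_pmid, "|", fixed = TRUE)) %>%
  unnest(drug_pmid)
```

Calling as_tibble() right after the last tidyjson verb is a design choice in this sketch; placing it immediately before unnest(), as in the snippet above, follows the same idea.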

If you are trying to unnest an object that is in your JSON, you might look at the various spread and gather verbs, which could be an alternative to your approach. (Again, a simple reprex may help us make a better recommendation here.)
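As an illustration of the gather-verb route, here is a sketch that assumes the IDs are stored as a JSON array rather than a delimited string. The document, the "pmids" key, and the column names are hypothetical; only the tidyjson verbs themselves are real:

```r
library(dplyr)
library(tidyjson)

# Hypothetical document: each drug carries an array of PubMed IDs.
json <- '{"drugs": [
  {"drug_name": "a", "pmids": ["111", "222"]},
  {"drug_name": "b", "pmids": ["333"]}
]}'

pmids <- json %>%
  as.tbl_json() %>%
  enter_object("drugs") %>%
  gather_array("drug_index") %>%      # one row per drug
  spread_values(drug_name = jstring("drug_name")) %>%
  enter_object("pmids") %>%
  gather_array("pmid_index") %>%      # one row per PubMed ID
  append_values_string("drug_pmid")   # pull each array element into a column
```

Because the row expansion happens inside tidyjson, the result stays a tbl_json and no unnest() call is needed.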

sbilge commented 5 years ago

@colearendt Thank you very much for your answer. object %>% as_tibble() %>% unnest() worked.

I attached a simplified input file. It seems like it crashes when the object "drugs" has a null "drug_pmid" value.

Here is the reprex:

list.of.packages <- c("dplyr", "dtplyr", "tidyr", "stringr", "tidyjson")
lapply(list.of.packages, library, character.only = TRUE)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'tidyjson'
#> The following object is masked from 'package:dplyr':
#> 
#>     bind_rows
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> [[1]]
#> [1] "dplyr"     "stats"     "graphics"  "grDevices" "utils"     "datasets" 
#> [7] "methods"   "base"     
#> 
#> [[2]]
#> [1] "dtplyr"    "dplyr"     "stats"     "graphics"  "grDevices" "utils"    
#> [7] "datasets"  "methods"   "base"     
#> 
#> [[3]]
#>  [1] "tidyr"     "dtplyr"    "dplyr"     "stats"     "graphics" 
#>  [6] "grDevices" "utils"     "datasets"  "methods"   "base"     
#> 
#> [[4]]
#>  [1] "stringr"   "tidyr"     "dtplyr"    "dplyr"     "stats"    
#>  [6] "graphics"  "grDevices" "utils"     "datasets"  "methods"  
#> [11] "base"     
#> 
#> [[5]]
#>  [1] "tidyjson"  "stringr"   "tidyr"     "dtplyr"    "dplyr"    
#>  [6] "stats"     "graphics"  "grDevices" "utils"     "datasets" 
#> [11] "methods"   "base"

biograph_json <- as.tbl_json("/PATH/TO/biograph_json.json")

biograph_drugs <- biograph_json %>%
  enter_object("_items") %>% gather_array() %>%
  spread_values(
    gene_symbol = jstring("gene_symbol"),
    hgnc_id = jstring("hgnc_id")
  ) %>%
  dplyr::select(-array.index) %>%
  enter_object("drugs") %>% gather_array() %>%
  spread_values(
    ATC_code = jstring("ATC_code"),
    drug_name = jstring("drug_name"),
    drug_source_name = jstring("source_name"),
    drugbank_id = jstring("drugbank_id"),
    target_action = jstring("target_action"),
    drug_pmid = jstring("pmid"),
    interaction_type = jstring("interaction_type"),
    is_cancer_drug = jlogical("is_cancer_drug")
  ) %>%
  mutate(hgnc_id = as.integer(hgnc_id)) %>%
  mutate(drug_pmid = ifelse(drug_pmid == "null", NA, drug_pmid)) %>%
  # make a row for every pubmed id
  mutate(drug_pmid = str_split(drug_pmid, "\\|")) %>%
  unnest(drug_pmid) %>%
  dplyr::select(-document.id, -array.index)
#> Error: nrow(df) not equal to length(json.list)

Created on 2019-03-14 by the reprex package (v0.2.1)

biograph_json.json.zip

colearendt commented 5 years ago

Confirmed that this should be working in the latest dev version 0.2.3.9000. Thanks for the report!