colearendt / tidyjson

Tidy your JSON data in R with tidyjson
Other
182 stars 14 forks source link

Failing Tests for print.tbl_json #99

Open colearendt opened 7 years ago

colearendt commented 7 years ago

At present, tests are failing for print.tbl_json due to tidyverse/tibble#259 since the new tibble release, I presume.

print(as.tbl_json('"a"'))         
#> # A tbl_json: 1 x 1 tibble with a "JSON" attribute
#>   `attr(., "JSON")` document.id
#>               <chr>       <int>
#> 1           "\"a\""           1
krlmlr commented 7 years ago

I'm not familiar with this package, but I wonder why as.tbl_json() keeps the quotes. Would it be an option to use a list to store data of varying type?

colearendt commented 7 years ago

Oh yes, the JSON data is stored in a list at attr(., 'JSON'), but the print.tbl_json method presents the quotes in a column because of JSON's syntax that heavily uses double quotes (for strings, keys, etc.). The column is manufactured in the print method for easy viewing of the raw JSON along with how it relates to the tbl_df. Perhaps a more illustrative example of the problem in this context:

json <- '{"first": "frodo", "title": "hobbit"}'

tj <- as.tbl_json(json)                        

print(tj)                                      
#> # A tbl_json: 1 x 1 tibble with a "JSON" attribute
#>         `attr(., "JSON")` document.id
#>                     <chr>       <int>
#> 1 "{\"first\":\"frodo..."           1

print(attr(tj,'JSON'))                         
#> [[1]]
#> [[1]]$first
#> [1] "frodo"
#> 
#> [[1]]$title
#> [1] "hobbit"

The issue I opened with tibble is really a question of whether tidyjson should change it's print method to accommodate the new parsing behavior, or if the legacy behavior is preferable throughout. The previous print method was much cleaner. Also included an example of the use of the package to build a simple tibble:

library(tidyjson)

json <- "{\"first\": \"frodo\", \"title\": \"hobbit\"}"

tj <- as.tbl_json(json)

print(tj)
#> # A tbl_json: 1 x 1 tibble with a "JSON" attribute
#>    `attr(., "JSON")` document.id
#>                <chr>       <int>
#> 1 {"first":"frodo...           1

## Using the package
tj %>% spread_all()
#> # A tbl_json: 1 x 3 tibble with a "JSON" attribute
#>    `attr(., "JSON")` document.id first  title
#>                <chr>       <int> <chr>  <chr>
#> 1 {"first":"frodo...           1 frodo hobbit

tj %>% gather_object %>% append_values_string
#> # A tbl_json: 2 x 3 tibble with a "JSON" attribute
#>   `attr(., "JSON")` document.id  name string
#>               <chr>       <int> <chr>  <chr>
#> 1           "frodo"           1 first  frodo
#> 2          "hobbit"           1 title hobbit
krlmlr commented 7 years ago

Maybe define a simple json_str class with a custom format() method?

colearendt commented 7 years ago

Ahh interesting - that makes sense. So I expect that would override the print.tbl_df's handling of the column, were it of that class?

krlmlr commented 7 years ago

I guess so. With the upcoming colformat you could even add color.

colearendt commented 7 years ago

Wow. Very nice. I will explore - much appreciated! I like that solution better than parsing the output of print.tbl_df after using capture.output. Of course, I am not the package maintainer, so I cannot speak for him.

colearendt commented 7 years ago

@krlmlr Any thoughts/direction on the proposed implementation? I read through utils-format.R for tibble and it does not seem that the format method of any column is called, so I am not quite sure how it would be possible to do as you suggested. For list-columns, the list is printed with type_sum or obj_sum. I had a bit of luck defining type_sum.json_str, but double quotes are still escaped.

j <- structure('{"a":2, "b":4}', class='json_str')
format.json_str <- function(x, ...) {             
return(stringr::str_replace_all(x,'"','Z'))       
}                                                 

## works... not in the tibble                     
format(j)                                         
#> [1] "{ZaZ:2, ZbZ:4}"
data_frame(a=j)                                   
#> # A tibble: 1 x 1
#>                      a
#>                  <chr>
#> 1 "{\"a\":2, \"b\":4}"

## more promising
type_sum.json_str <- function(x) { return(stringr::str_replace_all(x,'"','Z'))}
data_frame(a=list(j))
# # A tibble: 1 x 1
# a
# <list>
#   1 <{ZaZ:2, ZbZ:4}>

## still does not work with quotes
type_sum.json_str <- function(x) { return(stringr::str_replace_all(x,'\\"','z"z'))}
data_frame(a=list(j))
# # A tibble: 1 x 1
# a
# <list>
#   1 "<{z\"zaz\"z:2, z\"zbz\"z:4}>"

I attempted defining a method for is.character to see if I could pretend a json_str was not a character, but no luck there either. As far as I can tell, I don't see a way to cleverly dispatch a custom format method.

Error in genericForBasic(name) : 
  methods may not be defined for primitive function ‘is.character’ in this version of R

Parsing the capture.output may be the cleaner solution. Just need a better replace strategy to get the spaces back.

capture.output(print(data_frame(a='{"a":2, "b":4}'))) %>%
stringr::str_replace_all('\\\\+"','"') %>%           
writeLines()                                             
#> # A tibble: 1 x 1
#>                      a
#>                  <chr>
#> 1 "{"a":2, "b":4}"
krlmlr commented 7 years ago

format.data.frame() will call format() on the individual columns. This is how hms and many other classes work.

Would you like to document the steps that were needed to create custom output for a tibble column in a tibble vignette?

krlmlr commented 7 years ago

Note that you also need [.json_str, otherwise the head() call in trunc_mat() strips off the class. The following will do:

`[.json_str` <- function(x, ...) { structure(NextMethod(), class = "json_str") }

But even then the tibble formatting routine will reformat your data (even if you wrap it in a list), and this was the main point of your post. We'll need to find a better way here, sorry I led you to the wrong path. The new colformat is a way towards it, but it might take some time to implement.

I'm happy to add an experimental S3 method (subject to change) for formatting a column, or to review a PR that adds one. Eventually this method may move to the colformat package, or colformat may be integrated into tibble. Users of tidyjson will see escaped output in the CRAN version but better output in the development versions.

Thanks for your input.

colearendt commented 7 years ago

My pleasure, and thank you for your input as well! That is quite helpful direction - I will take a look and see if/how I am able to help. I am happy to contribute a vignette to describe the functionality as well. It sounds like a very promising feature for tibble to be extensible in this way.

colearendt commented 7 years ago

@krlmlr Pretty naive implementation on this branch. I attempted to use colformat, although I can see that the API is not really ready for that type of use. Though probably not worth a PR, I thought it might be worth sharing in case the functionality is worth exploring as colformat is developed.

The implementation is explored and some pain points are mentioned in the colformat vignette. Not sure whether there is anything worth opening as an issue / enhancement request to ensure that the thoughts are retained.

Would be interested to hear your feedback! Thanks.