Open colearendt opened 7 years ago
I'm not familiar with this package, but I wonder why as.tbl_json()
keeps the quotes. Would it be an option to use a list to store data of varying type?
Oh yes, the JSON data is stored in a list at attr(., 'JSON')
, but the print.tbl_json
method presents the quotes in a column because of JSON's syntax that heavily uses double quotes (for strings, keys, etc.). The column is manufactured in the print method for easy viewing of the raw JSON along with how it relates to the tbl_df
. Perhaps a more illustrative example of the problem in this context:
json <- '{"first": "frodo", "title": "hobbit"}'
tj <- as.tbl_json(json)
print(tj)
#> # A tbl_json: 1 x 1 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id
#> <chr> <int>
#> 1 "{\"first\":\"frodo..." 1
print(attr(tj,'JSON'))
#> [[1]]
#> [[1]]$first
#> [1] "frodo"
#>
#> [[1]]$title
#> [1] "hobbit"
The issue I opened with tibble
is really a question of whether tidyjson
should change it's print
method to accommodate the new parsing behavior, or if the legacy behavior is preferable throughout. The previous print method was much cleaner. Also included an example of the use of the package to build a simple tibble:
library(tidyjson)
json <- "{\"first\": \"frodo\", \"title\": \"hobbit\"}"
tj <- as.tbl_json(json)
print(tj)
#> # A tbl_json: 1 x 1 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id
#> <chr> <int>
#> 1 {"first":"frodo... 1
## Using the package
tj %>% spread_all()
#> # A tbl_json: 1 x 3 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id first title
#> <chr> <int> <chr> <chr>
#> 1 {"first":"frodo... 1 frodo hobbit
tj %>% gather_object %>% append_values_string
#> # A tbl_json: 2 x 3 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id name string
#> <chr> <int> <chr> <chr>
#> 1 "frodo" 1 first frodo
#> 2 "hobbit" 1 title hobbit
Maybe define a simple json_str
class with a custom format()
method?
Ahh interesting - that makes sense. So I expect that would override the print.tbl_df
's handling of the column, were it of that class?
Wow. Very nice. I will explore - much appreciated! I like that solution better than parsing the output of print.tbl_df
after using capture.output
. Of course, I am not the package maintainer, so I cannot speak for him.
@krlmlr Any thoughts/direction on the proposed implementation? I read through utils-format.R
for tibble
and it does not seem that the format
method of any column is called, so I am not quite sure how it would be possible to do as you suggested. For list-columns, the list is printed with type_sum
or obj_sum
. I had a bit of luck defining type_sum.json_str
, but double quotes are still escaped.
j <- structure('{"a":2, "b":4}', class='json_str')
format.json_str <- function(x, ...) {
return(stringr::str_replace_all(x,'"','Z'))
}
## works... not in the tibble
format(j)
#> [1] "{ZaZ:2, ZbZ:4}"
data_frame(a=j)
#> # A tibble: 1 x 1
#> a
#> <chr>
#> 1 "{\"a\":2, \"b\":4}"
## more promising
type_sum.json_str <- function(x) { return(stringr::str_replace_all(x,'"','Z'))}
data_frame(a=list(j))
# # A tibble: 1 x 1
# a
# <list>
# 1 <{ZaZ:2, ZbZ:4}>
## still does not work with quotes
type_sum.json_str <- function(x) { return(stringr::str_replace_all(x,'\\"','z"z'))}
data_frame(a=list(j))
# # A tibble: 1 x 1
# a
# <list>
# 1 "<{z\"zaz\"z:2, z\"zbz\"z:4}>"
I attempted defining a method for is.character
to see if I could pretend a json_str
was not a character, but no luck there either. As far as I can tell, I don't see a way to cleverly dispatch a custom format method.
Error in genericForBasic(name) :
methods may not be defined for primitive function ‘is.character’ in this version of R
Parsing the capture.output
may be the cleaner solution. Just need a better replace strategy to get the spaces back.
capture.output(print(data_frame(a='{"a":2, "b":4}'))) %>%
stringr::str_replace_all('\\\\+"','"') %>%
writeLines()
#> # A tibble: 1 x 1
#> a
#> <chr>
#> 1 "{"a":2, "b":4}"
format.data.frame()
will call format()
on the individual columns. This is how hms
and many other classes work.
Would you like to document the steps that were needed to create custom output for a tibble column in a tibble vignette?
Note that you also need [.json_str
, otherwise the head()
call in trunc_mat()
strips off the class. The following will do:
`[.json_str` <- function(x, ...) { structure(NextMethod(), class = "json_str") }
But even then the tibble formatting routine will reformat your data (even if you wrap it in a list), and this was the main point of your post. We'll need to find a better way here, sorry I led you to the wrong path. The new colformat is a way towards it, but it might take some time to implement.
I'm happy to add an experimental S3 method (subject to change) for formatting a column, or to review a PR that adds one. Eventually this method may move to the colformat package, or colformat may be integrated into tibble. Users of tidyjson will see escaped output in the CRAN version but better output in the development versions.
Thanks for your input.
My pleasure, and thank you for your input as well! That is quite helpful direction - I will take a look and see if/how I am able to help. I am happy to contribute a vignette to describe the functionality as well. It sounds like a very promising feature for tibble to be extensible in this way.
@krlmlr Pretty naive implementation on this branch. I attempted to use colformat
, although I can see that the API is not really ready for that type of use. Though probably not worth a PR, I thought it might be worth sharing in case the functionality is worth exploring as colformat
is developed.
The implementation is explored and some pain points are mentioned in the colformat vignette. Not sure whether there is anything worth opening as an issue / enhancement request to ensure that the thoughts are retained.
Would be interested to hear your feedback! Thanks.
At present, tests are failing for print.tbl_json due to tidyverse/tibble#259 since the new
tibble
release, I presume.