colearendt / tidyjson

Tidy your JSON data in R with tidyjson

Handling entering/exiting objects #95

Open pgensler opened 7 years ago

pgensler commented 7 years ago

Hello,

First off, I would like to thank you for making such a great package; I have truly loved using this tool to work with JSON data. I am trying to parse some JSON data, and I'm running into an issue where I would like to be able to "exit" an object. See this data as an example:

{"review/appearance": 2.5, "beer/style": "Hefeweizen", "review/palate": 1.5, "review/taste": 1.5, "beer/name": "Sausa Weizen", "review/timeUnix": 1234817823, "beer/ABV": 5.0, "beer/beerId": "47986", "beer/brewerId": "10325", "review/timeStruct": {"isdst": 0, "mday": 16, "hour": 20, "min": 57, "sec": 3, "mon": 2, "year": 2009, "yday": 47, "wday": 0}, "review/overall": 1.5, "review/text": "A lot of foam. But a lot.\tIn the smell some banana, and then lactic and tart. Not a good start.\tQuite dark orange in color, with a lively carbonation (now visible, under the foam).\tAgain tending to lactic sourness.\tSame for the taste. With some yeast and banana.", "user/profileName": "stcules", "review/aroma": 2.0}
{"review/appearance": 3.0, "beer/style": "English Strong Ale", "review/palate": 3.0, "review/taste": 3.0, "beer/name": "Red Moon", "review/timeUnix": 1235915097, "beer/ABV": 6.2, "beer/beerId": "48213", "beer/brewerId": "10325", "review/timeStruct": {"isdst": 0, "mday": 1, "hour": 13, "min": 44, "sec": 57, "mon": 3, "year": 2009, "yday": 60, "wday": 6}, "review/overall": 3.0, "review/text": "Dark red color, light beige foam, average.\tIn the smell malt and caramel, not really light.\tAgain malt and caramel in the taste, not bad in the end.\tMaybe a note of honey in teh back, and a light fruitiness.\tAverage body.\tIn the aftertaste a light bitterness, with the malt and red fruit.\tNothing exceptional, but not bad, drinkable beer.", "user/profileName": "stcules", "review/aroma": 2.5}

So far, this is the code I have managed to use to extract my data (shown below). Is there a way to exit the object I am trying to parse? Please let me know. This SO question asks about the same thing, in case you have not seen it: http://stackoverflow.com/questions/35198991/tidyjson-is-there-an-exit-object-equivalent/39829902#39829902

clean <- poop %>%
  spread_values(
    review_appearance = jnumber("review/appearance"),
    beer_style = jstring("beer/style"),
    review_palate = jnumber("review/palate"),
    review_taste = jnumber("review/taste"),
    beer_name = jstring("beer/name"),
    review_time = jstring("review/timeUnix"),
    beer_ABV = jstring("beer/ABV"),
    beer_beerid = jnumber("beer/beerId"),
    beer_breweryid = jstring("beer/brewerId"),
    review_overall = jnumber("review/overall"),
    review_text = jstring("review/text"),
    profile_name = jstring("user/profileName"),
    review_aroma = jnumber("review/aroma")
  ) %>%
  enter_object("review/timeStruct") %>%  # review time is a nested object
  spread_values(
    isdst = jnumber("isdst"),
    mday = jnumber("mday"),
    hour = jnumber("hour"),
    min = jnumber("min"),
    sec = jnumber("sec"),
    mon = jnumber("mon"),
    year = jnumber("year"),
    yday = jnumber("yday"),
    wday = jnumber("wday")
  )

The only way I know to do this would be to parse the data as normal up until that group, then separately parse that particular object (review/timeStruct in this case), and then append the two together. Thanks for all your hard work in putting this package together!
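
A minimal sketch of that two-pass workaround, for illustration only. It assumes both passes keep tidyjson's `document.id` column so the results can be joined on it. The shortened JSON and the names `top`, `nested`, and `combined` are made up for this example.

```r
library(tidyjson)
library(dplyr)

# Shortened, hypothetical record with one nested object
json <- '{"beer/name": "Sausa Weizen", "review/timeStruct": {"mday": 16, "year": 2009}}'

# Pass 1: values at the top level of each document
top <- as.tbl_json(json) %>%
  spread_values(beer_name = jstring("beer/name")) %>%
  tibble::as_tibble() %>%
  select(document.id, beer_name)

# Pass 2: values inside the nested object
nested <- as.tbl_json(json) %>%
  enter_object("review/timeStruct") %>%
  spread_values(mday = jnumber("mday"), year = jnumber("year")) %>%
  tibble::as_tibble() %>%
  select(document.id, mday, year)

# Append the two results, matching rows by document
combined <- left_join(top, nested, by = "document.id")
```

The join key matters here: `enter_object()` drops documents that lack the nested key, so a `left_join()` from the top-level table keeps those documents with `NA` in the nested columns.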

colearendt commented 7 years ago

I do not quite understand the reason for "exiting." It seems the code that you are working with works well (minus one issue with the beerId being a string, not a number). Is there additional processing that you would like to do after the fact that enter_object is causing problems with? Please provide a motivational example of the additional processing / functionality that would be desirable, were an exit_object() function available.

If you have a reason to maintain your "cursor" at the parent-level, so to speak, you can always use a more complex path within the j* functions. i.e. in place of your enter_object() %>% spread_values(), you can do something like:

json %>%
  spread_values(
    isdst = jnumber('review/timeStruct','isdst')
    , mday = jnumber('review/timeStruct','mday')
    ...
  )

EDIT: I also posted an answer on the SO post, as this functionality is a solution there as well.

pgensler commented 7 years ago

Thanks for the reply on this post, I appreciate it. The issue I had when working with the API was that, with a nested object in my data, it seemed like I needed to put the enter_object call at the very end of my pipe sequence to parse it out. That definitely works, but I think it makes the API a little odd to work with at times. I think an exit_object would be beneficial, so that you could do:

enter_object("review/timeStruct") %>%  #review time is a nested object
     spread_values(
##next object to parse goes here 
beer_ABV = jstring("beer/ABV"),
##and so on

I'll definitely test the code above and see if it is easier to work with. I think it would be beneficial to add an example about this to your vignette, and I'm 100% OK with you using this one if you are.

pgensler commented 7 years ago

Let me see if I can get a better reprex of this. It's been a while since I posted this, and I want to make sure I provide an MWE for you.

colearendt commented 7 years ago

I definitely agree that the enter_object() can be a little strange to work with in that its behavior is not reversible. The reprex is much appreciated, though, as it would be helpful to explicitly quantify the missing functionality. It may help clarify where development effort is best spent. Much of the package follows a similar framework of irreversible behavior (e.g. gather_array, append_values), so it is important to think through a handful of examples to see what sort of change serves best. Most examples I have seen are solved by this jstring('path1','path2') functionality, and maybe warrant a way to make that functionality more efficiently typed, if anything.
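
To make the "not reversible" point concrete, here is a small sketch. The JSON is a shortened, hypothetical record, and `inside` and `outside` are illustrative names. After `enter_object()`, parent-level keys are no longer reachable, whereas a multi-key path from the top level reaches both levels.

```r
library(tidyjson)
library(dplyr)

# Shortened, hypothetical record
json <- '{"beer/name": "Sausa Weizen", "review/timeStruct": {"year": 2009}}'

# After entering the nested object, "beer/name" is no longer in scope,
# so beer_name comes back as NA
inside <- as.tbl_json(json) %>%
  enter_object("review/timeStruct") %>%
  spread_values(
    year = jnumber("year"),
    beer_name = jstring("beer/name")
  )

# Staying at the top level and using a multi-key path reaches both levels
outside <- as.tbl_json(json) %>%
  spread_values(
    year = jnumber("review/timeStruct", "year"),
    beer_name = jstring("beer/name")
  )
```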

pgensler commented 7 years ago

So after looking through your code, I think it would help if the documentation clarified when users should be "entering" an object versus simply using the approach you stated above. The vignette implies that you should be able to enter objects with ease, which is very true, but I think it needs to be reframed with a better example to demonstrate the concept illustrated above. Originally, I was operating under the assumption that you should enter the object explicitly, since it is nested, and that the only real way to parse my data was to use enter_object at the end of my pipe sequence. I'd love to help with refactoring the vignette. What do you think of adding this as an example? It only has one line of data, but it illustrates the point:

pacman::p_load(magrittr, tidyjson,dplyr)
poop <-'{"review/appearance": 2.5, "beer/style": "Hefeweizen", "review/palate": 1.5, "review/taste": 1.5, "beer/name": "Sausa Weizen", "review/timeUnix": 1234817823, "beer/ABV": 5.0, "beer/beerId": "47986", "beer/brewerId": "10325", "review/timeStruct": {"isdst": 0, "mday": 16, "hour": 20, "min": 57, "sec": 3, "mon": 2, "year": 2009, "yday": 47, "wday": 0}, "review/overall": 1.5, "review/text": "A lot of foam. But a lot.\\tIn the smell some banana, and then lactic and tart. Not a good start.\\tQuite dark orange in color, with a lively carbonation (now visible, under the foam).\\tAgain tending to lactic sourness.\\tSame for the taste. With some yeast and banana.", "user/profileName": "stcules", "review/aroma": 2.0}'
# json needs to have \t escaped with \\t to parse properly
# note that this does not require enter_object: a single spread_values() call with multi-key paths reaches the nested values
clean  <- poop %>%
  spread_values(
    review_appearance = jnumber("review/appearance"),
    beer_style = jstring("beer/style"),
    review_palate = jnumber("review/palate"),
    review_taste = jnumber("review/taste"),
    beer_name = jstring("beer/name"),
    review_time = jstring("review/timeUnix"),
    beer_ABV = jstring("beer/ABV"),
    beer_beerid = jnumber("beer/beerId"),
    beer_breweryid = jstring("beer/brewerId"),
    review_overall = jnumber("review/overall"),
    review_text = jstring("review/text"),
    profile_name = jstring("user/profileName"),
    review_aroma = jnumber("review/aroma"),
    isdst = jnumber("review/timeStruct","isdst"),
    mday = jnumber("review/timeStruct","mday"),
    hour = jnumber("review/timeStruct","hour"),
    min = jnumber("review/timeStruct","min"),
    sec = jnumber("review/timeStruct","sec"),
    mon = jnumber("review/timeStruct","mon"),
    year = jnumber("review/timeStruct","year"),
    yday = jnumber("review/timeStruct","yday"),
    wday = jnumber("review/timeStruct","wday")
  )
dplyr::glimpse(clean)
#> Observations: 1
#> Variables: 23
#> $ document.id       <int> 1
#> $ review_appearance <dbl> 2.5
#> $ beer_style        <chr> "Hefeweizen"
#> $ review_palate     <dbl> 1.5
#> $ review_taste      <dbl> 1.5
#> $ beer_name         <chr> "Sausa Weizen"
#> $ review_time       <chr> "1234817823"
#> $ beer_ABV          <chr> "5"
#> $ beer_beerid       <dbl> 47986
#> $ beer_breweryid    <chr> "10325"
#> $ review_overall    <dbl> 1.5
#> $ review_text       <chr> "A lot of foam. But a lot.\tIn the smell som...
#> $ profile_name      <chr> "stcules"
#> $ review_aroma      <dbl> 2
#> $ isdst             <dbl> 0
#> $ mday              <dbl> 16
#> $ hour              <dbl> 20
#> $ min               <dbl> 57
#> $ sec               <dbl> 3
#> $ mon               <dbl> 2
#> $ year              <dbl> 2009
#> $ yday              <dbl> 47
#> $ wday              <dbl> 0

colearendt commented 7 years ago

I definitely agree that ensuring this sort of behavior is documented and easily accessible is a great call. The vignette is a good place for it as well. For efficiency, I also think this is a worthwhile construct to consider - auto-generating the column names and then correcting as needed.

I am not sure that this functionality is on the CRAN version yet - but you can acquire it using devtools::install_github('jeremystan/tidyjson').

json <- "{\"review/appearance\": 2.5, \"beer/style\": \"Hefeweizen\", \"review/palate\": 1.5, \"review/taste\": 1.5, \"beer/name\": \"Sausa Weizen\", \"review/timeUnix\": 1234817823, \"beer/ABV\": 5.0, \"beer/beerId\": \"47986\", \"beer/brewerId\": \"10325\", \"review/timeStruct\": {\"isdst\": 0, \"mday\": 16, \"hour\": 20, \"min\": 57, \"sec\": 3, \"mon\": 2, \"year\": 2009, \"yday\": 47, \"wday\": 0}, \"review/overall\": 1.5, \"review/text\": \"A lot of foam. But a lot.\\tIn the smell some banana, and then lactic and tart. Not a good start.\\tQuite dark orange in color, with a lively carbonation (now visible, under the foam).\\tAgain tending to lactic sourness.\\tSame for the taste. With some yeast and banana.\", \"user/profileName\": \"stcules\", \"review/aroma\": 2.0}"

d <- json %>% spread_all()

## Rename removing 'review/timeStruct' - presuming without checking
## uniqueness
n <- names(d)
names(d) <- n %>% stringr::str_replace("review/timeStruct\\.", "")
dplyr::glimpse(d)
#> Observations: 1
#> Variables: 23
#> $ document.id       <int> 1
#> $ review/appearance <dbl> 2.5
#> $ beer/style        <chr> "Hefeweizen"
#> $ review/palate     <dbl> 1.5
#> $ review/taste      <dbl> 1.5
#> $ beer/name         <chr> "Sausa Weizen"
#> $ review/timeUnix   <dbl> 1234817823
#> $ beer/ABV          <dbl> 5
#> $ beer/beerId       <chr> "47986"
#> $ beer/brewerId     <chr> "10325"
#> $ review/overall    <dbl> 1.5
#> $ review/text       <chr> "A lot of foam. But a lot.\tIn the smell som...
#> $ user/profileName  <chr> "stcules"
#> $ review/aroma      <dbl> 2
#> $ isdst             <dbl> 0
#> $ mday              <dbl> 16
#> $ hour              <dbl> 20
#> $ min               <dbl> 57
#> $ sec               <dbl> 3
#> $ mon               <dbl> 2
#> $ year              <dbl> 2009
#> $ yday              <dbl> 47
#> $ wday              <dbl> 0
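
The rename above presumes the stripped names stay unique. A small base-R sketch of checking that first is shown here. It assumes the nested columns are named with the `review/timeStruct.` prefix, as the regular expression above implies. The `cols` vector is a hand-written subset for illustration, and `stripped` is a hypothetical name.

```r
# Hand-written subset of column names, assumed to match what spread_all() produced
cols <- c("document.id", "review/appearance",
          "review/timeStruct.isdst", "review/timeStruct.mday")

# Strip the nested-object prefix (anchored at the start of the name)
stripped <- sub("^review/timeStruct\\.", "", cols)

# Refuse to rename if stripping would create duplicate column names
if (anyDuplicated(stripped) > 0) {
  stop("Stripping the prefix would create duplicate column names")
}
```

If the check passes, `names(d) <- stripped` (with the full name vector) is safe in the sense that no two columns collide.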

pgensler commented 7 years ago

@colearendt Yeah, to be honest, I'm not really sure how good of an example this is, but I just want to make it understandable so others can use the API with ease, and not have to resort to esoteric methods such as the following (which I think could work):

What are your thoughts on using purrr to parse the data instead of relying upon lots of specialized functions? In other words, using purrr to accelerate the core "verbs".


json <- "{\"review/appearance\": 2.5, \"beer/style\": \"Hefeweizen\", \"review/palate\": 1.5, \"review/taste\": 1.5, \"beer/name\": \"Sausa Weizen\", \"review/timeUnix\": 1234817823, \"beer/ABV\": 5.0, \"beer/beerId\": \"47986\", \"beer/brewerId\": \"10325\", \"review/timeStruct\": {\"isdst\": 0, \"mday\": 16, \"hour\": 20, \"min\": 57, \"sec\": 3, \"mon\": 2, \"year\": 2009, \"yday\": 47, \"wday\": 0}, \"review/overall\": 1.5, \"review/text\": \"A lot of foam. But a lot.\\tIn the smell some banana, and then lactic and tart. Not a good start.\\tQuite dark orange in color, with a lively carbonation (now visible, under the foam).\\tAgain tending to lactic sourness.\\tSame for the taste. With some yeast and banana.\", \"user/profileName\": \"stcules\", \"review/aroma\": 2.0}"
library(tidyjson)
library(purrr)  # map() comes from purrr
map(json, spread_values)

colearendt commented 7 years ago

@pgensler Very interesting thought. It seems to me that the expectation would be for map(json, spread_values) to behave very much like spread_all does (greedily grab all key-value pairs, auto-generate column names, etc.)? And in that case, I think I prefer tidyjson's approach to keep spread_all and spread_values distinct. spread_values, by comparison, will always return the desired data.frame, regardless of what the input JSON provides, which is oftentimes useful for programming.

library(tidyjson)

## using spread_all
"{\"a\": 1, \"b\": 2, \"c\": 3}" %>% spread_all()
#> # A tbl_json: 1 x 4 tibble with a "JSON" attribute
#>           `attr(., "JSON")` document.id     a     b     c
#>                       <chr>       <int> <dbl> <dbl> <dbl>
#> 1 "{\"a\":1,\"b\":2,\"c..."           1     1     2     3

## using spread_values (same output)
"{\"a\": 1, \"b\": 2, \"c\": 3}" %>% spread_values(a = jnumber(a), b = jnumber(b), 
  c = jnumber(c))

#> # A tbl_json: 1 x 4 tibble with a "JSON" attribute
#>           `attr(., "JSON")` document.id     a     b     c
#>                       <chr>       <int> <dbl> <dbl> <dbl>
#> 1 "{\"a\":1,\"b\":2,\"c..."           1     1     2     3

## using spread_values (with bad input)
"{}" %>% spread_values(a = jnumber(a), b = jnumber(b), c = jnumber(c))

#> # A tbl_json: 1 x 4 tibble with a "JSON" attribute
#>   `attr(., "JSON")` document.id     a     b     c
#>               <chr>       <int> <dbl> <dbl> <dbl>
#> 1                {}           1    NA    NA    NA

## using spread_all (with bad input)
"{}" %>% spread_all()
#> # A tbl_json: 1 x 1 tibble with a "JSON" attribute
#>   `attr(., "JSON")` document.id
#>               <chr>       <int>
#> 1                {}           1

I think you make a good point, though, that spread_values can oftentimes require a lot of typing. As a supplement to spread_all (which is helpful for interactive use), I thought it might be nice to have a way of vectorizing for the purposes of programming and reducing the amount of typing. purrr might be a nice help in that regard. I also think readr has some helpful pointers at vectorizing column-types, which is essentially what we are looking for.

If you haven't installed and explored the development version of tidyjson with devtools::install_github('jeremystan/tidyjson'), I definitely recommend it! There is a helpful amount of new development, updated vignettes, etc.

colearendt commented 7 years ago

Interesting example from this SO post where some way of dealing with multiple arrays in parallel would be helpful to have. The workaround by splitting into separate objects and then combining with a left_join is less than ideal. In particular, see AddressLine and CityCode arrays.

raw_json <- "{
  \"ShipmentID\" :  \"0031632569\",
   \"ShipmentType\" :  \"Cross-border\",
   \"ShipmentStatus\" :  \"Final\",
   \"PartyInfo\" :  [
    {
       \"Type\" :  \"Consignee\",
       \"Code\" :  \"0590000001\",
       \"Name\" :  \"HP Inc. C\\/O XPOLogistics\",
       \"Address\":  {
         \"AddressLine\" :  [
           \"4000 Technology Court\" 
        ] 
      },
       \"City\" :  {
         \"CityName\" :  \"Sandston\",
         \"CityCode\" :  [
           {
             \"value\" :  \"USSAX\",
             \"Qualifier\" :  \"UN\" 
          } 
        ],
         \"State\" :  \"VA\",
         \"CountryCode\" :  \"US\",
         \"CountryName\" :  \"United States\" 
      }
    }
  ]
}"
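
A rough sketch of the split-and-join workaround mentioned above, using a trimmed copy of this JSON. It is one possible arrangement, not necessarily the one from the SO post. The column names (`party_index`, `address_line`, `city_code`, and so on) and the object names are made up for illustration.

```r
library(tidyjson)
library(dplyr)

# Trimmed, hypothetical copy of the shipment JSON
json <- '{
  "ShipmentID": "0031632569",
  "PartyInfo": [
    {
      "Type": "Consignee",
      "Address": {"AddressLine": ["4000 Technology Court"]},
      "City": {
        "CityName": "Sandston",
        "CityCode": [{"value": "USSAX", "Qualifier": "UN"}]
      }
    }
  ]
}'

# One row per party, keeping the array index as a join key
parties <- as.tbl_json(json) %>%
  spread_values(shipment_id = jstring("ShipmentID")) %>%
  enter_object("PartyInfo") %>%
  gather_array("party_index") %>%
  spread_values(type = jstring("Type"), city = jstring("City", "CityName"))

# Branch 1: the array of address lines (plain strings)
address_lines <- parties %>%
  enter_object("Address", "AddressLine") %>%
  gather_array("line_index") %>%
  append_values_string("address_line") %>%
  tibble::as_tibble() %>%
  select(document.id, party_index, address_line)

# Branch 2: the array of city-code objects
city_codes <- parties %>%
  enter_object("City", "CityCode") %>%
  gather_array("code_index") %>%
  spread_values(city_code = jstring("value"), qualifier = jstring("Qualifier")) %>%
  tibble::as_tibble() %>%
  select(document.id, party_index, city_code, qualifier)

# Recombine the branches on the shared keys
combined <- parties %>%
  tibble::as_tibble() %>%
  left_join(address_lines, by = c("document.id", "party_index")) %>%
  left_join(city_codes, by = c("document.id", "party_index"))
```

When a party has several address lines and several city codes, this join yields every combination of the two for that party, which is part of why the workaround is less than ideal.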