colearendt / tidyjson

Tidy your JSON data in R with tidyjson

`json_types()` has inconsistent/undocumented behaviour #147

Open cynthiahqy opened 1 month ago

cynthiahqy commented 1 month ago

Hi -- first up, thank you so much for this package! It's a great idea with lots of tricky details that I think you've dealt with really well.

I noticed that `json_types()` silently overwrites any existing column named `type`, unlike `gather_object()`, which appends a numeric suffix to its default column name `name` and warns the user:

library(tidyjson)
#> 
#> Attaching package: 'tidyjson'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

## create sample json array
wb_1 <- worldbank[1] |>
    json_structure() |>
    filter(level == 1 & type == "array")

## type column: array
wb_1 |>
    gather_array()
#> # A tbl_json: 4 x 11 tibble with a "JSON" attribute
#>   ..JSON    document.id parent.id level index child.id seq    name  type  length
#>   <chr>           <int> <chr>     <int> <int> <chr>    <list> <chr> <fct>  <int>
#> 1 "{\"Name…           1 1             1     5 1.5      <list> majo… array      4
#> 2 "{\"Name…           1 1             1     5 1.5      <list> majo… array      4
#> 3 "{\"Name…           1 1             1     5 1.5      <list> majo… array      4
#> 4 "{\"Name…           1 1             1     5 1.5      <list> majo… array      4
#> # ℹ 1 more variable: array.index <int>

## existing `type` column is silently overwritten
## to reflect types of elements in array (i.e. each row)
wb_1 |>
    gather_array() |>
    json_types()
#> # A tbl_json: 4 x 11 tibble with a "JSON" attribute
#>   ..JSON    document.id parent.id level index child.id seq    name  type  length
#>   <chr>           <int> <chr>     <int> <int> <chr>    <list> <chr> <fct>  <int>
#> 1 "{\"Name…           1 1             1     5 1.5      <list> majo… obje…      4
#> 2 "{\"Name…           1 1             1     5 1.5      <list> majo… obje…      4
#> 3 "{\"Name…           1 1             1     5 1.5      <list> majo… obje…      4
#> 4 "{\"Name…           1 1             1     5 1.5      <list> majo… obje…      4
#> # ℹ 1 more variable: array.index <int>

## but this is inconsistent with gather_object(),
## which keeps the existing `name` column, adds `name.2`,
## and warns the user:
wb_1 |>
    gather_array() |>
    json_types() |>
    gather_object()
#> Warning in gather_object(json_types(gather_array(wb_1))): name column name
#> already exists, changing to name.2
#> # A tbl_json: 8 x 12 tibble with a "JSON" attribute
#>   ..JSON    document.id parent.id level index child.id seq    name  type  length
#>   <chr>           <int> <chr>     <int> <int> <chr>    <list> <chr> <fct>  <int>
#> 1 "\"Educa…           1 1             1     5 1.5      <list> majo… obje…      4
#> 2 "46"                1 1             1     5 1.5      <list> majo… obje…      4
#> 3 "\"Educa…           1 1             1     5 1.5      <list> majo… obje…      4
#> 4 "26"                1 1             1     5 1.5      <list> majo… obje…      4
#> 5 "\"Publi…           1 1             1     5 1.5      <list> majo… obje…      4
#> 6 "16"                1 1             1     5 1.5      <list> majo… obje…      4
#> 7 "\"Educa…           1 1             1     5 1.5      <list> majo… obje…      4
#> 8 "12"                1 1             1     5 1.5      <list> majo… obje…      4
#> # ℹ 2 more variables: array.index <int>, name.2 <chr>

Created on 2024-10-08 with reprex v2.0.2

Would it be possible to modify `json_types()` to behave consistently with `gather_object()`? That is, it would NOT overwrite an existing `type` column, but would instead add a new column `type.2` and warn the user whenever a `type` column already exists, for example when `json_types()` is called more than once.
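
In the meantime, here is a rough sketch of the behaviour I have in mind, written as a user-side wrapper. `json_types_safe()` is a hypothetical helper name (not part of tidyjson), and the sketch assumes that the `column.name` argument of `json_types()` accepts the new column name as a string. The warning text simply mimics the one from `gather_object()`.

```r
library(tidyjson)
library(dplyr)

# Hypothetical helper, not part of tidyjson: choose a column name that does
# not clash with an existing column, warn if it had to change, then delegate
# to json_types().
json_types_safe <- function(.x, column.name = "type") {
  new_name <- column.name
  suffix <- 2
  # Try type.2, type.3, ... until the name is unused
  while (new_name %in% names(.x)) {
    new_name <- paste0(column.name, ".", suffix)
    suffix <- suffix + 1
  }
  if (new_name != column.name) {
    warning(column.name, " column name already exists, changing to ", new_name)
  }
  json_types(.x, column.name = new_name)
}

# Same sample data as in the reprex above
wb_types <- worldbank[1] |>
    json_structure() |>
    filter(level == 1 & type == "array") |>
    gather_array() |>
    json_types_safe()

# Compare the original `type` column with the newly added one
wb_types |>
    as_tibble() |>
    select(type, type.2)
```

A simpler workaround for a one-off case is to pass a different name directly, e.g. `json_types(column.name = "element_type")`, but that relies on the user already knowing about the clash.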