dcooley closed this issue 4 years ago.
Lovely :)
Fair point re a possible interface package, but I think it would be lovely to have this and expose it. This also turned into a really neat conversation with upstream, who may enjoy seeing the feature present and who gain, if you will, extra 'test coverage' because some more user data may be coming this way.
I started playing with simdjson + Rcpp back in September(?), but I clearly didn't make much public progress: https://github.com/knapply/simdjsonr
@dcooley, I haven't spent any time on a "simplify" workflow (I just don't really use it myself), but now that I've seen the JSON Pointer part in action, I think simdjson is a total game-changer.
I have no idea how I missed the Pointer functionality in RapidJSON, but I barely knew what I was doing in C++ when I was working with it more regularly (not that I know what I'm doing 7ish months later). Regardless, it has already made working with enormous (ND)JSON(L) data sets from R actually viable.
All that said, it doesn't make much sense to do my own thing elsewhere, especially since this is already on CRAN and rapport with the folks upstream has been established.
The approach I'm using follows (I just dropped it in gists instead of pushing a bunch of garbage to the old repo); I'll start aggregating things into my fork for proper PRs.
Dave, I suspect some combination of our approaches may make sense 🤷♂, but it's a pretty safe bet you've spent more time thinking about JSON (and I default to assuming my C++ code is a ticking time bomb).
https://gist.github.com/knapply/0cfda08e85ba3fa4f7e61071f83d4768
... and simdjson_parse.md has the R wrapper function and some examples of what it looks like in action...
simdjson_parse <- function(x, json_pointer = "",
                           int64 = c("auto", "integer64", "string", "double"),
                           error_on_bad_parse = TRUE) {
  int64 <- match.arg(int64)
  if (int64 %in% c("auto", "integer64")) {
    bit64_available <- requireNamespace("bit64", quietly = TRUE)
    if (int64 == "integer64" && !bit64_available) {
      stop('`int64` set to `"integer64"`, but {bit64} is not installed.')
    }
    # with {bit64} available, int64_t maps to bit64::integer64;
    # otherwise fall back to character
    int64 <- if (bit64_available) "integer64" else "string"
  }
  out <- .parse_json_impl(
    json = x, json_pointer,
    bit64_integer64 = int64 == "integer64", # int64_t as bit64::integer64
    int_64_strings = int64 == "string",     # int64_t as character
    error_on_bad_parse = error_on_bad_parse # otherwise int64_t as double
  )
  if (length(out) > 1L) out else out[[1L]]
}
simdjson_parse("[]")
## list()
simdjson_parse("{}")
## named list()
simdjson_parse('{"simd":["j","s","o","n"]}')
## $simd
## $simd[[1]]
## [1] "j"
##
## $simd[[2]]
## [1] "s"
##
## $simd[[3]]
## [1] "o"
##
## $simd[[4]]
## [1] "n"
simdjson_parse(c("bad_json", '{"good_json":true}'))
## Error in .parse_json_impl(json = x, json_pointer, bit64_integer64 = TRUE, : parse error
simdjson_parse(c("bad_json", '{"good_json":true}'), error_on_bad_parse = FALSE)
## Warning in .parse_json_impl(json = x, json_pointer, bit64_integer64 = TRUE, :
## parse error
## [[1]]
## NULL
##
## [[2]]
## [[2]]$good_json
## [1] TRUE
simdjson_parse('{"ints":[1,2,3]}')
## $ints
## $ints[[1]]
## [1] 1
##
## $ints[[2]]
## [1] 2
##
## $ints[[3]]
## [1] 3
is.integer(unlist(simdjson_parse('{"ints":[1,2,3]}')))
## [1] TRUE
simdjson_parse('{"big_int":1178007955838509057}')
## $big_int
## integer64
## [1] 1178007955838509057
simdjson_parse('{"big_int":2356015911677018114}', int64 = "string")
## $big_int
## [1] "2356015911677018114"
simdjson_parse('{"big_int":3534023867515527171}', int64 = "double")
## $big_int
## [1] 3.534024e+18
simdjson_parse(
'{"big_ints":[{"a":1178007955838509057,"b":2356015911677018114,"c":[2356015911677018114,4712031823354036228]}]}',
json_pointer = "big_ints/0/c/1"
)
## integer64
## [1] 4712031823354036228
tweet_json <- readr::read_lines("../tweetio/inst/example-data/ufc-tweet-stream.json")
test_json <- tweet_json[vapply(tweet_json, jsonlite::validate, logical(1L))]
length(test_json)
## [1] 100000
library(jsonlite)
# library(jsonify, warn.conflicts = FALSE)
bench::mark(
simdjson = simdjson <- simdjson_parse(test_json),
fairer_simdjson = fairer_simdjson <- lapply(test_json, simdjson_parse)
# jsonify = jsonify <- lapply(test_json, from_json, simplify = FALSE) # segfaults when knitting...?
,
jsonlite = jsonlite <- lapply(test_json, parse_json)
,
check = FALSE
)
## Warning: Some expressions had a GC in every iteration; so filtering is disabled.
## # A tibble: 3 x 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 simdjson 6.85s 6.85s 0.146 243MB 0.146
## 2 fairer_simdjson 11.02s 11.02s 0.0907 474MB 0.0907
## 3 jsonlite 41.89s 41.89s 0.0239 234MB 0.0239
simdjson[[200]]$entities$user_mentions[[1]][c("id", "id_str", "indices")]
## $id
## integer64
## [1] 3230043854
##
## $id_str
## [1] "3230043854"
##
## $indices
## $indices[[1]]
## [1] 3
##
## $indices[[2]]
## [1] 14
# jsonify[[200]]$entities$user_mentions[[1]][c("id", "id_str", "indices")]
jsonlite[[200]]$entities$user_mentions[[1]][c("id", "id_str", "indices")]
## $id
## [1] 3230043854
##
## $id_str
## [1] "3230043854"
##
## $indices
## $indices[[1]]
## [1] 3
##
## $indices[[2]]
## [1] 14
Yeah, I've been focused on getting the correct R object for the given JSON, which includes the simplification processes, and I haven't so much concentrated on performance.
Here are a few tests and examples. Currently some `int64`s are returned to R as `numeric`, so that needs to be handled, but most of the logic for returning the correct R structure is there.
A quick benchmark suggests there's some overhead I haven't accounted for, as this implementation is currently slower than jsonify.
library(jsonify)
library(jsonlite)
library(RcppSimdJson)
library(microbenchmark)
js <- readLines('http://opendata.canterburymaps.govt.nz/datasets/fb00b553120b4f2fac49aa76bc8d82aa_26.geojson')
js <- paste0(js, collapse = "")
microbenchmark::microbenchmark(
jsonify = { jfy <- jsonify::from_json( js ) },
jsonlite = { jlt <- jsonlite::fromJSON( js ) },
simdjson = { sim <- RcppSimdJson::from_json( js ) },
times = 5
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# jsonify 138.4742 139.7152 140.6192 139.7596 141.3846 143.7623 5
# jsonlite 1230.4436 1232.6133 1256.6139 1251.1161 1267.9963 1300.9003 5
# simdjson 201.8796 202.3721 203.7961 202.5413 204.1732 208.0143 5
@dcooley, try benchmarking a non-geojson file: https://github.com/simdjson/simdjson/blob/master/doc/performance.md#number-parsing
I'm trying to figure out how to get this to pass R CMD check with the latest simdjson (this one is still missing `.size()` for `dom::object`s), but it seems that it's going to require changes upstream (stderr and abort calls).
Now that RTools40 is considered stable, this should be viable for Windows R users. It's worth noting that a `string_view` implementation is now bundled (although it's unclear to me what the minimum GCC version is). I'm all for ditching C++11 from the start, but it's worth considering.
@eddelbuettel Are you opposed to `RcppSimdJson` housing the R/Rcpp interface while the simdjson library itself stays self-contained (like how @dcooley approached `rapidjsonr` + `jsonify`)? There are pros and cons, but I've found this dependency-free header silo + R/Rcpp interface paradigm to be extremely useful.
`stderr` and `abort`: maybe talk to upstream. I seem to recall my last few updates were simple after I explained the issue, and it was my understanding that simdjson would not revert (some statements to the effect of "no i/o in the library", etc.).
Are you opposed: I am not sure what it is you are proposing. This package exists, and already provides the simdjson header via CRAN. I would suggest keeping it that way. @dcooley and I don't seem to have a problem adding functionality to the `src/` and `R/` directories of this package. So can you maybe reword your suggestion? Thx!
Let me rephrase. The way I see it we have a few options:
I don't have any other package where header and use are split. We could do that but I don't yet see a really compelling reason besides "well we can". But I may miss something. In any event we can revisit...
maybe talk to upstream
That's the plan. I'd just like to have a solution ready first.
We could do that but I don't yet see a really compelling reason besides "well we can". But I may miss something.
The reason is flexibility. The omnipresence of JSON extends to environments and systems with all kinds of requirements; some valid, some nonsensical.
In any event we can revisit...
I was thinking that splitting now (while it has a minimal number of users) would prevent disruptive headaches later. After more consideration, keeping things together is probably safer: simdjson itself is relatively young and it's clearly evolving... and I suppose copying the two amalgamated files still works in a pinch.
Right. I still think keeping it as one is preferable; the whole may offer more. I missed that chance with CCTZ (wrapped as RcppCCTZ) and now the sources linger in three other packages for no benefit. It's somewhat suboptimal.
although it's unclear to me what the minimum GCC version is
Please see
https://github.com/simdjson/simdjson/blob/master/doc/basics.md#requirements
stderr and abort: maybe talk to upstream
There is no use of stderr or abort in the main library as far as we know. If there is, please report it as a bug.
Follow-up: abort and stderr did get back into the library. The problem is that we did not have tests. I have added such tests this time around so it should stay away.
I said it last time and I'll say it again: really, really appreciate that. Makes our downstream work a lot easier.
@eddelbuettel Yes. Removing offending code is easy. Tracking new commits and checking every line to make sure that we don't fall back is harder. This sort of work needs to be automated.
@lemire Thank you so much!
https://github.com/simdjson/simdjson/pull/893 has not yet been merged, but I took the new amalgamation for a test drive in my fork.
@dcooley I pulled your changes and added `is_really_int64_t()` and `resolve_int()` here, then swapped all the integer getters for `resolve_int()`.
I think a portion of the overhead you're seeing is coming from redundant checks. Specifically, if types are confirmed with `simdjson::dom::element::is()` or `dom::element::type()`, the value can be safely extracted with `T(dom::element)`, `T = dom::element`, or even `Rcpp::wrap<T>(dom::element)`. Extracting elements with `dom::element::get()` and `dom::array::at()` is going to slow things down because they return `simdjson_result`s and can throw exceptions. I certainly have some more exploring to do though.
After making a few other small modifications, RcppSimdJson builds on Linux, Mac, and Windows with 100% of the tests @dcooley referenced at https://github.com/eddelbuettel/rcppsimdjson/issues/10#issuecomment-629918466 passing via GitHub Actions. The only R CMD check warnings are coming from the undocumented exports.
@eddelbuettel, if you want to keep CI to only Travis + Docker, please say the word.
if you want to keep CI to only Travis + Docker, please say the word.
"word"
I looked into the alternatives, and remain content with Travis CI.
Note that simdjson can be used with or without exceptions. We have two distinct "sub-APIs" depending on the mode you are using. It is possible to control this with macros, and it depends in part on how you compile the library (with or without exceptions). So you definitely do not have to deal with exceptions if you do not want to. It is usually the case that relying on exceptions comes with a performance overhead.
@knapply
I pulled your changes and added `is_really_int64_t()` and `resolve_int()` here, then swapped all the integer getters for `resolve_int()`.
Do you want to make a PR with these changes included, as well as anything else you've got covered in your other PR?
try benchmarking a non-geojson file:
Yeah, there's definitely something wrong with the way I'm using simdjson, given these results.
library(jsonify)
library(jsonlite)
library(RcppSimdJson)
library(microbenchmark)
n <- 1e5
df <- data.frame(
x = 1:n
, y = sample( letters, size = n, replace = T)
)
js <- jsonify::to_json(df)
microbenchmark::microbenchmark(
jsonify = { jfy <- jsonify::from_json( js ) },
jsonlite = { jlt <- jsonlite::fromJSON( js ) },
simdjson = { sim <- RcppSimdJson::from_json( js ) },
times = 5
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# jsonify 247.4799 285.1659 335.8276 366.9796 373.1176 406.3950 5
# jsonlite 289.5658 292.8082 335.9969 300.2588 325.5345 471.8174 5
# simdjson 37025.7764 37443.0639 37941.6096 37634.0160 38023.0598 39582.1316 5
You are taking 38 seconds to parse about 2MB of data? That's just not possible.
It is a bit difficult to reason in milliseconds. It is easier if you break it down in, say, GB/s.
I am not a R user, so I tried to guess what the script would generate... and I implemented it in Python:
import string
import random

lower_upper_alphabet = string.ascii_letters

def randomletter():
    return random.choice(lower_upper_alphabet)

print("[", end="")
for i in range(1, 100000):
    print("{\"id\":" + str(i) + ",\"val\":\"" + randomletter() + "\"},", end="")
print("{\"id\":" + str(i) + ",\"val\":\"" + randomletter() + "\"}", end="")
print("]", end="\n")
This generates a crazy file which I called crazy.json. Then I ran a benchmark over it...
$ ./benchmark/parsingcompetition ../crazy.json
simdjson (dynamic mem) : 4.485 cycles per input byte (best) 4.957 cycles (avg) 0.760 GB/s (error margin: 0.072 GB/s) 332 documents/s (best) 300 documents/s (avg)
simdjson : 4.486 cycles per input byte (best) 4.511 cycles (avg) 0.760 GB/s (error margin: 0.004 GB/s) 332 documents/s (best) 330 documents/s (avg)
RapidJSON : 17.655 cycles per input byte (best) 17.766 cycles (avg) 0.193 GB/s (error margin: 0.001 GB/s) 84 documents/s (best) 84 documents/s (avg)
RapidJSON (accurate number parsing) : 19.179 cycles per input byte (best) 19.202 cycles (avg) 0.178 GB/s (error margin: 0.000 GB/s) 78 documents/s (best) 78 documents/s (avg)
RapidJSON (insitu) : 15.888 cycles per input byte (best) 15.945 cycles (avg) 0.214 GB/s (error margin: 0.001 GB/s) 94 documents/s (best) 93 documents/s (avg)
RapidJSON (insitu, accurate number parsing) : 17.798 cycles per input byte (best) 17.827 cycles (avg) 0.191 GB/s (error margin: 0.000 GB/s) 84 documents/s (best) 84 documents/s (avg)
So we achieve ~0.75 GB/s which is very low for simdjson, but it is a somewhat adversarial (synthetic example) case.
OK. So let us turn this into milliseconds. My file spans 2288896 bytes, so we have about 0.2% of a GB. We divide this by 0.75 GB/s to get the time in seconds, then multiply by 1000 to get milliseconds: 2288896 / 1000000000 / 0.75 * 1000, which is about 3 milliseconds.
So I would expect simdjson to take about 3 milliseconds to parse this file. Of course, there may be overhead that I am not aware of...
But there is no possible way that it goes up to 38 seconds.
try benchmarking a non-geojson file
With `canada.json` (a geojson file), which is one of our standard test files, we get better than 0.8 GB/s on a 3.4 GHz Skylake processor. A 3.4 GHz Skylake is really quite ordinary at this point.
I don't think it is possible to build an input JSON such that it would take 38 seconds to parse 3 MB... I would argue that no non-broken JSON parser could possibly be that slow.
Let's keep it apples to apples. It is no longer parse speed alone.
@dcooley is trying to build a data structure to return to R, and we typically have a few constraints on the way (having a limited set of types is one). So there will be copies, and in phase one there may be extra copies. Such is life. I trust Dave, who has put together amazing stuff (off JSON input) for the mapdeck viz. Let's not quite shoot with real bullets yet.
@dcooley, I think I have a reasonable workflow for the integer stuff, brought up in #13, that won't be regretted (too badly) later. I just want to confirm that's the desired direction before it invades any code.
I'm still trying to grok what the `simplify` rules exactly are, but there will definitely be a speed jump from separating concerns between the simplify and "vanilla" routines.
`.parse_json()` is a "refined" version of what I discussed earlier in the thread. It's meant to be a clone of `jsonlite::parse_json()` (I haven't actually looked at its internals) with its default arguments (so no simplify), so I'm sure having fewer branches helps it zip along. I'm kinda amazed that, so far, its results are `identical()` to `jsonlite::parse_json()`'s.
Of course, this is the "easy" part; you're tackling a lot more with `from_json(simplify = TRUE)`...
js <- paste0(readLines("https://github.com/zemirco/sf-city-lots-json/raw/master/citylots.json"),
collapse = "")
pryr::object_size(js)
#> 189 MB
microbenchmark::microbenchmark(
jsonlite = jsonlite::parse_json(js),
simdjson = RcppSimdJson:::.parse_json(js)
,
times = 1,
check = "identical"
)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> jsonlite 4.402949 4.402949 4.402949 4.402949 4.402949 4.402949 1
#> simdjson 1.212326 1.212326 1.212326 1.212326 1.212326 1.212326 1
rcppsimdjson_dir <- "~/Documents/rcppsimdjson/inst/jsonexamples/"
json_file_paths <- dir(rcppsimdjson_dir, pattern = "\\.json$", full.names = TRUE)
names(json_file_paths) <- dir(rcppsimdjson_dir, pattern = "\\.json$")
jsons <- vapply(
json_file_paths,
function(.x) paste0(readLines(.x, warn = FALSE), collapse = ""),
character(1L)
)
bench_marks <- mapply(
function(.x, .y) {
res <- microbenchmark::microbenchmark(
jsonlite = jsonlite::parse_json(.x),
simdjson = RcppSimdJson:::.parse_json(.x)
,
times = 10,
unit = "ms",
check = "identical"
)
cat("********************** ", .y, "\n")
print(res, order = "median")
cat("\n\n")
cbind(data.frame(file_name = .y), as.data.frame(res))
},
jsons,
names(jsons),
SIMPLIFY = FALSE
)
#> ********************** apache_builds.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 0.434170 0.451185 0.5107266 0.4735555 0.592085 0.608644 10
#> jsonlite 1.142164 1.173881 1.3241093 1.2536620 1.310450 2.024043 10
#>
#>
#> ********************** canada.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 7.230551 7.456116 7.875385 7.849235 8.224713 8.706415 10
#> jsonlite 42.288008 43.653252 44.711665 44.029844 46.164748 47.662902 10
#>
#>
#> ********************** citm_catalog.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 3.638028 3.704805 4.300136 3.886545 5.053951 5.899548 10
#> jsonlite 22.725979 22.934923 23.873911 23.696750 24.524041 25.694399 10
#>
#>
#> ********************** github_events.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 0.195300 0.207428 0.2620517 0.2547240 0.280591 0.427534 10
#> jsonlite 0.918447 0.960214 0.9794883 0.9879705 1.010506 1.017453 10
#>
#>
#> ********************** gsoc-2018.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 7.289759 7.891569 8.145152 7.990608 8.276493 9.496555 10
#> jsonlite 13.240555 13.401006 14.031844 13.863335 14.754366 15.289081 10
#>
#>
#> ********************** instruments.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 0.587803 0.589676 0.694585 0.6076295 0.811805 0.892138 10
#> jsonlite 2.040344 2.153086 2.228978 2.1947955 2.283114 2.598088 10
#>
#>
#> ********************** marine_ik.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 12.97164 13.33626 13.82794 13.56599 13.6018 16.91775 10
#> jsonlite 72.88100 73.37122 76.34738 75.26905 77.1654 87.12118 10
#>
#>
#> ********************** mesh.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 2.305229 2.391161 2.618197 2.558874 2.82882 3.046341 10
#> jsonlite 19.313140 19.674100 20.896420 21.152274 21.49891 23.122117 10
#>
#>
#> ********************** mesh.pretty.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 2.875675 2.90280 3.047431 3.013965 3.160781 3.32808 10
#> jsonlite 19.739430 20.92576 25.905227 27.926010 29.914193 30.26666 10
#>
#>
#> ********************** numbers.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 0.339428 0.341582 0.3677705 0.3715365 0.377728 0.417071 10
#> jsonlite 2.934479 2.941936 3.0214713 2.9529865 2.987781 3.581713 10
#>
#>
#> ********************** random.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 2.483639 2.506914 2.905289 2.568928 3.212158 4.380286 10
#> jsonlite 10.048730 10.181996 11.190389 10.640860 12.015381 13.486832 10
#>
#>
#> ********************** twitter.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 1.767523 2.140532 2.225777 2.201807 2.332409 2.853772 10
#> jsonlite 9.487721 9.667403 9.996055 9.958649 10.020133 10.824681 10
#>
#>
#> ********************** twitterescaped.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 1.890421 1.938308 2.100133 1.983586 2.315762 2.384929 10
#> jsonlite 5.442077 5.511250 5.902488 5.638079 6.064283 7.470741 10
#>
#>
#> ********************** update-center.json
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> simdjson 2.194598 2.274436 2.625021 2.625768 2.965949 3.070396 10
#> jsonlite 9.733094 9.853983 10.574512 10.206547 10.945863 12.812568 10
df <- do.call(rbind, bench_marks)
I've done some tests on my `from_json()` and the bottleneck is coming from getting the data types inside each element. The `jsonify` version is here, and my `RcppSimdJson` version is here.
(I've made this test small and quick 'cos I was getting annoyed waiting for it to run each time I did a test, but this result is representative of larger examples.)
n <- 1e4L
df <- data.frame(x = 1L:n)
js <- jsonify::to_json( df )
microbenchmark::microbenchmark(
jsonify = { res <- jsonify:::rcpp_get_dtypes( js ) },
rcppsimd = { RcppSimdJson:::rcpp_get_dtypes( js ) }
)
# Unit: microseconds
# expr min lq mean median uq max neval
# jsonify 657.069 673.832 715.9397 697.081 730.814 1013.196 100
# rcppsimd 85469.820 86976.375 94386.8532 92468.832 98570.788 124948.499 100
@knapply FYI, `get_dtypes()` gets the data types of each element inside an object or an array, which I use to determine whether the object can be simplified or not. In `jsonify` this has very little cost, so I thought I could simply bring it across to here. But these tests suggest it's not the correct approach.
@dcooley Yea, that seems weird. Have you tried it without using `.get<SIMDJSON-TYPE>()` (kinda like below), or not passing everything by reference?
After walking through and playing with `jsonify` earlier, I'm a bit confused. Is there a reference you're using for the simplify routine? I'm getting different enough results when comparing to `jsonlite`'s that I'm not confident I get the rules.
test <- '{"test":[1,[2,[3]]]}'
jsonlite::fromJSON(test)
#> $test
#> $test[[1]]
#> [1] 1
#>
#> $test[[2]]
#> $test[[2]][[1]]
#> [1] 2
#>
#> $test[[2]][[2]]
#> [1] 3
jsonify::from_json(test)
#> $test
#> $test[[1]]
#> [1] 1
#>
#> $test[[2]]
#> [,1]
#> [1,] 2
#> [2,] 3
To me, there's nothing to simplify here, so `jsonlite` is closer to what I'd expect (but still weird). That said, very little of the JSON data I deal with is numerical, so I'm sure there's something I'm missing.
I also realized that things can be simplified down to matrices, which complicates the integer handling (possible `integer64` matrices and beyond). Do you stop at matrices, or are 3D+ arrays possible?
Is there a reference you're using for the simplify routine
Not really a reference, but my rule is: round-trips have to work.
So if you simplified down to a matrix, you couldn't then get back to the `[, [, [...]]]` structure.
So I'm using "simplify" to mean: the simplest structure possible without breaking the original JSON structure.
But it looks like you've found an issue
Sorry! That wasn't what I intended. 🤦♂️😬
Edit: I'll move the example over there.
I have a deserialization routine that seems to check a lot of boxes (multiple levels of type-strictness and simplification): https://github.com/knapply/rcppsimdjson/tree/feature/deserialize
I haven't quite sorted out how to best handle nested data frames. The way jsonify and jsonlite go about it seems different enough that I need to reevaluate.
What I'm envisioning is being able to clone `jsonify::from_json()` and whatever else, but with the individual parts sufficiently modular that alternatives and modifications don't require a total rebuild. Be that here, or in other packages that want to leverage the pre-built boilerplate.
I'm sure the code has issues (C++ has been a total uphill battle for me), but the results seem promising.
json1 <- readr::read_file(
"~/Documents/rcppsimdjson/inst/jsonexamples/canada.json"
)
json2 <- readr::read_file(
"~/Documents/rcppsimdjson/inst/jsonexamples/gsoc-2018.json"
)
microbenchmark::microbenchmark(
rcppsimdjson1 = RcppSimdJson:::.deserialize_json(json1),
jsonify1 = jsonify::from_json(json1),
jsonlite = jsonlite::fromJSON(json1)
,
times = 3
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> rcppsimdjson1 4.474563 5.140195 6.12159 5.805827 6.945103 8.084379 3
#> jsonify1 45.232373 45.339811 47.12165 45.447250 48.066282 50.685314 3
#> jsonlite 462.022187 467.628038 478.05169 473.233888 486.066435 498.898981 3
microbenchmark::microbenchmark(
rcppsimdjson2 = RcppSimdJson:::.deserialize_json(json2),
jsonify2 = jsonify::from_json(json2),
jsonlite2 = jsonlite::fromJSON(json2)
,
times = 3
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> rcppsimdjson2 9.390968 10.85924 12.44297 12.32752 13.96897 15.61043 3
#> jsonify2 30.635981 30.80953 33.61642 30.98309 35.10664 39.23019 3
#> jsonlite2 96.001989 98.44159 108.28356 100.88118 114.42435 127.96752 3
This is what it looks like in action ...
type_policy <- list(
anything_goes = 0,
ints_as_dbl = 1,
strict = 2
)
int64_opt <- list(
double = 0,
string = 1,
integer64 = 2
)
js <- '[[1,2,3],
["4","5",null],
[1,2,3.3],
[true,false,true],
[10000000000,20000000000,30000000000]]'
RcppSimdJson:::.deserialize_json(js)
#> [,1] [,2] [,3]
#> [1,] "1" "2" "3"
#> [2,] "4" "5" NA
#> [3,] "1" "2" "3.30"
#> [4,] "TRUE" "FALSE" "TRUE"
#> [5,] "10000000000" "20000000000" "30000000000"
RcppSimdJson:::.deserialize_json(js, type_policy = type_policy$ints_as_dbl)
#> [[1]]
#> [1] 1 2 3
#>
#> [[2]]
#> [1] "4" "5" NA
#>
#> [[3]]
#> [1] 1.0 2.0 3.3
#>
#> [[4]]
#> [1] TRUE FALSE TRUE
#>
#> [[5]]
#> [1] 1e+10 2e+10 3e+10
RcppSimdJson:::.deserialize_json(js, type_policy = type_policy$strict)
#> [[1]]
#> [1] 1 2 3
#>
#> [[2]]
#> [1] "4" "5" NA
#>
#> [[3]]
#> [[3]][[1]]
#> [1] 1
#>
#> [[3]][[2]]
#> [1] 2
#>
#> [[3]][[3]]
#> [1] 3.3
#>
#>
#> [[4]]
#> [1] TRUE FALSE TRUE
#>
#> [[5]]
#> [1] 1e+10 2e+10 3e+10
RcppSimdJson:::.deserialize_json(js, type_policy = type_policy$strict,
int64_r_type = int64_opt$string)
#> [[1]]
#> [1] 1 2 3
#>
#> [[2]]
#> [1] "4" "5" NA
#>
#> [[3]]
#> [[3]][[1]]
#> [1] 1
#>
#> [[3]][[2]]
#> [1] 2
#>
#> [[3]][[3]]
#> [1] 3.3
#>
#>
#> [[4]]
#> [1] TRUE FALSE TRUE
#>
#> [[5]]
#> [1] "10000000000" "20000000000" "30000000000"
RcppSimdJson:::.deserialize_json(js, type_policy = type_policy$strict,
int64_r_type = int64_opt$integer64)
#> [[1]]
#> [1] 1 2 3
#>
#> [[2]]
#> [1] "4" "5" NA
#>
#> [[3]]
#> [[3]][[1]]
#> [1] 1
#>
#> [[3]][[2]]
#> [1] 2
#>
#> [[3]][[3]]
#> [1] 3.3
#>
#>
#> [[4]]
#> [1] TRUE FALSE TRUE
#>
#> [[5]]
#> integer64
#> [1] 10000000000 20000000000 30000000000
RcppSimdJson:::.deserialize_json('[{"id":1,"val":"a"},{"id":2,"val":"b"}]')
#> id val
#> 1 1 a
#> 2 2 b
RcppSimdJson:::.deserialize_json('[{"id":1,"val":"a"},{"id":2,"val":["b","c"]}]')
#> id val
#> 1 1 a
#> 2 2 b, c
RcppSimdJson:::.deserialize_json('[{"id":1,"val":"a"},{"id":2,"val":["b","c"]}]',
json_pointer = '1/val/0')
#> [1] "b"
... and these are the types of nested data frames that still need some thought...
x <- data.frame(driver = c("Bowser", "Peach"), occupation = c("Koopa", "Princess"))
x$vehicle <- data.frame(model = c("Piranha Prowler", "Royal Racer"))
x$vehicle$stats <- data.frame(speed = c(55, 56), weight = c(67, 24), drift = c(35, 32))
js <- jsonify::to_json(x)
str(jsonlite::fromJSON(js)) # identical() to jsonify
#> 'data.frame': 2 obs. of 3 variables:
#> $ driver : chr "Bowser" "Peach"
#> $ occupation: chr "Koopa" "Princess"
#> $ vehicle :'data.frame': 2 obs. of 2 variables:
#> ..$ model: chr "Piranha Prowler" "Royal Racer"
#> ..$ stats:'data.frame': 2 obs. of 3 variables:
#> .. ..$ speed : num 55 56
#> .. ..$ weight: num 67 24
#> .. ..$ drift : num 35 32
str(RcppSimdJson:::.deserialize_json(js))
#> 'data.frame': 2 obs. of 3 variables:
#> $ driver : chr "Bowser" "Peach"
#> $ occupation: chr "Koopa" "Princess"
#> $ vehicle :List of 2
#> ..$ :List of 2
#> .. ..$ model: chr "Piranha Prowler"
#> .. ..$ stats:List of 3
#> .. .. ..$ speed : num 55
#> .. .. ..$ weight: num 67
#> .. .. ..$ drift : num 35
#> ..$ :List of 2
#> .. ..$ model: chr "Royal Racer"
#> .. ..$ stats:List of 3
#> .. .. ..$ speed : num 56
#> .. .. ..$ weight: num 24
#> .. .. ..$ drift : num 32
As someone who came to R well after data.table and dplyr came about, the multi-column data frames that jsonify/jsonlite build are completely bizarre to me. There's also this...
Warning message:
In data.table::setDT(x) :
Some columns are a multi-column type (such as a matrix column): [3]. setDT will retain these columns as-is but subsequent operations like grouping and joining may fail. Please consider as.data.table() instead which will create a new column for each embedded column.
I'm not suggesting that only "enhanced" data frame users be considered, it's more that the power of RcppSimdJson is going to be in the ability to ingest yuge data sets, so being able to hand off to those packages with minimal fidgeting would be nice.
With that in mind, if there's a standard (or even legacy?) use case for them, it'd be helpful to know what it is so we can consider which options best support it. If anyone has thoughts on that, I'd love to hear them.
The timings are very enticing. And being able to deal with 'simple' structures (but at scale) has total merit. Think `ndjson` logs, for example: potentially yuge, but not nested.
That's 99% my use-case, but sadly they're not always simple.
`tweetio` began as an exercise to figure out how to handle large, complicated JSON streams while staying in R (and fortunately found some practical uses, but it's sorely in need of an update now that I kinda know what I'm doing).
The cool thing about simdjson's "JSON pointer" capability is that it will minimize the need for the insanely tedious mapping to custom structures I had to do there.
I'll pull the deserialize routine over here. It is not exactly simple (the type dynamism was... rough), but more sets of eyes may help.
I think as long as the underlying data relationships are maintained, the R representation shouldn't really matter. So if there is a good way of representing nested JSON objects in a way suitable for `data.table`, `tibble`, whatever, etc., then I see no reason not to use those structures.
That's what I'm thinking as well, but I'm wondering if any folks rely on that structure. Just food for thought.
Here's a "fairer" benchmark from #17 using a bigger file.
# js <- readr::read_file("https://github.com/zemirco/sf-city-lots-json/raw/master/citylots.json")
js <- readr::read_file("~/Documents/citylots.json")
bench::mark(
  rcppsimdjson = rcppsimdjson <- RcppSimdJson:::.deserialize_json(js),
  jsonify = jsonify <- jsonify:::rcpp_from_json(js, simplify = TRUE, fill_na = FALSE),
  jsonlite = jsonlite <- jsonlite:::parse_and_simplify(js, simplifyVector = TRUE, simplifyDataFrame = TRUE, simplifyMatrix = TRUE),
  filter_gc = FALSE,
  check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 rcppsimdjson 895.88ms 895.88ms 1.12 58.6MB 0
#> 2 jsonify 7.49s 7.49s 0.134 104.5MB 0.668
#> 3 jsonlite 37.85s 37.85s 0.0264 369MB 1.24
microbenchmark::microbenchmark(
  rcppsimdjson = rcppsimdjson <- RcppSimdJson:::.deserialize_json(js),
  jsonify = jsonify <- jsonify:::rcpp_from_json(js, simplify = TRUE, fill_na = FALSE),
  jsonlite = jsonlite <- jsonlite:::parse_and_simplify(js, simplifyVector = TRUE, simplifyDataFrame = TRUE, simplifyMatrix = TRUE),
  times = 1
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> rcppsimdjson 887.617 887.617 887.617 887.617 887.617 887.617 1
#> jsonify 6964.756 6964.756 6964.756 6964.756 6964.756 6964.756 1
#> jsonlite 35507.670 35507.670 35507.670 35507.670 35507.670 35507.670 1
I'm not sure how accurate {bench}'s memory measurements actually are, but they seem to reflect my goal of diagnosing what R structures should look like upfront so they can essentially be treated as immutable once they're created and populated. Now that I think about it, that probably can (should?) be enforced.
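The "diagnose the shape upfront" idea can be sketched in plain R: preallocate the result at its final size and fill it in place, rather than growing it element by element (which reallocates repeatedly). This is an illustration of the principle, not the package's actual C++ implementation.

```r
# Shape is decided once, upfront; the loop only fills, never grows.
n <- 5L
out <- vector("list", n)
for (i in seq_len(n)) out[[i]] <- i * 2L
unlist(out)
#> [1]  2  4  6  8 10
```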
It isn't surprising that the simplification process is the bottleneck, but it's much more than I expected. For comparison, this is what happens without any simplification.
bench::mark(
  rcppsimdjson = rcppsimdjson <- RcppSimdJson:::.deserialize_json(js, simplify_to = 3),
  jsonify = jsonify <- jsonify:::rcpp_from_json(js, simplify = FALSE, fill_na = FALSE),
  jsonlite = jsonlite <- jsonlite:::parse_and_simplify(js, simplifyVector = FALSE, simplifyDataFrame = FALSE, simplifyMatrix = FALSE),
  filter_gc = FALSE,
  check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 rcppsimdjson 1.11s 1.11s 0.901 13.1MB 0
#> 2 jsonify 3.27s 3.27s 0.306 13.1MB 0.612
#> 3 jsonlite 6.68s 6.68s 0.150 13.1MB 0.150
To move the conversation forward about what the user-facing API should look like, here's a prototype w/ some data.table-inspired functionality (text, URL, file download, decompress, etc.), but "vectorized" over multiple strings, URLs, and files.
Don't be shy. It's meant to instigate opinions, criticism, discussion etc. (and definitely has bugs)
files <- dir("~/Documents/rcppsimdjson/inst/jsonexamples/", pattern = "\\.json$", full.names = TRUE, recursive = TRUE)
urls <- c(
"https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/apache_builds.json",
"https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/mesh.json",
"https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/citm_catalog.json",
"https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/canada.json",
"https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/twitter.json",
"https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/github_events.json",
"https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/gsoc-2018.json"
)
gz_files <- sapply(
files[1:10],
function(.x) {
R.utils::compressFile(
.x, remove = FALSE, FUN = gzfile, ext = "gz",
destname = sprintf("%s/%s%s", tempdir(), basename(.x), ".gz")
)
}, USE.NAMES = FALSE
)
json_text <- c("[1,2,3]", "[4,5,6]")
fparse(json_text)
#> [[1]]
#> [1] 1 2 3
#>
#> [[2]]
#> [1] 4 5 6
parsed_files <- fparse(files)
names(parsed_files)
#> [1] "apache_builds.json"
#> [2] "canada.json"
#> [3] "citm_catalog.json"
#> [4] "github_events.json"
#> [5] "gsoc-2018.json"
#> [6] "instruments.json"
#> [7] "marine_ik.json"
#> [8] "mesh.json"
#> [9] "mesh.pretty.json"
#> [10] "numbers.json"
#> [11] "random.json"
#> [12] "adversarial.json"
#> [13] "demo.json"
#> [14] "flatadversarial.json"
#> [15] "che-1.geo.json"
#> [16] "che-2.geo.json"
#> [17] "che-3.geo.json"
#> [18] "google_maps_api_compact_response.json"
#> [19] "google_maps_api_response.json"
#> [20] "twitter_api_compact_response.json"
#> [21] "twitter_api_response.json"
#> [22] "repeat.json"
#> [23] "smalldemo.json"
#> [24] "truenull.json"
#> [25] "twitter_timeline.json"
#> [26] "twitter.json"
#> [27] "twitterescaped.json"
#> [28] "update-center.json"
download_and_parse_files <- fparse(urls)
names(download_and_parse_files)
#> [1] "apache_builds.json" "mesh.json" "citm_catalog.json"
#> [4] "canada.json" "twitter.json" "github_events.json"
#> [7] "gsoc-2018.json"
inflate_and_parse <- fparse(gz_files)
names(inflate_and_parse)
#> [1] "apache_builds.json.gz" "canada.json.gz" "citm_catalog.json.gz"
#> [4] "github_events.json.gz" "gsoc-2018.json.gz" "instruments.json.gz"
#> [7] "marine_ik.json.gz" "mesh.json.gz" "mesh.pretty.json.gz"
#> [10] "numbers.json.gz"
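For discussion's sake, here's a hypothetical base-R sketch of the per-input dispatch such a vectorized front end implies: classify each element as raw JSON text, a URL to download, or a local file. `classify_input` and its heuristics are assumptions for illustration, not the prototype's actual internals.

```r
# Decide how to treat one input: JSON text, a URL, or a path on disk.
classify_input <- function(x) {
  if (grepl('^\\s*[\\[{"]|^\\s*[-0-9tfn]', x)) "text"   # looks like JSON itself
  else if (grepl("^https?://", x))             "url"    # fetch before parsing
  else if (file.exists(x))                     "file"   # read (and maybe inflate)
  else                                         "text"   # fall back to text
}

vapply(c("[1,2,3]", "https://example.com/a.json"),
       classify_input, character(1), USE.NAMES = FALSE)
#> [1] "text" "url"
```

In the real thing the heuristics would need more care (e.g. a nonexistent path that was meant to be a file should probably error rather than be parsed as text), which is exactly the kind of behavior worth bikeshedding here.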
I think we can close this now that 0.1.0 is out. Please re-open with details if something is still amiss.
I've been working on a prototype from_json() functionality here in my fork, which follows the exact same logic as jsonify. A demo of its current output is here. Are you happy for me to make a PR so this from_json() lives inside RcppSimdJson, or would you prefer RcppSimdJson to remain as an 'interface' library, clear of any R clutter? Also tagging in @knapply, who has been working on something similar and may have another implementation?