eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

Feature/simdjson utils #58

Closed knapply closed 3 years ago

knapply commented 3 years ago

@eddelbuettel This is largely complete.

It adds is_valid_utf8(), is_valid_json(), and fminify() (I don't think there's a built-in way to do an fprettify() at the moment).

They're all vectorized (no more vapply(json, jsonlite::validate, logical(1L))!) and work on characters, raws, and lists of raws.
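As a rough illustration of what "vectorized" means here, the sketch below calls the three new functions on a character vector, a single raw vector, and a list of raw vectors. The input strings are made up for this example and are not part of the PR, and the assumption that a lone raw vector is treated as one document is mine.

```r
library(RcppSimdJson)

# Hypothetical inputs, made up for illustration:
# two valid documents and one truncated (invalid) one
json <- c('{"a": 1, "b": [1, 2, 3]}', '[true, false, null]', '{"broken": ')

# Character vector in, one logical per element out (no vapply() needed)
valid_chr <- is_valid_json(json)

# A single raw vector, assumed here to be treated as one document
valid_raw <- is_valid_json(charToRaw(json[[1L]]))

# A list of raw vectors is checked element by element
valid_list <- is_valid_json(lapply(json, charToRaw))

# UTF-8 validation and minification accept the same kinds of input
utf8_ok  <- is_valid_utf8(json)
minified <- fminify(json[1:2])
```

Only the two valid documents are passed to fminify(), since how it reports invalid input is not covered in this thread.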

I need to step away and come back with fresh eyes, but all that should be left is a fresh coat of paint on the documentation (with some examples) and making sure the arguments are sufficiently validated (the validation is not quite as rigorous as in fload()/fparse() yet).

Needless to say, the wizards working upstream have made everything obscenely fast...

all_files <- list.files(system.file("jsonexamples", package = "RcppSimdJson"),
                        recursive = TRUE, full.names = TRUE)
all_text <- vapply(all_files, function(.x) readChar(.x, file.size(.x)), character(1L))

microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::is_valid_utf8(all_text),
    base = base::validUTF8(all_text),
    check = "identical"
)
#> Unit: microseconds
#>      expr      min        lq      mean   median        uq      max neval
#>  simdjson  172.679  177.3735  266.5388  210.508  299.5485  870.676   100
#>      base 2950.096 2974.0835 3249.8992 3155.039 3486.4265 4042.592   100

all_json_files <- list.files(system.file("jsonexamples", package = "RcppSimdJson"),
                             pattern = "\\.json$",
                             recursive = TRUE, full.names = TRUE)
all_json <- vapply(all_json_files, function(.x) readChar(.x, file.size(.x)), character(1L))

# validate single JSON string
microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::is_valid_json(all_json[[1L]]),
    jsonify = jsonify::validate_json(all_json[[1L]]),
    jsonlite = jsonlite::validate(all_json[[1L]]),
    rjsonio = RJSONIO::isValidJSON(all_json[[1L]], asText = TRUE),
    check = "identical"
)
#> Registered S3 method overwritten by 'jsonlite':
#>   method     from   
#>   print.json jsonify
#> Unit: microseconds
#>      expr     min       lq      mean   median       uq     max neval
#>  simdjson  52.935  55.5880  57.88668  56.8310  58.8025  81.423   100
#>   jsonify 422.420 429.0455 443.11886 436.1015 443.3170 569.762   100
#>  jsonlite 585.439 593.1465 613.10092 602.6540 612.7265 703.933   100
#>   rjsonio 199.920 204.9015 213.77264 208.0955 213.8235 284.242   100

# validate many JSON strings
microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::is_valid_json(all_json),
    jsonify = jsonify::validate_json(all_json),
    jsonlite = vapply(all_json, jsonlite::validate, logical(1L), USE.NAMES = FALSE),
    rjsonio = vapply(all_json, RJSONIO::isValidJSON, logical(1L), asText = TRUE, USE.NAMES = FALSE),
    check = "identical"
)
#> Unit: milliseconds
#>      expr       min        lq      mean    median        uq      max neval
#>  simdjson  2.642492  2.824298  3.369945  3.214475  3.503424 12.00913   100
#>   jsonify 13.668904 14.001894 14.977280 14.495898 15.279176 24.76140   100
#>  jsonlite 29.720667 31.003672 32.481978 32.183417 33.693321 38.51637   100
#>   rjsonio  6.834025  7.142605  7.701896  7.370767  7.895685 20.69166   100

# minify single JSON
microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::fminify(all_json[[1L]]),
    jsonify = jsonify::minify_json(all_json[[1L]]),
    jsonlite = jsonlite::minify(all_json[[1L]])
)
#> Unit: microseconds
#>      expr     min       lq      mean   median        uq      max neval
#>  simdjson 241.586 253.3720  270.2938 260.8855  265.6235  538.340   100
#>   jsonify 851.640 896.4275  931.5250 914.9115  948.8345 1158.812   100
#>  jsonlite 911.802 962.1780 1006.2681 981.4720 1015.7915 1492.091   100

# minify many JSON
microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::fminify(all_json),
    jsonify = vapply(all_json, jsonify::minify_json, character(1L), USE.NAMES = FALSE),
    jsonlite = vapply(all_json, jsonlite::minify, character(1L), USE.NAMES = FALSE)
)
#> Unit: milliseconds
#>      expr      min       lq     mean   median       uq      max neval
#>  simdjson 11.86990 12.60303 13.39678 13.14752 13.97499 18.53361   100
#>   jsonify 30.26910 31.48481 33.11033 32.93337 34.30477 38.01050   100
#>  jsonlite 41.14253 42.86855 44.70557 44.63316 46.02481 51.91243   100
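
To put a number on "obscenely fast", the medians reported above can be turned into relative timings. This is only a sketch: it reuses the single-string validation medians (in microseconds) copied from the output above, and the variable names are made up for illustration.

```r
# Median timings (microseconds) copied from the single-string
# validation benchmark above
median_us <- c(simdjson = 56.8310, jsonify = 436.1015,
               jsonlite = 602.6540, rjsonio = 208.0955)

# How many times slower each package is relative to simdjson
slowdown <- round(median_us / median_us[["simdjson"]], 1)
print(slowdown)
#> simdjson  jsonify jsonlite  rjsonio 
#>      1.0      7.7     10.6      3.7
```

The same division can be applied to the other tables; the ratios will vary by machine and input.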

codecov[bot] commented 3 years ago

Codecov Report

Merging #58 into master will increase coverage by 1.66%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   94.28%   95.95%   +1.66%     
==========================================
  Files          17       18       +1     
  Lines        1312     1408      +96     
==========================================
+ Hits         1237     1351     +114     
+ Misses         75       57      -18     

Impacted Files                              Coverage Δ
inst/include/RcppSimdJson/deserialize.hpp   91.86% <ø> (+5.08%) :arrow_up:
src/exported-utils.cpp                      100.00% <100.00%> (ø)
src/simdjson_example.cpp                    100.00% <100.00%> (+3.70%) :arrow_up:

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 0a2537d...1108268.

eddelbuettel commented 3 years ago

Do you want to flip it from draft to genuine PR? Or are there more parts you think are missing?

knapply commented 3 years ago

> Do you want to flip it from draft to genuine PR? Or are there more parts you think are missing?

Sure thing. I was just waiting for the CI to finish, but we should be good.

eddelbuettel commented 3 years ago

Which seems to start in slo-mo these days. [ And because I have more than $THRESHOLD repos I can't even auto-migrate to travis-ci.com. Rock, meet hard place. ]

Any reason not to fold this up and ship it to CRAN? (After one more round of win-builder / rhub of course.)

knapply commented 3 years ago

Nope, I can't think of anything.

eddelbuettel commented 3 years ago

Alrighty, merging and moving right along then.

eddelbuettel commented 3 years ago

Wrapped up and shipped to CRAN. It tickled a 'needs human review' because (I think) the Windows box has no GitHub PAT and hits link-access limits :crying_cat_face:, and possibly also because of two existing build failures on the old box. I would expect it to fly through once they get to it, likely tomorrow morning (European hours).