eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

Export `.deserialize_json()` #78

Closed: shikokuchuo closed this issue 1 year ago

shikokuchuo commented 1 year ago

Please consider exporting `.deserialize_json()`, as `fparse()` performs a lot of safety checks that needlessly detract from performance. Preferably with the `simplify_to` argument defaulting to `3L`, again for best performance. Thanks.
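
A brief illustration (added here for context; the toy input is invented): `simplify_to = 3L` requests the least simplification, leaving results as lists rather than simplifying further to vectors, matrices, or data frames.

# Calling the internal function directly, as shown later in this thread;
# simplify_to = 3L asks for list output and skips the extra simplification work.
RcppSimdJson:::.deserialize_json('[{"a":42},{"a":88}]', simplify_to = 3L)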

eddelbuettel commented 1 year ago

I am all ears. Can you possibly sketch it out in a PR (that we can then iterate over)?

eddelbuettel commented 1 year ago

You can already call it. Isn't that good enough?

> j <- jsonlite::toJSON(data.frame(a=c(42L, 88L), b=c("brown", "fox")))
> j
[{"a":42,"b":"brown"},{"a":88,"b":"fox"}] 
> RcppSimdJson:::.deserialize_json(j)
   a     b
1 42 brown
2 88   fox
> 
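
(For comparison, a small illustration added here, not part of the original comment: the exported wrapper should return the same data frame on this input, since `fparse()` adds its input checks before calling the internal routine.)

# as.character() merely strips jsonlite's "json" class; the result should
# match the .deserialize_json() output shown above.
RcppSimdJson::fparse(as.character(j))
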
shikokuchuo commented 1 year ago

I have created the PR for your perusal.

The overhead from using ::: ...

eddelbuettel commented 1 year ago

(See discussion at your PR #79)

dcooley commented 1 year ago

> it would allow me to call it from a CRAN package (although that is not a priority)

If this is your goal, you can LinkTo this library, include the `RcppSimdJson/deserialize.hpp` header, and write your own version of `deserialize` in your own package?

shikokuchuo commented 1 year ago

@dcooley that's interesting, I had not considered something like that. I have my own fork for my own use; I just thought other CRAN users might want to benefit from the same tweaks.

eddelbuettel commented 1 year ago

  1. It's `LinkingTo`, and it exposes header files. Actually exporting at the foreign function interface (C-level calls across packages) is more powerful; we could even export the internal C++ function with its SEXP-only C interface.
  2. But we still have not seen any rationalization for the rumoured speed increase, or what your usage pattern is: a few large JSON blobs? Many small ones?
  3. You can still, today, work around the `:::` limitation in CRAN package checks; many packages do (see the sketch below).
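
A sketch of one such workaround (added for illustration; this is not code from the thread): resolve the unexported function at run time rather than writing a literal `RcppSimdJson:::.deserialize_json()` in package code, which `R CMD check` flags for CRAN submissions.

# Look the unexported function up at run time so the package source contains
# no literal `:::` call.
deserialize <- utils::getFromNamespace(".deserialize_json", "RcppSimdJson")
deserialize('[{"a":42,"b":"brown"},{"a":88,"b":"fox"}]')
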
eddelbuettel commented 1 year ago

I do not think we are spending time on a worthwhile initiative here:

> jsoninputfile <- system.file("jsonexamples", "twitter.json", package="RcppSimdJson")
> jsoninput <- paste(readLines(jsoninputfile), collapse="")
> res <- rbenchmark::benchmark(RcppSimdJson::fparse(jsoninput), RcppSimdJson:::.deserialize_json(jsoninput))
> res[,1:4]
                                         test replications elapsed relative
2 RcppSimdJson:::.deserialize_json(jsoninput)          100   0.240    1.000
1             RcppSimdJson::fparse(jsoninput)          100   0.248    1.033
> 

The difference is essentially the same as measurement noise.

eddelbuettel commented 1 year ago

As I am the one having to maintain an additional interface, and one we deliberately kept unexposed for a reason, I am not in favor of the proposed change. There were a number of small things wrong in https://github.com/eddelbuettel/rcppsimdjson/pull/79, and they were not addressed but argued over, so I closed this.

For a valid, demonstrated reason I could consider adding another, 'narrower' converter. So far, such a reason has not been provided.

shikokuchuo commented 1 year ago

It is your package, so the design decisions are entirely up to you. I merely addressed each of your comments in PR #79; I am sorry if my replies appeared argumentative, that was never the intention. As mentioned above, I just wanted to share my tweaks. Thanks.

eddelbuettel commented 1 year ago

I am not against helping you. But I asked you three times to show me representative input and what timing difference you see on it. You still have not done it.

The worst outcome, clearly, is for you to run away and needlessly shadow, vendor, or re-package simdjson. Nobody is aided by that. But so far I am simply not seeing a reasonable speed gain from going directly to the currently internal-only function.

shikokuchuo commented 1 year ago

Right - by your own benchmark the difference is 8 ms, about 3% of the timing for your large chunk of JSON. Imagine lots of small chunks; that 8 ms then becomes a large percentage.

shikokuchuo commented 1 year ago

What I'm trying to get at is that simdjson can parse a LOT of JSON in 8 milliseconds, so for just the wrapper to take 8 milliseconds seems a bit of a waste.
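
To make that concrete, here is a sketch of the many-small-inputs measurement discussed above (illustrative only; the inputs and counts are invented, and no timings from such a run appear in this thread):

# Many small JSON strings instead of one large document, so any fixed per-call
# wrapper overhead is paid thousands of times per replication.
small <- sprintf('{"id":%d,"name":"user%d"}', 1:10000, 1:10000)
res <- rbenchmark::benchmark(
  fparse      = for (s in small) RcppSimdJson::fparse(s),
  deserialize = for (s in small) RcppSimdJson:::.deserialize_json(s),
  replications = 10
)
res[, 1:4]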

eddelbuettel commented 1 year ago

That is 8 ms over 100 runs. Also, 8 / 240 is about 3.33 per cent -- exactly what the relative column shows. And if you had ever looked at benchmarking in any detail, you would be aware that differences this small are of the order that changes randomly between runs. Moreover, what we get with RcppSimdJson is an orders-of-magnitude difference over the alternatives. If all that is not good enough for you (in the context of an interpreted language), well then I guess I cannot stop you from reinventing that wheel you so desperately crave.

eddelbuettel commented 1 year ago

And for completeness, a re-run with `simplify_to=3L` (and more replications) again ends up in the same vicinity, at about a 2% improvement:

> res <- rbenchmark::benchmark(RcppSimdJson::fparse(jsoninput), RcppSimdJson:::.deserialize_json(jsoninput, simplify_to=3L), replications=500)
> res[,1:4]
                                                          test replications elapsed relative
2 RcppSimdJson:::.deserialize_json(jsoninput, simplify_to = 3)          500   1.264    1.000
1                              RcppSimdJson::fparse(jsoninput)          500   1.290    1.021
> 
eddelbuettel commented 1 year ago

(I currently have my hands full with another issue, but I think I can help you by making the currently internal function available as a C function you can hit directly from C code. That way you save the R call overhead too. I am doing a bit more of that, e.g. in the RcppSpdlog package, which exports a few plain C-level functions for logging that the (nicer-to-use) R wrapper spdl picks up. Let me know if that would be of interest; I am currently tied up with another package but can prepare a basic test case pretty quickly...)

shikokuchuo commented 1 year ago

If you make it available at the C level then I would definitely make use of it. Great idea, especially as you are doing that more with your other packages. No rush at all; this is not urgent for me. Thanks.