NicolasJiaxin commented 2 years ago

This PR is only a prototype. I gladly invite people to rewrite the code or redesign the code.

The goal of this PR was to see if simdjson On Demand API was mature enough to replace the DOM API currently used for RcppSimdJson. This PR provides a working prototype for this. For those unfamiliar with the On Demand API, here is a brief comparison of the two approaches:

Given a JSON document, DOM first scans the document to detect structural indexes (things such as openings and closing brackets, etc.). Then, it constructs a tape that represents the structure of the document. On the other hand, On Demand also starts by indexing the documents (exactly the same code as DOM). However, it does not construct a tape. Instead, it uses those structural indexes to navigate/jump around. This has a few constraints though:

Iteration/parsing is forward only for the most part (array/object/document can be reset/rewinded).
Values are not pre-parsed.
JSON syntax validation is only done when accessing elements and values.

Now, regarding RcppSimdJson, the design is to scan through the JSON document to determine the type of the values in it. Based on the structure it scanned, it tries to return the JSON first as a dataframe, then a matrix, then a vector, and finally as a list. Here is a list of changes I brought:

To make the above design work with On Demand, I inserted array.reset() and object.reset() in appropriate places. Ideally, we would like to scan the document once, reset once, and then go through the document a second time to put elements into an appropriate R structure. However, as of now, I think I have multiple reset for matrices and dataframes.
Since values are not preparsed, On Demand cannot distinguish number types(int, uint, double) "on its own" which is a crucial part of RcppSimdJson. That is where ondemand::number helps (credits to @lemire). To accomodate ondemand::number_type(enum class for numbers) with ondemand::json_type(enum class for JSON types), I created a enum class utils::complete_json_type which includes both types for JSON elements and numbers. This was useful for matrices to determine common_element_type.
ondemand::document plays a bigger role in the On Demand API compared to dom::document in the DOM API. For that reason, I also had to modify deserialize.hpp to deal with ondemand::document (scalar and non-scalar documents).
Finally, because JSON validation is done only at access, I had to change a few tests in tinytest, and I commented on those tests why I was changing it.

Other than that, the design is pretty much the same. I do not know if this is the best approach for On Demand, but as I said, the goal of this PR was only to produce a working prototype. Contributions are welcomed to make a working, optimized version when simdjson v1.0 is released (very soon).

eddelbuettel commented 2 years ago

I am a little tied up right now but should get a chance to play at some point 'soon'.

codecov[bot] commented 2 years ago

Codecov Report

Merging #75 (0b8a968) into master (de81b7c) will decrease coverage by 10.78%. The diff coverage is 90.66%.

@@             Coverage Diff             @@
##           master      #75       +/-   ##
===========================================
- Coverage   99.55%   88.76%   -10.79%     
===========================================
  Files          18       18               
  Lines        1336     1433       +97     
===========================================
- Hits         1330     1272       -58     
- Misses          6      161      +155

Impacted Files	Coverage Δ
R/fparse.R	`100.00% <ø> (ø)`
inst/include/RcppSimdJson/common.hpp	`100.00% <ø> (ø)`
inst/include/RcppSimdJson/utils.hpp	`100.00% <ø> (ø)`
inst/include/RcppSimdJson/deserialize.hpp	`62.86% <79.06%> (-37.14%)`	:arrow_down:
inst/include/RcppSimdJson/deserialize/simplify.hpp	`94.62% <90.38%> (-5.38%)`	:arrow_down:
inst/include/RcppSimdJson/deserialize/scalar.hpp	`93.24% <91.52%> (-6.76%)`	:arrow_down:
...nst/include/RcppSimdJson/deserialize/dataframe.hpp	`91.42% <91.66%> (-8.58%)`	:arrow_down:
inst/include/RcppSimdJson/deserialize/vector.hpp	`95.52% <92.85%> (-4.48%)`	:arrow_down:
inst/include/RcppSimdJson/deserialize/matrix.hpp	`97.22% <95.16%> (-2.78%)`	:arrow_down:
...t/include/RcppSimdJson/deserialize/Type_Doctor.hpp	`96.38% <98.21%> (-3.62%)`	:arrow_down:
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update de81b7c...0b8a968. Read the comment docs.

NicolasJiaxin commented 2 years ago

One of the issue I have mentioned here is that in the current design, when building dataframes, we build columns by columns. However, this means we have to rewind for every columns which is not optimal. We would rather build row by row, but I am not sure how feasible is this. This could be a starting if someone wants to improve on the current design.

eddelbuettel commented 2 years ago

That's what R does too -- data.frames are lists of column vectors. So while you could prepare of (empty, growable) vectors and grow them row by row it is not what is commonly done (say when talking to DBs). Of course, here other things may be more importand we could accomodate row-wise growth (esp via STL containers).

lemire commented 2 years ago

Thanks @NicolasJiaxin.

@eddelbuettel This PR elegantly avoids the nasty integer/float issue, and seems to get most things right. It is a prototype, as clearly stated... but it is sane.

There is a balancing act between visiting arrays/objects twice, and other strategies, but they are best addressed experimentally.

eddelbuettel commented 2 years ago

(Benefit of zoom office hours when nobody shows up ....)

Took a look and it looks great! Two pretty-pleases that should be easy (I can even puch back to you repo if you want):

Please rebase, I did three minor household commits that you could integrate for a cleaner ff-only merge:

* | de81b7c - (origin/master, master, feature/check_cxx11) detail that C++17 is in fact currently required (2 weeks ago) <Dirk Eddelbuettel>
* | 977db16 - detail optional C++17 use in SystemRequirements (closes #74) (2 weeks ago) <Dirk Eddelbuettel>
* | 26c369e - minor README.md edit (4 weeks ago) <Dirk Eddelbuettel>

R whines when compiling. It would be nice if this were silent:

edd@rob:~/git/rcppsimdjson(pr/75)$ install.r
* installing *source* package found in current working directory ...
* installing *source* package ‘RcppSimdJson’ ...
** using staged installation
** setting up C++17
** libs
ccache g++  -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/local/lib/R/site-library/Rcpp/include'   -DSIMDJSON_NO_COMPUTED_GOTO -I../inst/include -fopenmp -fpic  -g -O3 -Wall -pipe -pedantic -Wno-misleading-indentation -Wno-unused -Wno-ignored-attributes -Wno-class-memaccess -c RcppExports.cpp -o RcppExports.o
ccache g++  -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/local/lib/R/site-library/Rcpp/include'   -DSIMDJSON_NO_COMPUTED_GOTO -I../inst/include -fopenmp -fpic  -g -O3 -Wall -pipe -pedantic -Wno-misleading-indentation -Wno-unused -Wno-ignored-attributes -Wno-class-memaccess -c deserialize.cpp -o deserialize.o
In file included from ../inst/include/RcppSimdJson/deserialize/../common.hpp:145,
                 from ../inst/include/RcppSimdJson/deserialize/simplify.hpp:5,
                 from ../inst/include/RcppSimdJson/deserialize.hpp:5,
                 from ../inst/include/RcppSimdJson.hpp:5,
                 from deserialize.cpp:2:
../inst/include/RcppSimdJson/deserialize/../../simdjson.h:27480:137: warning: extra ‘;’ [-Wpedantic]
27480 | simdjson_really_inline simdjson_result<const char *> document_reference::current_location() noexcept { return doc->current_location(); };
      |                                                                                                                                         ^
In file included from ../inst/include/RcppSimdJson/deserialize.hpp:5,
                 from ../inst/include/RcppSimdJson.hpp:5,
                 from deserialize.cpp:2:
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::Double; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:431:90:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
  211 |     switch (doc.type()) {
      |     ^~~~~~
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::String; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:433:90:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::Integer64; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:435:93:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::Always; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:437:90:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
ccache g++  -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/local/lib/R/site-library/Rcpp/include'   -DSIMDJSON_NO_COMPUTED_GOTO -I../inst/include -fopenmp -fpic  -g -O3 -Wall -pipe -pedantic -Wno-misleading-indentation -Wno-unused -Wno-ignored-attributes -Wno-class-memaccess -c exported-utils.cpp -o exported-utils.o
In file included from ../inst/include/RcppSimdJson/deserialize/../common.hpp:145,
                 from ../inst/include/RcppSimdJson/deserialize/simplify.hpp:5,
                 from ../inst/include/RcppSimdJson/deserialize.hpp:5,
                 from ../inst/include/RcppSimdJson.hpp:5,
                 from exported-utils.cpp:1:
../inst/include/RcppSimdJson/deserialize/../../simdjson.h:27480:137: warning: extra ‘;’ [-Wpedantic]
27480 | simdjson_really_inline simdjson_result<const char *> document_reference::current_location() noexcept { return doc->current_location(); };
      |                                                                                                                                         ^
In file included from ../inst/include/RcppSimdJson/deserialize.hpp:5,
                 from ../inst/include/RcppSimdJson.hpp:5,
                 from exported-utils.cpp:1:
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::Double; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:431:90:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
  211 |     switch (doc.type()) {
      |     ^~~~~~
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::String; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:433:90:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::Integer64; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:435:93:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::Always; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:437:90:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
ccache g++  -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/local/lib/R/site-library/Rcpp/include'   -DSIMDJSON_NO_COMPUTED_GOTO -I../inst/include -fopenmp -fpic  -g -O3 -Wall -pipe -pedantic -Wno-misleading-indentation -Wno-unused -Wno-ignored-attributes -Wno-class-memaccess -c internal-utils.cpp -o internal-utils.o
ccache g++  -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/local/lib/R/site-library/Rcpp/include'   -DSIMDJSON_NO_COMPUTED_GOTO -I../inst/include -fopenmp -fpic  -g -O3 -Wall -pipe -pedantic -Wno-misleading-indentation -Wno-unused -Wno-ignored-attributes -Wno-class-memaccess -c rcppsimdjson_utils_check.cpp -o rcppsimdjson_utils_check.o
In file included from ../inst/include/RcppSimdJson/deserialize/../common.hpp:145,
                 from ../inst/include/RcppSimdJson/deserialize/simplify.hpp:5,
                 from ../inst/include/RcppSimdJson/deserialize.hpp:5,
                 from ../inst/include/RcppSimdJson.hpp:5,
                 from rcppsimdjson_utils_check.cpp:3:
../inst/include/RcppSimdJson/deserialize/../../simdjson.h:27480:137: warning: extra ‘;’ [-Wpedantic]
27480 | simdjson_really_inline simdjson_result<const char *> document_reference::current_location() noexcept { return doc->current_location(); };
      |                                                                                                                                         ^
In file included from ../inst/include/RcppSimdJson/deserialize.hpp:5,
                 from ../inst/include/RcppSimdJson.hpp:5,
                 from rcppsimdjson_utils_check.cpp:3:
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::Double; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:431:90:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
  211 |     switch (doc.type()) {
      |     ^~~~~~
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::String; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:433:90:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::Integer64; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:435:93:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp: In instantiation of ‘SEXPREC* rcppsimdjson::deserialize::simplify_scalar_document(simdjson::fallback::ondemand::document_reference, SEXP) [with rcppsimdjson::utils::Int64_R_Type int64_opt = rcppsimdjson::utils::Int64_R_Type::Always; SEXP = SEXPREC*]’:
../inst/include/RcppSimdJson/deserialize.hpp:437:90:   required from here
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘array’ not handled in switch [-Wswitch]
../inst/include/RcppSimdJson/deserialize/simplify.hpp:211:5: warning: enumeration value ‘object’ not handled in switch [-Wswitch]
ccache g++  -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/local/lib/R/site-library/Rcpp/include'   -DSIMDJSON_NO_COMPUTED_GOTO -I../inst/include -fopenmp -fpic  -g -O3 -Wall -pipe -pedantic -Wno-misleading-indentation -Wno-unused -Wno-ignored-attributes -Wno-class-memaccess -c simdjson_example.cpp -o simdjson_example.o
In file included from simdjson_example.cpp:6:
../inst/include/simdjson.h:27480:137: warning: extra ‘;’ [-Wpedantic]
27480 | simdjson_really_inline simdjson_result<const char *> document_reference::current_location() noexcept { return doc->current_location(); };
      |                                                                                                                                         ^
ccache g++ -std=gnu++17 -Wl,-S -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -o RcppSimdJson.so RcppExports.o deserialize.o exported-utils.o internal-utils.o rcppsimdjson_utils_check.o simdjson_example.o -fopenmp -L/usr/lib/R/lib -lR
installing to /usr/local/lib/R/site-library/00LOCK-rcppsimdjson/00new/RcppSimdJson/libs
** R
** demo
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (RcppSimdJson)
edd@rob:~/git/rcppsimdjson(pr/75)$

But it (importantly) tests cleanly, pre- or post install:

edd@rob:~/git/rcppsimdjson(pr/75)$ tt.r -n 1 -p RcppSimdJson
test_compressed_files.R.......   28 tests OK 7.9s
test_deserialization.R........  117 tests OK 0.1s
test_encoding.R...............   12 tests OK 12ms
test_fparse_fload.R...........  168 tests OK 0.5s
test_int64.R..................    1 tests OK 2ms
test_issues.R.................   16 tests OK 19ms
test_load_json.R..............   29 tests OK 100ms
test_misc.R...................    1 tests OK 2ms
test_query.R..................   45 tests OK 35ms
test_simdjson_utils.R.........   37 tests OK 36ms
test_validation.R.............   30 tests OK 19ms
test_vectorized_ops.R.........    5 tests OK 8ms
All ok, 489 results (8.8s)
edd@rob:~/git/rcppsimdjson(pr/75)$

(Here build.r, install.r, rcpp.r, tt.r are little helpers I wrote for eddelbuettel/littler, see inst/examples, because I like to work that way. Standard R calls under them.)

@knapply When you have a moment...

NicolasJiaxin commented 2 years ago

@eddelbuettel I fixed the warnings you mentioned. It should all be silent now!

eddelbuettel commented 2 years ago

Very nice!

And I just tried to rebase (which worked cleanly) but I cannot push back to you because ... you actually forked from the fork by @lemire. So to cut out the middle man (salut, Daniel!) try this

git remote add parentofparentrepo git@github.com:eddelbuettel/rcppsimdjson.git
git pull --all
git rebase parentofparentrepo/master                                             # should work cleanly
git push --force-with-lease origin clean-updating-ondemand

which should also update the PR.

lemire commented 2 years ago

@eddelbuettel @NicolasJiaxin Yes, please cut me out. I think I only served as the middle man so that we would not spam the main repo while building it up.

NicolasJiaxin commented 2 years ago

I think this should do it. Thanks @eddelbuettel!

eddelbuettel commented 2 years ago

It's a lot of commits. Which is kinda impressive but still a lot. I am not too adamant about it either way but among you all (i.e., @knapply @lemire @NicolasJiaxin ) any preference for squash-merge or standard fast-forward? I may just do the latter...

eddelbuettel commented 2 years ago

Not shabby for a "prototype" :)

(but a little behind what we have on the GitHub README.md fromt the previous implementation -- so a small net cost to 'on demand' ?)

NicolasJiaxin commented 2 years ago

It's a lot of commits. Which is kinda impressive but still a lot. I am not too adamant about it either way but among you all (i.e., @knapply @lemire @NicolasJiaxin ) any preference for squash-merge or standard fast-forward? I may just do the latter...

I think squash-merge is more appropriate. A lot of the commits are not very meaningful (some are just trials and errors when I was unable to run tests locally).

NicolasJiaxin commented 2 years ago

Not shabby for a "prototype" :)

That is very nice to see! However, the expectation is that On Demand should be faster, so this means there is probably work to do here if we want to switch over to the On Demand API.

eddelbuettel commented 2 years ago

Sounds good. It looks like a genuine ton of really nice work. I am usually sticker asking contributor to summarise in ChangeLog too :) but I happily fill in (but if you won'. Will merge now, we can let it simmer to 1.0.0 is done upstream and then updated as needed and release.

NicolasJiaxin commented 2 years ago

@eddelbuettel I updated the ChangeLog. I was unsure on what to report exactly since there were many changes. Let me know if you want me to be more descriptive.

eddelbuettel commented 2 years ago

Emacs users do form a tribe :)

(From looking at the changelog here it seems the * are not aligned but that too is simple to fix. Appreciate the entry!)

lemire commented 2 years ago

@eddelbuettel On Demand should boost performance. Maybe not a whole lot, but at least a bit. If you are seeing a performance regression compared to your previous wrapper, then it suggests that further tuning is required.

(It does not follow that it is needed because benchmarking can be tricky and what looks at first like a regression is not a regression... let us always be careful. Benchmarking is hard.)

We have extensive benchmarks showing how On Demand is superior when used within C++.

With On Demand, if you open a file containing 1000 numbers and you need to put the numbers into your own data structure, they are materialized there directly, without being first written to a temporary tree. Thus we expect that the simdjson On Demand might often provide superior performance.

It is not guaranteed, of course, because since there is no materialization ever, we always point at the document and it is maybe needed to rescan the same data multiple times. That can eat into our gains.

So if we need to scan the same data multiple times, we can lose performance.

The best way to investigate these issues is to run some experiments to try to determine the bottlenecks and solve them.

lemire commented 2 years ago

@eddelbuettel I think that the problem with this PR might be as follow. Number parsing is expensive. This PR might do number parsing twice.

So you are given...

[1,2.,3]

You first go through it, find out that you have "integer, float, integer". Then you throw away the parsed numbers.

You then resume and reparse the whole thing a second time, after initiating a structure that matches the detected types the first time around.

That's not great.

The idea with On Demand is that we never store the data... we don't have a DOM tree. You, the caller, store the data.

lemire commented 2 years ago

So it is possible that we need to look at the design more seriously to avoid double number parsing.

eddelbuettel commented 2 years ago

The only reason I brought the demo benchmark is up that it is quite literally unchanged and from 'eyeballing' looked like the relative gain of our approach was smaller. The next methods come in at around ~ 10ms, we are now around 5ms but maybe were at 3ms according to the older pic.

This is really hand-wavy on my part but all we do here is have RcppSimdJson::fparse(json) be timed 100 times (on the twitter data).

lemire commented 2 years ago

@eddelbuettel

Ok. So it is important to see that On Demand is just a pointer inside the document. It has no storage and no memory. It just moves through the document. Superficially, it looks like DOM, but it is really just a thin iterator over the input string. So it might be looking at 123213.0, ffdsfdsdsf.... It is pointing at 1. You ask the type and it tells you that it must be a number (it knows because it is looking at a digit). But it does not know that it is 123213.0. Now if you say "give me a binary64 number (double)", it will attempt to parse 123213.0 as such. We normally move forward, from one JSON token to another, but we also introduced a 'reset' or 'rewind' functionality so that you can scan through an array, say, twice. The idea is that you can use the first pass to gather some information because doing the second and final pass. That's not ideal. The best way to use On Demand is really to just move forward. But we understand that sometimes, two passes are needed.

My first instinct was to represent everything using 64-bit floats. This automatically gives you 32-bit integers since they can be represented exactly in 64-bit floats. This makes the On Demand wrapper really fast because simdjson is very good at identifying numbers. It is essentially free. Indeed, just the first character of a token tells you whether it is a number (0-9 and -).

But then, if you need distinguish between multiple nuanced number types, it gets harder.

E.g., if you need to distinguish between 132321.0and 132321, then you need to process the string. Now it seems that R also needs to support 64-bit integers. So the wrapper also needs to figure how if it is an integer, and if it is, whether it exceeds or matches 1<<63.

That makes everything much harder and expensive because you first pass is no longer trivial.

So the current solution in this PR is to process the whole string, take 132321.0 compute the float value. Report that it is a float. Throw away the result. Continue processing the data. Rewind at the beginning, then come back to 132321.0 compute the float value, and parse it again.

Keep in mind that parsing numbers is just about the most expensive task we have to do with JSON parsing. So doubling the number parsing is definitively not good.

What @NicolasJiaxin suggested is that we add is_integer, is_big_integer (for integers exceeding or matching 1<<63). So in a first pass, we would scan each number, possibly twice, to categorize it, and the come back later and rescan it for a final parse. Maybe I misunderstand Nicolas' idea... But... no matter what, we end up scanning each number string at least twice, if not three times. Compared to a DOM approach, where we construct a DOM tree within simdjson, that's not advantageous. The DOM tree will materialize each number just once.

Ultimately, if you need to scan the JSON tree multiple times, and you need fine-grain type information, then On Demand might not be best. It is possible that my earlier PR at https://github.com/eddelbuettel/rcppsimdjson/pull/70 is better. (That would not make me happy, but whether I am happy or not is not relevant.)

(I do not know if I am stating is accurate. We really need to investigate this seriously. I just want to explain how On Demand works.)

lemire commented 2 years ago

@eddelbuettel One way to explain On Demand is to say it is lazy. It is laziest parser in the world. It tries to do as little work as it can. It pushes the hard work to you (the user) thinking that you can do it faster and better.

eddelbuettel commented 2 years ago

The performance is also 'quite fine by me' as it is still leaps and bounds ahead of the others, and now with an inner implementation you two like better / find more maintenable. Works for me. I just out that we apparently regressed a little from the earlier implementation, which always stings a little.

eddelbuettel commented 2 years ago

I will wait one more day for @knapply and if we don't hear from him I will likely merge.

lemire commented 2 years ago

@eddelbuettel I suggest that if you do merge, you add an issue to the effect that there are probably significant performance gains to be had by avoiding "double number parsing". I can help writing the issue, or even write the issue myself. No performance regression should be happening.

eddelbuettel commented 2 years ago

Yes please sketch something here. I am apparently not fully awake yet and missed that you suggested something operational now on the user side. Clue me in, please.

lemire commented 2 years ago

@eddelbuettel Here is a sketch:

Title: Significant optimization opportunity

Rcppsimdjson adopted the On Demand front-end from the simdjson library. On Demand provides a thin/lazy indexed pointer over the input JSON string. On Demand has no storage and no memory, it just moves through the document. Superficially, it looks like a conventional DOM-based parser, but it is really just a thin iterator over the input string. So it might be looking at the beginning of the string 123213.0, ffdsfdsdsf. If you ask the type, the On Demand front-end will look at the current character (1), it detects it must be a number (it knows because it is looking at a digit). But On Demand does not know that it is the number 123213.0. You can then ask On Demand "give me a binary64 number (double)", and it will attempt to parse 123213.0 as such. We normally move forward, from one JSON token to another, but On Demand also has a 'reset' or 'rewind' functionality so that you can scan through an array twice. The idea is that you can use the first pass to gather some information before doing the second and final pass.

If you could just represent all numbers using a single type (say 64-bit floats), then it would be easy. However, Rcppsimdjson chooses to support multiple number types (64-bit signed integers, 64-bit unsigned integers and 64-bit floating-point numbers).

To support dynamically-typed numbers, in simdjson, we provide a get_number() functionality that returns a number type that can be any of those (64-bit signed integers, 64-bit unsigned integers and 64-bit floating-point numbers). It autodetects the best type given a number. For example, 9223372036854775808 is represented using a 64-bit unsigned integer, 9223372036854775808.0 is represented as a 64-bit floating-point number and so forth. The get_number() functions return a structure that stores both the value and its type.

Currently, the wrapper does two passes over the JSON input, fully parsing each number twice. The first time we call get_number(), grab the number type and throw away its value. Then in a second step, the JSON input is rescanned a second time and each number is parsed again.

It is somewhat unfortunate. Here are two possibilities that would allow us to go faster by avoiding parsing each number twice.

If we could, instead, just distinguish between floating-point numbers and integers instead of distinguishing between unsigned and signed integers as well as floats, we could go faster right away with little effort. That is, if we just have two types (integers and floats), we can go faster and have simpler code. We have a is_integer() function that does not require full number parsing. Determining whether a number string is an integer requires us to scan it, but the scan can be fast since it does not need to compute the value, it merely looks for a decimal separator or an exponent suffix. So the first scan could check where the numbers are and, if needed, just check which ones are integers. This first pass would not require parsing the numbers at all.
Even if we keep the three "number types" as currently, we could, instead of parsing numbers two times, just write the number instances to a temporary buffer and thus avoid parsing the numbers twice. In effect, we could, the first time we parse a number, just memorize the value. Then the second time you go through the document, you'd simply access the number buffer and recovered the already parsed numbers.

lemire commented 2 years ago

@eddelbuettel Please let me know if my sketch is clear enough. I am trying to make sure that you and others who do not know simdjson as well as I do fully understand the issue. The issue is not hard to understand or complicated but it is possible that I do not explain myself well. Feedback and questions are invited. We can videoconference if you'd like me to explain it in person.

eddelbuettel commented 2 years ago

@lemire thanks -- wondering though who the audience for this is. Typical users of the package just call RcppSimdJson::fparse(...) or fload(...) and ignore internals. So ... should we place this next to the header file in case someone wanted to program at that level against it? (We ship the header so someone could yet I doubt anybody does...).

The package does not yet have a vignette. We could add it there too under 'technical memo' or something...

lemire commented 2 years ago

@eddelbuettel I thought about a GitHub issue actually.

See https://github.com/eddelbuettel/rcppsimdjson/issues

eddelbuettel commented 2 years ago

Ok, that folder has four open and 38 closed ones. Anything in particular, apart from the obvious desiderata of catching up with upstream you and @NicolasJiaxin so kindly took care of :)

Or do you mean you want to file it there as an issue? Works for me too.

lemire commented 2 years ago

Or do you mean you want to file it there as an issue? Works for me too.

That is what I mean.

I can file it right now myself if you'd like but it makes most sense if you decide to merge the current PR. Or you can just copy my proposal and file it yourself at your leisure.

lemire commented 2 years ago

@NicolasJiaxin @eddelbuettel I shall be proposing something tomorrow that can accelerate the wrapper without much effort. Wait 24 hours for me.

lemire commented 2 years ago

@NicolasJiaxin Ok. I think this PR could be made faster and simpler with the following PR upstream: https://github.com/simdjson/simdjson/pull/1713

(I realize you are busy Nicolas... if you can't get to it, I can probably hack your PR... but I suspect that the required changes are tiny.)

knapply commented 2 years ago

Apologies for being silent. Not much time to devote to this at the moment.

A few quick thoughts to consider: 1) The number dynamism we originally provided was to compensate for limitations in R (and thus the other JSON parsing packages), namely the lack of native 64-bit or unsigned integers. To simplify the code in a way that better cooperates with On-Demand, we could simply have the default setting parse all numbers to doubles (like most other R packages seem to do). If memory serves, a 64-bit double can represent a 32-bit integer exactly up to its maximum size (2,147,483,647). If the user doesn't care about exact precision or knows there are no numbers that exceed 2147483647L, they get maximum speed.

2) @eddelbuettel Regarding this comment,

So ... should we place this next to the header file in case someone wanted to program at that level against it? (We ship the header so someone could yet I doubt anybody does...).

I link to {RcppSimdJson} relatively frequently for some things where I know the schema upfront. I think this would be the most common use case, but -- so far -- it always involves writing a fair amount of Rcpp code that populates data frames using just the simdjson header files. The {RcppSimdJson} header files aren't very useful if you know the input data schema and exactly what you want for the outputted R object.

lemire commented 2 years ago

If memory serves, a 64-bit double can represent a 32-bit integer exactly up to its maximum size (2,147,483,647).

You can represent exactly all 53-bit integers... so it much better than that!!! In R, it seems that by default, you have 32-bit integers and 64-bit floats. I am not sure why it was done that way because the floats can represent (exactly) the 32-bit integers. But then I am no R expert.

My latest PR to simdjson should make it less expensive to support the current rcppsimdjson behaviour. What my PR (upstream) does is scan quickly the number string and classify it into integer, float and unsigned integer... but without computing the number. We still scan everything twice, but that's maybe unavoidable and not so bad. So if we update the PR to take into account my upstream PR, I think I am going to be happy intellectually with this (rcppsimdjson) pull request. That is, the code is not going to be "obviously inefficient". Indeed, the current PR actually does compute (fully) all numbers twice. I don't even know if it is particularly inefficient, but it gives me acne just to think about it.

eddelbuettel commented 2 years ago

R & R created R in Auckland back in the very early nineties. And the storage difference between 32 bit and 64 bit mattered a lot at the time (and still does today for some apps -- we "always run out of memory eventually"). I do not know what S did internally but R does everything around a 'S Expression Pointer' aka SEXP with a (by choice) very limited set of types. So here we are.

(It has its benefits. A Python-savvy colleague of mine was amazed that we can serialize and deserialize any R object without any extra fuzz. Upside of a simple SEXP ...)

lemire commented 2 years ago

And the storage difference between 32 bit and 64 bit mattered a lot at the time (and still does today for some apps -- we "always run out of memory eventually").

Ah. That's an excellent answer.

NicolasJiaxin commented 2 years ago

@eddelbuettel @lemire I cleaned up the code so that we do not parse/compute numbers unnecessarily using the PR @lemire mentioned (faster get_number_type). I also cleaned up a little bit scalar.hpp because I think sometimes, numbers were parsed twice instead of once. The expectations is that it should be faster, but I will let numbers show if that is the case.

eddelbuettel commented 2 years ago

Thanks for the update. (And there are once again a few stray ; we'd want to take out but I could also do that post merge).

It looks like it did not move the needle much:

We are still faster than the alternatives.

> res
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval   cld
 simdjson  4.58040  4.87331  5.23455  5.05407  5.31515  7.95953   100 a    
  jsonify  9.10063  9.60060 10.30886  9.87205 10.37118 20.18826   100  b   
  RJSONIO 11.81913 12.46425 13.30523 13.11117 13.92427 16.57848   100   c  
   ndjson 28.42622 30.51445 34.18605 32.91319 34.62468 87.20684   100    d 
 jsonlite 37.03524 40.17538 42.28813 41.57711 43.03116 84.68678   100     e

>

but possibly (anecdotaly) slower than (the possibly earlier/rawer) version I used for the chart in the repo README.md. So to be a little more rigorous, I just went back to CRAN version 0.1.5 which is indeed faster:

> res
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval   cld
 simdjson  2.26443  2.58361  2.87319  2.72057  2.88002  7.10744   100 a    
  jsonify  9.04628  9.60193 10.17232  9.86265 10.45408 15.11609   100  b   
  RJSONIO 12.12393 12.81666 14.12716 13.42787 14.11360 59.02396   100   c  
   ndjson 28.29516 31.15533 33.98465 33.11274 35.29460 65.32814   100    d 
 jsonlite 36.52221 40.31987 42.60612 41.82920 43.40991 90.98211   100     e

>

lemire commented 2 years ago

@eddelbuettel These are very disappointing numbers.

I do not know what could explain a 2x regression. In every single one of our tests, On Demand is at least a bit faster than DOM.

There is nothing I can imagine we might do upstream at this time.

It could be interesting to compare the performance with the other PR which is merely a minor API update: https://github.com/eddelbuettel/rcppsimdjson/pull/70

eddelbuettel commented 2 years ago

Hm, I had thought #70 was held back by segfaults? I guess not. My bad -- I missed that.

Switching to that is closer to what we had before:

> res
Unit: milliseconds
     expr      min       lq     mean   median       uq     max neval   cld
 simdjson  2.27637  2.57725  2.97962  2.78669  3.07431 13.1829   100 a    
  jsonify  9.18581  9.60845 10.35618  9.91750 10.37521 19.6375   100  b   
  RJSONIO 11.90307 12.53867 13.45624 13.20775 13.93215 19.1741   100   c  
   ndjson 28.38733 31.42773 34.14512 33.35089 35.65473 67.3227   100    d 
 jsonlite 36.37115 40.16429 44.09259 42.50186 44.87049 85.7617   100     e

>

(By the way this just involves installing the respective RcppSimdJson version and running demo(simpleParseBenchmark, package="RcppSimdJson", ask=FALSE). If you do not have some or all of the other candidates just comment them out in demo/simpleParseBenchmark.R)

NicolasJiaxin commented 2 years ago

The benchmark used is twitter.json which is returned as a dataframe, right? If that is the case, that could possibly/partly explain why On Demand is slower since dataframes are not optimally built and many resets are used.

eddelbuettel commented 2 years ago

Correct. It is just calling fparse() pointed at the twitter dataset. We generally return data.frame objects which (internally) are just list of (column) vectors (with possibly varying types). It is quite possible that the parsing is now faster but that we (for now?) do something less optimal constructing vectors.

lemire commented 2 years ago

Yes. There might be an architectural issue.

NicolasJiaxin commented 2 years ago

Then, in that case it would be interesting how this PR (and the earlier versions) do in another benchmark. Over on the simdjson repo, we do have many benchmarks, but I do not know if there are all suitable for R parsing. Maybe @lemire has a suggestion..

eddelbuettel commented 2 years ago

Oh it's easy we just re-taylor it. Point to a (large enough) JSON corpus, measure just parse. Reiterate on other aspects.

We also have a r + c++ sampling profiler by a Canadian pal we could let loose (on suitable operating systems such as the one I use): https://github.com/atheriel/xrprof

lemire commented 2 years ago

Intuitively, there might been an issue having to do with the fact that the wrapper we designed for a DOM API... There might also be a performance bug in how the new wrapper is written.

lemire commented 2 years ago

A good strategy to gain understanding might be to try different inputs.

eddelbuettel / rcppsimdjson

[PROTOTYPE] Updating to simdjson On Demand (~version 1.0) #75

Codecov Report