eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

[PROTOTYPE] Updating to simdjson On Demand (~version 1.0) #75

Closed NicolasJiaxin closed 2 years ago

NicolasJiaxin commented 3 years ago

This PR is only a prototype. I gladly invite people to rewrite or redesign the code.

The goal of this PR was to see whether the simdjson On Demand API is mature enough to replace the DOM API currently used in RcppSimdJson. This PR provides a working prototype for this. For those unfamiliar with the On Demand API, here is a brief comparison of the two approaches:

Given a JSON document, DOM first scans the document to detect structural indexes (things such as opening and closing brackets, etc.). Then, it constructs a tape that represents the structure of the document. On Demand also starts by indexing the document (exactly the same code as DOM). However, it does not construct a tape. Instead, it uses those structural indexes to navigate/jump around. This has a few constraints though:

Now, regarding RcppSimdJson, the design is to scan through the JSON document to determine the types of the values in it. Based on the structure it scanned, it tries to return the JSON first as a dataframe, then a matrix, then a vector, and finally as a list. Here is a list of changes I made:

Other than that, the design is pretty much the same. I do not know if this is the best approach for On Demand, but as I said, the goal of this PR was only to produce a working prototype. Contributions are welcome to make a working, optimized version when simdjson v1.0 is released (very soon).
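
For a concrete picture of the DOM versus On Demand comparison above, here is a toy sketch in plain C++. It is not simdjson code and does not use the simdjson API: `structural_indexes` and `find_field` are hypothetical names invented for illustration, the scan is a scalar loop rather than SIMD, and it indexes only brackets, colons, commas and opening quotes. It also assumes valid JSON and keys without escape sequences. The first function stands in for the indexing stage shared by both approaches; the second navigates with those indexes alone, without building a tape.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for the indexing stage: record the offset of every
// structural character ({ } [ ] : ,) and every opening quote,
// ignoring anything that appears inside a string.
std::vector<std::size_t> structural_indexes(const std::string& json) {
    std::vector<std::size_t> idx;
    bool in_string = false;
    for (std::size_t i = 0; i < json.size(); ++i) {
        const char c = json[i];
        if (in_string) {
            if (c == '\\') {
                ++i;  // skip the escaped character
            } else if (c == '"') {
                in_string = false;
            }
            continue;
        }
        switch (c) {
            case '"':
                in_string = true;
                idx.push_back(i);
                break;
            case '{': case '}': case '[': case ']': case ':': case ',':
                idx.push_back(i);
                break;
            default:
                break;
        }
    }
    return idx;
}

// On Demand-style navigation (hypothetical helper): walk the indexes,
// tracking nesting depth, and return the offset just after the ':' of
// the first top-level field named `key`. No tape or tree is built.
// Returns std::string::npos when the key is not found.
std::size_t find_field(const std::string& json, const std::string& key) {
    const std::vector<std::size_t> idx = structural_indexes(json);
    int depth = 0;
    for (std::size_t k = 0; k < idx.size(); ++k) {
        const char c = json[idx[k]];
        if (c == '{' || c == '[') {
            ++depth;
        } else if (c == '}' || c == ']') {
            --depth;
        } else if (c == '"' && depth == 1 && k + 1 < idx.size() &&
                   json[idx[k + 1]] == ':') {
            // A quote followed by ':' at depth 1 is a top-level key.
            const std::size_t start = idx[k] + 1;
            const std::size_t end = start + key.size();
            if (end < json.size() && json[end] == '"' &&
                json.compare(start, key.size(), key) == 0) {
                return idx[k + 1] + 1;
            }
        }
    }
    return std::string::npos;
}
```

Calling `find_field` on `{"a":{"b":1},"b":[2,3]}` with the key `"b"` skips the nested `"b"` inside `"a"` by depth tracking alone and lands on the `[` of the top-level array. A DOM-style parser would instead materialize every value on the tape before any lookup.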

lemire commented 3 years ago

@eddelbuettel Note that both PRs offer a path forward. I would prefer On Demand as I expect it would provide better performance opportunities ultimately. But we will continue to support both DOM and On Demand.

lemire commented 3 years ago

Note that twitter has few numbers. It is mostly made of strings. I cannot understand why On Demand would do poorly in such a case since there is no duplicated processing when strings are involved.

eddelbuettel commented 3 years ago

My preference is to do what makes our life easiest in the longer run. On Demand seems like the way to go; I don't consider the (temporary?) regression in performance a show stopper and we have 'enough of it' to allow for this.

But it is your code. If you'd rather tweak it more that is fine by me too. Marginal preference to committing 'something' to also clean up towards 'only C++11 used here'. So maybe ... I should commit #70 first? I'm moderately easy ...