eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

clarification on the query parameter #77

Closed JosiahParry closed 1 year ago

JosiahParry commented 1 year ago

First off: how have I not heard of this package before? The speed improvements over every other JSON parsing library I've encountered are insane. One thing I am particularly interested in is the query parameter. My use case is a GeoJSON file from which I want to extract everything except the geometry. I'm not able to understand how to use the query parameter so that I can improve the performance by not parsing the geometry field in each feature. Is there documentation on the type of syntax that should be used?

{
   "type":"FeatureCollection",
   "features":[
      {
         "type":"Feature",
         "properties":{
            "id":1
         },
         "geometry":{
            "type":"Point",
            "coordinates":[
               1,
               6
            ]
         }
      },
      {
         "type":"Feature",
         "properties":{
            "id":2
         },
         "geometry":{
            "type":"Point",
            "coordinates":[
               2,
               7
            ]
         }
      },
      {
         "type":"Feature",
         "properties":{
            "id":3
         },
         "geometry":{
            "type":"Point",
            "coordinates":[
               3,
               8
            ]
         }
      },
      {
         "type":"Feature",
         "properties":{
            "id":4
         },
         "geometry":{
            "type":"Point",
            "coordinates":[
               4,
               9
            ]
         }
      },
      {
         "type":"Feature",
         "properties":{
            "id":5
         },
         "geometry":{
            "type":"Point",
            "coordinates":[
               5,
               10
            ]
         }
      }
   ]
}
JosiahParry commented 1 year ago

Update: I realized that JSON Pointers have their own syntax, which seems exceptionally limited: a pointer addresses one specific value and has no way to say "everything except this field". So I don't think it's possible.
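
To make that limitation concrete, here is a minimal sketch of how a JSON Pointer (RFC 6901) is resolved. It is written in Python purely for illustration and is not how RcppSimdJson or simdjson implement the query parameter; `resolve_pointer` is a hypothetical helper, and it walks an already-parsed document, so it shows the addressing rules only, not any parsing shortcut. The sample is a shortened two-feature version of the GeoJSON above.

```python
import json


def resolve_pointer(doc, pointer):
    """Resolve an RFC 6901 JSON Pointer against an already-parsed document.

    Hypothetical helper for illustration only.
    """
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    node = doc
    for token in pointer[1:].split("/"):
        # Unescape in this order: "~1" -> "/", then "~0" -> "~"
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]  # array steps are plain integer indices
        else:
            node = node[token]  # object steps are exact key lookups
    return node


# Shortened two-feature version of the GeoJSON from the question
geojson = json.loads("""
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature", "properties": {"id": 1},
    "geometry": {"type": "Point", "coordinates": [1, 6]}},
   {"type": "Feature", "properties": {"id": 2},
    "geometry": {"type": "Point", "coordinates": [2, 7]}}
 ]}
""")

# One pointer selects exactly one value
first = resolve_pointer(geojson, "/features/0/properties")

# There is no wildcard or exclusion syntax, so each feature needs its own pointer
all_props = [
    resolve_pointer(geojson, f"/features/{i}/properties")
    for i in range(len(geojson["features"]))
]

print(first)      # {'id': 1}
print(all_props)  # [{'id': 1}, {'id': 2}]
```

Every step in a pointer is either an exact key or an integer index, which is why "all features, minus geometry" cannot be written as a single pointer.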

eddelbuettel commented 1 year ago

Yes, we are not doing any magic: we are simply asking the rather magical simdjson library to parse for us, and it does its thing relative to the JSON spec. So, OK to close?

lemire commented 1 year ago

@JosiahParry

"I can improve the performance by not parsing the geometry field in each feature. Is there documentation on the type of syntax that should be used?"

You definitely can do this in C++ using our main API.

...

More generally, if there were some kind of syntax/query language... where you can load a JSON document selectively, that would be great... but it may be harder to design than it sounds.

eddelbuettel commented 1 year ago

Right, but we don't currently expose that. So, as the saying goes, "patches welcome".

JosiahParry commented 1 year ago

@lemire if I could even fake my way around C++ I would try, but I can't :) Regardless, this is blazingly fast and memory efficient, so I don't feel too bad just throwing out the geometry after parsing it.

# A tibble: 4 × 13
  expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
  <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
1 jsonlite::fromJSON(res)      2.15s    2.15s     0.464    48.8MB    1.39      1     3      2.15s <NULL> <Rprofmem>
2 rjson::fromJSON(res)         1.17s    1.17s     0.852    54.9MB    0         1     0      1.17s <NULL> <Rprofmem>
3 jsonify::from_json(res)      8.23s    8.23s     0.122    25.8MB    0.243     1     2      8.23s <NULL> <Rprofmem>
4 RcppSimdJson::fparse(res)  73.34ms  76.71ms    11.7        14MB    1.95      6     1   511.56ms <NULL> <Rprofmem>
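
The parse-then-discard approach described in this comment can be sketched as follows. This is a Python illustration using only the standard library, not the R code behind the benchmark above; `features_without_geometry` is a hypothetical name, and in R the same filtering would be applied to whatever `RcppSimdJson::fparse()` returns.

```python
import json


def features_without_geometry(text):
    """Parse the whole document, then drop the "geometry" member of each feature.

    Hypothetical helper: the geometry is still parsed, it is only discarded afterwards.
    """
    doc = json.loads(text)  # full parse, geometry included
    return [
        {key: value for key, value in feature.items() if key != "geometry"}
        for feature in doc["features"]
    ]


# Shortened two-feature version of the GeoJSON from the question
sample = """
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature", "properties": {"id": 1},
    "geometry": {"type": "Point", "coordinates": [1, 6]}},
   {"type": "Feature", "properties": {"id": 2},
    "geometry": {"type": "Point", "coordinates": [2, 7]}}
 ]}
"""

stripped = features_without_geometry(sample)
print(stripped)
```

The cost of parsing the geometry is still paid here; the sketch only shows that removing it afterwards is a small post-processing step.
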
eddelbuettel commented 1 year ago

(PSA for @lemire: That is output from a somewhat "special" benchmarking package which opines that mixing different units in the same column is a good idea.)