IQSS / dataverse-client-r

R Client for Dataverse Repositories
https://iqss.github.io/dataverse-client-r

Faster JSON parser #112

Open kuriwaki opened 2 years ago

kuriwaki commented 2 years ago

For dataset retrieval, we download and parse JSON metadata multiple times. For example, in get_dataframe_by_name, get_fileid.character first finds the dataset id via https://github.com/IQSS/dataverse-client-r/blob/4775a92360569adb6e693ad6db940f89530eeb8d/R/utils.R#L29 and then fetches the list of ids for each file in the dataset at https://github.com/IQSS/dataverse-client-r/blob/4775a92360569adb6e693ad6db940f89530eeb8d/R/get_dataset.R#L101

It turns out the time this takes is non-trivial. Most of the time is spent downloading the JSON from the URL; a small remaining fraction (< 1%) is spent parsing the JSON file. We could make a minor improvement in speed by switching to a faster parser, RcppSimdJson (https://github.com/eddelbuettel/rcppsimdjson), which is about 2-10x faster in my tests, per below. The current jsonlite::fromJSON seems to be optimal for data science pipelines where we deal with data, but here we are only interested in bits of metadata. An even bigger win would be to download the metadata only once.

Switching packages will require changes in at least 20 places where jsonlite is used.

library(jsonlite) # currently used
library(RcppSimdJson) # potential replacement

# sample: https://demo.dataverse.org/file.xhtml?persistentId=doi:10.70122/FK2/PPIAXE/MHDB0O
js_url <- "https://demo.dataverse.org/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.70122/FK2/PPIAXE"

# download once
tmp <- tempfile()
download.file(js_url, tmp)

microbenchmark::microbenchmark(
  statusquo = jsonlite::fromJSON(js_url), # what is currently being called
  dl = curl::curl_download(js_url, tempfile()), # separating download from parsing
  jsonlite = jsonlite::fromJSON(tmp),  # parsing, without download
  RcppJson = RcppSimdJson::fload(tmp), # replace with Rcpp
  RcppJson_file = RcppSimdJson::fload(tmp, query = "/datasetVersion/files"), # only files data
  RcppJson_id = RcppSimdJson::fload(tmp, query = "/id"),  # stop at dataset /id
  times = 30
)
#> Unit: microseconds
#>           expr        min         lq        mean      median         uq        max neval
#>      statusquo 365097.709 371235.626 374774.8021 373752.4175 378357.084 387006.459    30
#>             dl 361154.168 364100.750 369091.1201 369528.3965 371835.459 378629.667    30
#>       jsonlite   1487.834   2743.500   3248.0424   2994.1465   3270.959   8380.876    30
#>       RcppJson    186.876    262.001    438.1298    345.3130    468.042   2335.417    30
#>  RcppJson_file    136.292    224.001    334.5173    301.6465    409.376    688.001    30
#>    RcppJson_id    138.459    177.876    287.7714    263.3965    362.792    586.750    30

Created on 2022-01-05 by the reprex package (v2.0.1)

wibeasley commented 2 years ago

Wow, that's a big difference. I like how your first two benchmarks isolate downloading vs parsing.

If it helps, here are the results from my home desktop; I have fiber internet and a ~5-year-old processor. I see the same 7.2x speedup for parsing as you.

Unit: microseconds
          expr        min         lq        mean      median         uq        max neval cld
     statusquo 170548.243 178850.316 182609.5076 182405.6315 186942.899 193960.856    30   c
            dl 168463.594 176699.861 181490.2084 181860.3195 185492.156 195547.737    30   c
      jsonlite   2786.174   3237.758   4953.9265   5831.6465   5977.747   6232.820    30  b 
      RcppJson    349.556    478.450    684.4094    740.9170    895.923   1018.478    30 a  
 RcppJson_file    294.919    548.481    635.1559    642.5110    771.255    884.151    30 a  
   RcppJson_id    256.885    531.879    553.2711    558.8955    597.987    795.705    30 a 

Even though a 7-10x speedup is nice, I'm not sure it will be noticed by the user. The real bottleneck is downloading the file (i.e., 0.3 sec for you and 0.18 sec for me). The parsing duration is a small fraction of the downloading duration.


This is probably more trouble than it's worth, but I'm thinking aloud: are the two packages essentially interchangeable? That is, do they accept the same parameter (i.e., a URL) and return a nested list with the exact same structure?

If so, could the dataverse package use RcppSimdJson if it's available (using requireNamespace("RcppSimdJson", quietly = TRUE)) and fall back to jsonlite if it's not?

RcppSimdJson could be a Suggests dependency; this approach is explained well in the "Guarding the use of a suggested package" section of the 2nd edition of R Packages.
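
A minimal sketch of that fallback, as a hypothetical internal helper (parse_json() and its query argument are illustrative, not existing code):

# Hypothetical helper: prefer RcppSimdJson when installed, else jsonlite.
parse_json <- function(path_or_url, query = NULL) {
  if (requireNamespace("RcppSimdJson", quietly = TRUE)) {
    RcppSimdJson::fload(path_or_url, query = query)
  } else {
    # jsonlite has no query argument; return the full tree and let the
    # caller subset with [[ as needed
    jsonlite::fromJSON(path_or_url)
  }
}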

I'm a little concerned RcppSimdJson is not as easily deployed. The RcppSimdJson library has had only three dependencies over the past two years, but I see the current minimum requirements of jsonlite are almost nothing (R won't even work without the methods package). Its suggested dependencies are almost identical to dataverse's; the sf package is the only real addition.

kuriwaki commented 2 years ago

Overall, I think we can start with the parallel Suggests approach, but given that downloading, not parsing, is the real bottleneck, it is not a high priority.

Re:

Are the two packages essentially interchangeable? I mean, do they accept the same parameter (ie, url) and spit out a nested list with the exact same structure?

They are not identical (their object.size values differ slightly and the outputs don't pass base::identical()), but I can't find a meaningful difference yet; they may be identical in the aspects our client package cares about.
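
For reference, a quick way to probe where the two outputs diverge, reusing tmp from the benchmark above:

# Compare the two parsers' output on the same downloaded file.
a <- jsonlite::fromJSON(tmp)
b <- RcppSimdJson::fload(tmp)
identical(a, b)                    # FALSE, per the note above
c(object.size(a), object.size(b))  # slightly different sizes
all.equal(a, b)                    # reports where the structures differ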

Re:

I'm a little concerned RcppSimdJson is not easily deployed. The RcppSimdJson library has only three dependencies in the past two years.

I thought we want to depend on packages that in turn have few dependencies themselves? You're right that jsonlite has no real dependencies. RcppSimdJson is also minimal, but it does rely on Rcpp.

Re:

I'm not sure it will be noticed by the user. The real bottleneck is downloading the file

Yes, maybe we tackle the download first.

mtmorgan commented 2 months ago

I think yyjsonr is actually faster these days (although it could be challenging to get identical results due to how data.frame objects etc. are simplified).

microbenchmark::microbenchmark(
  jsonlite = jsonlite::fromJSON(tmp)[["datasetVersion"]][["files"]],
  RcppJson = RcppSimdJson::fload(tmp, query = "/datasetVersion/files"),
  yyjsonr = yyjsonr::read_json_file(tmp)[["datasetVersion"]][["files"]],
  times = 50
)
## Unit: microseconds
##      expr      min       lq      mean    median       uq      max neval
##  jsonlite 2206.128 2223.799 2336.2784 2248.3990 2369.021 4097.622    50
##  RcppJson   97.375   98.605  104.7312  103.2995  106.313  164.000    50
##   yyjsonr   68.388   70.356   72.6192   71.1145   72.324  125.214    50

A very cheap way to avoid repeatedly paying the download cost is to memoise httr::GET(), e.g.,

GETm <- NULL  # placeholder; replaced with the memoised version at load time
.onLoad <- function(...) {
    # bind the memoised wrapper when the package namespace is loaded
    GETm <<- memoise::memoise(httr::GET)
}

and then use GETm() instead of httr::GET() in the code. A memoised function stores its result locally (for each unique set of arguments), so the API call is only executed once; subsequent calls are served from an (in-memory or on-disk) cache.

> system.time(GETm(js_url))
   user  system elapsed 
  0.021   0.004   0.535 
> system.time(GETm(js_url))
   user  system elapsed 
  0.014   0.000   0.014 

Invoking GETm on a different URL creates a different cache entry.

This works well when the URL always returns the same value, as I think it does for (most?) calls in the client; it would not work well for, e.g., live stock quote updates, where the memoised value would always be the first value.
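
If a cached response ever does need refreshing, memoise also provides forget() to clear a memoised function's cache:

memoise::forget(GETm)  # drop cached responses; the next GETm(js_url) hits the network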

For code that looks like

https://github.com/IQSS/dataverse-client-r/blob/8fe5184e879c47e6ba27ce06fbc30c39738433d4/R/get_file_metadata.R#L51-L53

it would make sense to refactor this so the code chunk is itself a memoised function: checking the status and extracting the content is a one-time operation for each (u, key) pair.
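
A sketch of that refactor (get_json_content() is a hypothetical name; the header follows the client's usual X-Dataverse-key convention):

# Hypothetical helper: memoise the whole GET + status check + content
# extraction, so each (u, key) pair hits the API at most once.
get_json_content <- memoise::memoise(function(u, key) {
  r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key))
  httr::stop_for_status(r)
  httr::content(r, as = "text", encoding = "UTF-8")
})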

The cache type, location, and properties (maximum size, expiration time) are controlled by the cache= argument to memoise(). It might make sense to create separate memoised functions for different use cases (e.g., file downloads might be cached to disk with a fairly large size limit; JSON metadata might be cached in memory with shorter retention times).
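
As a sketch, memoise() accepts caches from the cachem package via cache=; the wrapper names GETm_files and GETm_meta here are illustrative:

# Separate memoised wrappers with different caching policies.
GETm_files <- memoise::memoise(
  httr::GET,
  cache = cachem::cache_disk(max_size = 1024 * 1024^2)  # on disk, ~1 GB cap
)
GETm_meta <- memoise::memoise(
  httr::GET,
  cache = cachem::cache_mem(max_age = 600)  # in memory, 10-minute expiry
)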

If this seems interesting I could prepare a pull request.

pdurbin commented 2 months ago

@mtmorgan awesome that you're willing to dive into the code. As you may have seen, we are looking for a new maintainer: https://github.com/IQSS/dataverse-client-r/issues/21#issuecomment-2166131788

kuriwaki commented 2 months ago

Does yyjsonr depend on the caching/memoise change, or are those two things independent?

@beniaminogreen had started trying to allow caching in #135. Other than that note, a pull request sounds great.

mtmorgan commented 2 months ago

yyjsonr would be a jsonlite replacement, independent of memoise; I agree with the above that parsing JSON per se is probably not that limiting in terms of performance.

Thanks for pointing to #135; I would probably suggest generalizing caching by working at a lower level (the GET request), so I will explore that in light of #135.