atorus-research / datasetjson

Read and write CDISC Dataset JSON files
https://atorus-research.github.io/datasetjson/
Apache License 2.0

Feature Request: Speed improvement #32

Open ramiromagno opened 5 months ago

ramiromagno commented 5 months ago

Feature Idea

Depend on rcpp.simdjson or yyjsonr, instead of jsonlite. The link contains a nice benchmark.

mstackhouse commented 5 months ago

@ramiromagno looking into those packages - any suggestion for higher speed performance on write? rcpp.simdjson has read functions but I wasn't seeing as much for writing.

ramiromagno commented 5 months ago

You're right, it seems rcpp.simdjson only has read functions.

yyjsonr looks promising though.
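A rough way to check the write side before switching is to time both serialisers on the same data frame. This is only a sketch: the data frame is a made-up stand-in rather than real test data, `time_write()` is a hypothetical helper, and each package is skipped (giving `NA`) if it is not installed. It uses `jsonlite::toJSON()` and `yyjsonr::write_json_str()` with their defaults, so the two outputs are not guaranteed to be formatted identically.

```r
# Toy data frame standing in for a real dataset (hypothetical, not the AE test data).
df <- data.frame(
  USUBJID = sprintf("SUBJ-%05d", 1:50000),
  AESEQ   = rep(1:5, length.out = 50000),
  AETERM  = rep(c("HEADACHE", "NAUSEA", "FATIGUE"), length.out = 50000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: time one serialisation of `df` to a JSON string.
# Returns elapsed seconds, or NA if the package is not installed.
time_write <- function(pkg, writer) {
  if (!requireNamespace(pkg, quietly = TRUE)) return(NA_real_)
  system.time(writer(df))[["elapsed"]]
}

timings <- c(
  jsonlite = time_write("jsonlite", function(x) jsonlite::toJSON(x)),
  yyjsonr  = time_write("yyjsonr",  function(x) yyjsonr::write_json_str(x))
)
print(timings)
```

A single run on a small data frame is noisy, so this is only useful as a first sanity check.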

mstackhouse commented 1 month ago

Using my yyjson_switch branch in a container with 2 cores and 16 GB of RAM:

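library(datasetjson)  # assumes the yyjson_switch branch is the installed version
library(testthat)     # provides test_path()

# Read the AE test dataset from the package's test data
ae <- read_dataset_json(test_path("testdata", "ae.json"))

# Stack 100,000 copies of AE to build a large dataset
ae_100 <- dplyr::bind_rows(rep(list(ae), 100000))

# Rebuild the column metadata from each column's attributes
ds_metadata <- dplyr::bind_rows(purrr::map(ae, \(x) attributes(x)))
ds_metadata['name'] <- names(ae)

ds_json <-
  dataset_json(ae_100, "SDTM.AE", "AE", "Adverse Events", ds_metadata)

# Time the write
start <- Sys.time()
write_dataset_json(ds_json, file = "test.json")
print(Sys.time() - start)
#> Time difference of 42.58133 secs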

In total that's 7,400,000 rows and 37 columns. The total output size is 1.8 GB.

A quick test against the current dev branch, which uses jsonlite, took 2.141051 mins (about 128 secs), so the yyjson branch is roughly 3x faster on write.
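To compare the two timings above on the same unit, the arithmetic can be sketched in base R. The numbers are the ones quoted in this comment; the variable names are just for illustration, and the MB/s figure assumes 1 GB = 1000 MB.

```r
# Figures quoted in the comment above.
yyjson_secs   <- 42.58133
jsonlite_secs <- as.numeric(as.difftime(2.141051, units = "mins"), units = "secs")
n_rows        <- 7400000
size_gb       <- 1.8

# Derived quantities for the yyjson_switch run.
speedup      <- jsonlite_secs / yyjson_secs   # how many times faster than jsonlite
rows_per_sec <- n_rows / yyjson_secs          # write throughput in rows
mb_per_sec   <- size_gb * 1000 / yyjson_secs  # assumes 1 GB = 1000 MB

cat(sprintf("jsonlite: %.1f s, yyjson branch: %.1f s\n", jsonlite_secs, yyjson_secs))
cat(sprintf("speedup: %.2fx\n", speedup))
cat(sprintf("throughput: %.0f rows/s, %.1f MB/s\n", rows_per_sec, mb_per_sec))
```

Each figure comes from a single run, so the ratio is indicative rather than a proper benchmark.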