traversc opened this issue 3 weeks ago
> (On a side note, I would like to understand whether `R_serialize` is part of the stable C API or whether it can be added in. I counted at least 16 packages on CRAN that are using this interface. It is not in the list of excluded non-API functions and `R CMD check` does not check for it. Any insights would be appreciated.)
`R_Serialize` and `R_Unserialize` are currently marked as experimental API. The designations are incomplete and still being added; they can be tracked using the front end created by @yutannihilation here: https://yutannihilation.github.io/R-fun-API/ Having said that, it wouldn't make much sense to me for these functions to leave the API unless alternatives were already in place.
Oh, these APIs were marked as experimental only a few days ago. Now WRE has a section about serialization. Great.
https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Custom-serialization-input-and-output-1
commit: https://github.com/r-devel/r-svn/commit/0b1eeb496d5ba87634a4ced4562d2b016e500aac
Introduction
Hi everyone, I thought I'd write up a report summarizing some experiments I've conducted over the past year on how R could improve serialization performance.
Colleagues often express frustration with the slowness of `saveRDS`, or with RStudio startup delays caused by `save.image` and `load`. However, these processes can be made significantly more efficient. I'm hoping to spark a discussion and gain support for making improvements.

To motivate the discussion, here are some benchmarks:
For these tests, I saved a mixed numeric and text dataset of about 1 GB. `saveRDS` was used with default settings, `base::serialize` was used with no compression and `XDR = FALSE`, and qs2 utilized the `R_serialize` C API with a ZSTD block compression scheme.

Below, I've outlined several areas where I believe improvements can be made, roughly ordered from lowest- to highest-hanging fruit. Please let me know if you have any differing opinions or if I've overlooked something.
(On a side note, I would like to understand whether `R_serialize` is part of the stable C API or whether it can be added in. I counted at least 16 packages on CRAN that are using this interface. It is not in the list of excluded non-API functions and `R CMD check` does not check for it. Any insights would be appreciated.)

List of things that could be improved
Compression algorithm
Introducing ZSTD as an option would be a significant enhancement. Major platforms such as AWS and Chrome have already adopted ZSTD compression, so it seems like it would be a worthy addition to R.
File IO
R serialization via `saveRDS` uses an IO buffer of 16384 bytes. On most systems I've tested, this small buffer size adds overhead. The same buffer is also used for `gzip`, and it might likewise be too small for efficient compression.

Below is a plot of writing 1 GB of data to a raw C file descriptor on disk. You can see that small buffers decrease performance.
XDR
This has been discussed on R-devel, but an option for setting XDR in `saveRDS` would be much appreciated. For raw numeric data, `base::serialize` with `XDR = FALSE` improves speed by a whopping factor of 3.

Byte shuffling
Byte shuffling enhances both speed and compression for numeric data; it is described in this StackExchange post. It can be applied heuristically to blocks of data without much overhead.
Below is a comparison plot of a numeric dataset with and without byte shuffling.
Multithreading
This is a hard one to get right. Most data serialization libraries do not take full advantage of multithreading, and only multithread compression after a block of data has been serialized. Ideally, compression and IO should occur asynchronously to see significant benefits, especially during deserialization.
String serialization
Serialization format doesn't matter as much as I once thought, but string handling is one place where I think there is room for improvement.
Each string currently incurs 64 bits of overhead: the first 32 are used only for encoding and type and the last 32 are used for the length of the string. If you have a large text dataset, this overhead becomes substantial.
For a character vector (STRSXP), storing type information is unnecessary since each element is a CHARSXP. Encoding requires only 3 bits, and most strings do not need a full 32 bits for size. There are numerous ways this overhead could be reduced.
Conclusion
I believe these areas present opportunities for improvement and I hope this overview helps. I'd appreciate any feedback or additional insights.