RConsortium / marshalling-wg

ISC Working Group 'Marshaling and Serialization in R' (anno May 2024)

A report on improving R serialization performance #3

Open traversc opened 3 weeks ago

traversc commented 3 weeks ago

Introduction

Hi everyone, I thought I'd write up a report summarizing some experiments I've conducted over the past year on how R could improve serialization performance.

Colleagues often express frustration with the slowness of saveRDS or RStudio startup delays caused by save.image and load. However, these processes can be made significantly more efficient. I'm hoping to spark a discussion and gain support for making improvements.

To motivate discussion, here are some benchmarks:

| Algorithm        | Save Time (s) | Read Time (s) | Compression (x) |
|------------------|---------------|---------------|-----------------|
| saveRDS          | 35.6          | 8.45          | 2.83            |
| base::serialize  | 3.01          | 5.90          | 1.07            |
| qs2 (1 thread)   | 3.59          | 4.98          | 3.12            |
| qs2 (4 threads)  | 1.66          | 4.43          | 3.12            |

For these tests, I saved a mixed numeric and text dataset of about 1 GB. saveRDS was used with default settings, base::serialize was used with no compression and xdr = FALSE, and qs2 used the R_serialize C API with a ZSTD block compression scheme.

Below, I've outlined several areas where I believe improvements can be made, roughly ordered from lowest to highest hanging fruit. Please let me know if you have any differing opinions or if I've overlooked something.

(On a side note, I would like to understand whether R_serialize is part of the stable C API or whether it can be added in. I counted at least 16 packages on CRAN that are using this interface. It is not in the list of excluded non-API functions and R CMD check does not check for it. Any insights would be appreciated.)

List of things that could be improved

Compression algorithm

Introducing ZSTD as an option would be a significant enhancement. Major technology projects such as AWS and Chrome have adopted ZSTD compression, so it seems a worthy addition to R.

File IO

R serialization via saveRDS uses an I/O buffer of 16384 bytes. On most systems I've tested, this small buffer size adds overhead. The same buffer also feeds the gzip compressor, where it may likewise be too small for efficient compression.

Below is a plot of writing 1 GB of data to a raw C file descriptor on disk. You can see that small buffers decrease performance.

[Figure: mac_fd_blocksize_test — write throughput vs. buffer size for 1 GB written to a raw file descriptor]

XDR

This has been discussed on R-devel, but an option for disabling XDR in saveRDS would be much appreciated. For raw numeric data, base::serialize with xdr = FALSE is faster by a whopping factor of 3.

Byte shuffling

Byte shuffling enhances both speed and compression for numeric data; the technique is described in this StackExchange post. It can be applied heuristically to blocks of data without much overhead.

Below is a comparison plot of a numeric dataset with and without byte shuffling.

[Figure: byte_shuffling_comparison — a numeric dataset serialized with and without byte shuffling]

Multithreading

This is a hard one to get right. Most data serialization libraries do not take full advantage of multithreading and only multithread compression after serialization of a block of data. Ideally, compression and IO should occur asynchronously to see significant benefits, especially during deserialization.

String serialization

Serialization format doesn't matter as much as I once thought, but string handling is one area where I see room for improvement.

Each string currently incurs 64 bits of overhead: the first 32 bits store only the type and encoding, and the last 32 bits store the string's length. For a large text dataset, this overhead becomes substantial.

For a character vector (STRSXP), storing type information is unnecessary since each element is a CHARSXP. Encoding requires only 3 bits, and most strings do not need a full 32 bits for size. There are numerous ways this overhead could be reduced.

Conclusion

I believe these areas present opportunities for improvement and I hope this overview helps. I'd appreciate any feedback or additional insights.

shikokuchuo commented 3 weeks ago

> (On a side note, I would like to understand whether R_serialize is part of the stable C API or whether it can be added in. I counted at least 16 packages on CRAN that are using this interface. It is not in the list of excluded non-API functions and R CMD check does not check for it. Any insights would be appreciated.)

R_Serialize and R_Unserialize are currently marked as experimental API. The designations are incomplete and still being added; they can be tracked using the front end created by @yutannihilation here: https://yutannihilation.github.io/R-fun-API/ That said, for me it wouldn't make much sense for these functions to leave the API unless alternatives were already in place.

yutannihilation commented 3 weeks ago

Oh, it was only a few days ago that these APIs were marked as experimental. WRE now has a section about serialization. Great.

https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Custom-serialization-input-and-output-1

commit: https://github.com/r-devel/r-svn/commit/0b1eeb496d5ba87634a4ced4562d2b016e500aac