elastic / rally

Macrobenchmarking framework for Elasticsearch
Apache License 2.0

Add support for zstd-compressed corpora #1781

Closed danielmitterdorfer closed 1 year ago

danielmitterdorfer commented 1 year ago

Rally supports various compression formats such as gz or bzip. It does not support the zstd format, which performed significantly better in both disk usage and decompression speed in my experiments. I compressed a 183 GB corpus with pbzip2 and pzstd, each at the maximum compression level supported by the respective tool.

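| Format | Size on disk [bytes] | Size on disk [GB] | Relative size [%] |
|--------|----------------------|-------------------|-------------------|
| bzip   | 18613471805          | 18                | 100               |
| zstd   | 11215205385          | 11                | 60                |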

Decompression is also considerably faster, by a factor of roughly 2.7 (times measured with `time`; the table shows the `real` value, i.e. wall-clock time):

| Format | Time to decompress [s] | Relative time [%] |
|--------|------------------------|-------------------|
| bzip   | 388                    | 100               |
| zstd   | 144                    | 37                |
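
The relative columns in the two tables can be recomputed from the raw figures reported above. This is a minimal sketch of that arithmetic; the helper name `relative_percent` is made up for illustration and is not part of Rally.

```python
def relative_percent(value, baseline):
    """Return value expressed as a percentage of baseline."""
    return 100.0 * value / baseline


# Raw figures taken from the tables in this issue (bzip is the baseline).
BZIP_BYTES, ZSTD_BYTES = 18_613_471_805, 11_215_205_385
BZIP_SECONDS, ZSTD_SECONDS = 388, 144

size_pct = relative_percent(ZSTD_BYTES, BZIP_BYTES)
time_pct = relative_percent(ZSTD_SECONDS, BZIP_SECONDS)
speedup = BZIP_SECONDS / ZSTD_SECONDS

print(f"zstd size relative to bzip: {size_pct:.1f}%")
print(f"zstd decompression time relative to bzip: {time_pct:.1f}%")
print(f"decompression speedup: {speedup:.2f}x")
```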

Therefore, I propose adding zstd support to Rally, analogous to the existing bzip support: the fast path would require pzstd to be on PATH, and a fallback could be based on the Python zstd implementation.
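
The proposed fast path with a fallback could look roughly like the sketch below. It is not Rally's actual code: the function names `pick_zstd_strategy` and `decompress_zstd` are hypothetical, and the fallback assumes the third-party `zstandard` package as the Python zstd implementation.

```python
import shutil
import subprocess


def pick_zstd_strategy(which=shutil.which):
    """Return "pzstd" if the binary is found on PATH, otherwise "python"."""
    return "pzstd" if which("pzstd") else "python"


def decompress_zstd(src, dest, which=shutil.which):
    """Decompress the .zst file src into dest using the fastest available option."""
    if pick_zstd_strategy(which) == "pzstd":
        # Fast path: -d decompresses, -f overwrites an existing output file,
        # -o names the output file.
        subprocess.run(["pzstd", "-d", "-f", "-o", dest, src], check=True)
    else:
        # Fallback (assumption): the "zstandard" package is installed.
        # It is imported lazily so the fast path works without it.
        import zstandard

        with open(src, "rb") as fin, open(dest, "wb") as fout:
            zstandard.ZstdDecompressor().copy_stream(fin, fout)
```

Passing the PATH lookup in as the `which` parameter keeps the choice between the two paths testable on machines where neither pzstd nor the Python package is installed.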

For reference: