fastbuild / fastbuild

High performance build system for Windows, OSX and Linux. Supporting caching, network distribution and more.
https://fastbuild.org

Allow zstd as another compression library ? #982

Open avudnez opened 1 year ago

avudnez commented 1 year ago

Hi,

We're wondering if there is interest in being able to use zstd instead of lz4. Since lz4 was introduced in fastbuild, zstd has appeared and become very fast. I compared the two on a representative data set of object files and preprocessed cpp files from Unreal Engine.

In production, we currently use compression level 6, so I compared similar levels with both compressors. This is a basic comparison; I haven't investigated every available option and tuning knob.

### Compressing preprocessed sources ###

lz4
```
Benchmarking levels from 3 to 6
 3#preprocessed.tar  : 499548160 ->  81232549 (6.150), 257.7 MB/s ,5820.7 MB/s
 4#preprocessed.tar  : 499548160 ->  79850813 (6.256), 217.1 MB/s ,5864.9 MB/s
 5#preprocessed.tar  : 499548160 ->  79002372 (6.323), 176.1 MB/s ,5898.8 MB/s
 6#preprocessed.tar  : 499548160 ->  78577426 (6.357), 141.6 MB/s ,5931.8 MB/s
```

zstd
```
 3#preprocessed.tar  : 499548160 ->  66413938 (x7.522),  603.3 MB/s, 2629.2 MB/s
 4#preprocessed.tar  : 499548160 ->  66409041 (x7.522),  586.7 MB/s, 2626.9 MB/s
 5#preprocessed.tar  : 499548160 ->  59095949 (x8.453),  230.0 MB/s, 2742.3 MB/s
 6#preprocessed.tar  : 499548160 ->  56596397 (x8.827),  180.6 MB/s, 2907.2 MB/s
```

### Compressing object files ###

lz4
```
Benchmarking levels from 3 to 6
File(s) bigger than LZ4's max input size; testing 2016 MB only...
 3#clang-editor.tar  :2113929216 -> 585036887 (3.613), 196.9 MB/s ,5471.5 MB/s
 4#clang-editor.tar  :2113929216 -> 577446676 (3.661), 165.7 MB/s ,5521.5 MB/s
 5#clang-editor.tar  :2113929216 -> 573084453 (3.689), 135.5 MB/s ,5623.1 MB/s
 6#clang-editor.tar  :2113929216 -> 570342481 (3.706), 109.0 MB/s ,5668.8 MB/s
```

zstd
```
 3#clang-editor.tar  :2133340160 -> 423754407 (x5.034),  517.2 MB/s, 2008.8 MB/s
 4#clang-editor.tar  :2133340160 -> 421506636 (x5.061),  456.6 MB/s, 2008.8 MB/s
 5#clang-editor.tar  :2133340160 -> 406113167 (x5.253),  211.0 MB/s, 2008.8 MB/s
 6#clang-editor.tar  :2133340160 -> 392767071 (x5.432),  168.1 MB/s, 2168.0 MB/s
```
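
The trade-off can be tallied directly from the level-6 rows of the benchmark output above (a small sketch; the tuples simply copy the ratio, compression speed, and decompression speed columns):

```python
# Level-6 results copied from the benchmark output above:
# (compression ratio, compress MB/s, decompress MB/s)
lz4_pre  = (6.357, 141.6, 5931.8)   # preprocessed sources, lz4
zstd_pre = (8.827, 180.6, 2907.2)   # preprocessed sources, zstd
lz4_obj  = (3.706, 109.0, 5668.8)   # object files, lz4
zstd_obj = (5.432, 168.1, 2168.0)   # object files, zstd

for name, lz4, zstd in (("preprocessed", lz4_pre, zstd_pre),
                        ("objects", lz4_obj, zstd_obj)):
    ratio_gain = zstd[0] / lz4[0]        # how much smaller zstd's output is
    decomp_slowdown = lz4[2] / zstd[2]   # how much slower zstd decompresses
    print(f"{name}: {ratio_gain:.2f}x smaller output, "
          f"{decomp_slowdown:.1f}x slower decompression")
```

This is where the "much better ratio, but 2x to 3x slower decompression" summary below comes from.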

Basically, what I take from this is that zstd achieves much better compression ratios at considerably higher speeds, but is around 2x to 3x slower to decompress. However, decompression speed is arguably not the limiting factor in build speed (even in the most decompression-intensive case of a full cache hit), since preprocessing is orders of magnitude slower, and we're still talking about 2+ GB/s, which is faster than most NVMe SSDs anyway.

Do you think this is something worth pursuing?

ffulin commented 1 year ago

I think these numbers are pretty interesting. The way compression is used is quite nuanced (there are 4 or 5 different scenarios), and those scenarios involve different tradeoffs (mostly related to network bandwidth, but a few others).

As a quick test, I've integrated Zstd into FASTBuild, generated numbers for some of the main use-cases at varying levels of compression, and am working on putting together my thoughts on what could potentially be changed (I think there are some cases that on the whole might be improved, while others might be harder to call one way or the other).

I'll provide some details of my findings soon.

ffulin commented 6 months ago

I have finally been able to come back to this.

TL;DR:

Details:

I performed a fairly detailed analysis of all the various scenarios in which compression time, decompression time and network bandwidth interact. I ended up making a somewhat elaborate spreadsheet to simulate the scenarios I couldn't test directly due to lack of hardware (like 40 Gbps network connections :))

My conclusion is that:

With that in mind, I've started transitioning the various compression use-cases over to Zstd by default, with defaults tuned for the 1/10 Gbps scenario. For users with atypical use-cases, the defaults can still be overridden, and the existing LZ4 implementation remains available via the negative levels for those who want to reduce CPU use.
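
The level convention described above could be sketched with a hypothetical helper (illustrative only, not FASTBuild's actual code):

```python
def pick_codec(level: int) -> str:
    """Illustrative mapping from a compression level to a codec.

    Mirrors the convention described above: 0 disables compression,
    negative levels keep the existing LZ4 path (lower CPU use),
    and positive levels select Zstd.
    """
    if level == 0:
        return "none"
    return "lz4" if level < 0 else "zstd"

print(pick_codec(-1))  # the old default level maps to lz4
print(pick_codec(1))   # the new default level maps to zstd
```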

The first of these changes is to obj file compression, which impacts the following scenarios:

The new cache compression settings are now as follows, with timings from an x64 obj file from FASTBuild:

```
File           : Tools/FBuild/FBuildTest/Data/TestCompressor/TestObjFile.o
Size           : 4328135
              Compression              Decompression
      Level | Time (ms)  MB/s   Ratio | Time (ms)  MB/s
-------------------------------------------------------
None:  0    |    0.934  4419.2  1.00 |    0.419  9854.7 (memcpy)
LZ4:  -256  |    1.427  2893.2  1.24 |    0.597  6919.5
      -128  |    2.048  2015.1  1.42 |    0.775  5326.7
      -64   |    2.662  1550.7  1.70 |    0.938  4400.1
      -32   |    3.044  1356.1  2.06 |    1.075  3838.6
      -16   |    3.455  1194.8  2.36 |    1.179  3501.3
      -8    |    4.252   970.9  2.80 |    1.219  3387.3
      -4    |    4.386   941.1  3.09 |    1.252  3296.2
      -2    |    4.534   910.3  3.23 |    1.316  3137.5
      -1    |    4.646   888.4  3.29 |    1.346  3066.0 <<-- old default
Zstd:  1    |    4.810   858.2  5.35 |    3.287  1255.7 <<-- new default (replaces level -1 above)
       3    |    6.954   593.6  5.49 |    3.354  1230.5
       6    |   20.984   196.7  5.79 |    3.241  1273.7
       9    |   31.578   130.7  6.00 |    3.300  1250.9
       12   |   73.574    56.1  6.03 |    3.209  1286.4
```
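
As a sanity check on the table's units, the Time and MB/s columns are consistent if MB is read as MiB (2^20 bytes). For example, the Zstd level-1 compression row:

```python
# Recompute the MB/s column from the Size and Time (ms) values above.
size_bytes = 4328135   # TestObjFile.o size from the table
comp_ms    = 4.810     # Zstd level 1 compression time

size_mib = size_bytes / (1024 ** 2)
speed_mib_s = size_mib / (comp_ms / 1000.0)
print(f"{speed_mib_s:.1f} MiB/s")  # close to the 858.2 figure in the table
```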

Compression time with these new defaults is about the same, but the compression ratio is significantly improved: the size of the compressed data is reduced by about 40%. Decompression is slightly slower, but that is compensated for by the reduced bandwidth use.
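
The ~40% size reduction follows directly from the ratio change (old LZ4 -1 ratio 3.29, new Zstd 1 ratio 5.35, both from the table above):

```python
old_ratio = 3.29   # LZ4 -1, old default
new_ratio = 5.35   # Zstd 1, new default

# For the same input, compressed size is proportional to 1/ratio.
reduction = 1.0 - old_ratio / new_ratio
print(f"compressed size reduced by {reduction:.0%}")
```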

Some example scenarios:

Compression and upload:

```
           Old default (LZ4 -1)           New default (Zstd 1)
Network  | Xfer    | Comp   | Total     | Xfer    | Comp   | Total     | Saved
-------------------------------------------------------------------------------
100 Mbps | 3040 ms | 113 ms | 3153 ms   | 1870 ms | 116 ms | 1986 ms   | 1167 ms
1 Gbps   | 304 ms  | 113 ms | 417 ms    | 187 ms  | 116 ms | 303 ms    | 114 ms
10 Gbps  | 34 ms   | 113 ms | 147 ms    | 18 ms   | 116 ms | 134 ms    | 13 ms
```

Downloading and decompression:

```
           Old default (LZ4 -1)           New default (Zstd 1)
Network  | Xfer    | Decomp | Total     | Xfer    | Decomp | Total     | Saved
-------------------------------------------------------------------------------
100 Mbps | 3040 ms | 22 ms  | 3062 ms   | 1870 ms | 78 ms  | 1948 ms   | 1114 ms
1 Gbps   | 304 ms  | 22 ms  | 326 ms    | 187 ms  | 78 ms  | 265 ms    | 61 ms
10 Gbps  | 34 ms   | 22 ms  | 56 ms     | 18 ms   | 78 ms  | 96 ms     | -40 ms *
```

Timings above are for a single CPU core. With a few CPU cores, even the 10 Gbps scenario comes out ahead overall.
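
The upload scenario can be approximated with a simple cost model. The inputs below are assumptions for illustration (roughly 100 MiB of uncompressed data, and an effective 10 MiB/s of throughput per 100 Mbps of nominal bandwidth); the ratio and compression-speed figures come from the settings table above, so small rounding discrepancies remain:

```python
# Approximate single-core cost model for the upload scenario above.
# Assumed inputs (not from the source): 100 MiB of data, and an
# effective 10 MiB/s per 100 Mbps of nominal network bandwidth.
DATA_MIB = 100.0
CODECS = {
    "LZ4 -1": {"ratio": 3.29, "comp_mib_s": 888.4},
    "Zstd 1": {"ratio": 5.35, "comp_mib_s": 858.2},
}

def upload_ms(codec: str, net_mib_s: float) -> float:
    c = CODECS[codec]
    comp_ms = DATA_MIB / c["comp_mib_s"] * 1000.0          # compress
    xfer_ms = DATA_MIB / c["ratio"] / net_mib_s * 1000.0   # send compressed data
    return comp_ms + xfer_ms

for net_mib_s, label in ((10.0, "100 Mbps"), (100.0, "1 Gbps"), (1000.0, "10 Gbps")):
    saved = upload_ms("LZ4 -1", net_mib_s) - upload_ms("Zstd 1", net_mib_s)
    print(f"{label}: Zstd 1 saves ~{saved:.0f} ms per upload")
```

As in the table, the saving shrinks as bandwidth grows, because the fixed compression cost dominates once transfer is cheap.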


As well as improving caching, this improves the speed of returning compressed results from workers. A future change will switch the transfer of preprocessed data to workers to Zstd by default.

Finally, Zstd will likely be used for transferring tool chains to workers.