Open danlark1 opened 1 month ago
Friendly ping
Some holistic decompression speed benchmarks to begin this analysis,
resulting in a pretty long list of measurements,
comparing the decompression speed of this PR with dev
on an i7-9700k (~skylake) @3.6GHz :
___PR 4047___ | ___dev___
compile zstd with gcc-7 │compile zstd with gcc-7
5ed291d2ce0db23f3e21cdce9ad8ab1c zstd │b5fbfb378ec0cab50bb650dc0d3f17f6 zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 710.2 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 723.9 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 870.7 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 881.7 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 388.0 MB/s 1172.8 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 387.8 MB/s 1130.9 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 287.8 MB/s, 1029.9 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 285.8 MB/s, 995.4 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 212.4 MB/s, 1000.0 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 211.3 MB/s, 977.4 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 151.6 MB/s, 794.2 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 150.7 MB/s, 785.4 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 107.4 MB/s, 967.6 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 107.5 MB/s, 949.0 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 84.3 MB/s, 757.6 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 84.0 MB/s, 750.8 MB/s
compile zstd with gcc-8 │compile zstd with gcc-8
ffdfbe49f85bdaded7d6c9fa9a28a36e zstd │38acac6630368e177868a8c7062b38b5 zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 671.3 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 730.9 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 892.7 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 886.7 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 381.2 MB/s, 1141.4 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 384.1 MB/s 1117.5 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 281.1 MB/s, 989.9 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 284.9 MB/s, 975.1 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 203.7 MB/s, 983.2 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 210.9 MB/s, 980.2 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 144.7 MB/s, 780.7 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 150.4 MB/s, 796.4 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 98.6 MB/s, 971.5 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 102.6 MB/s, 961.9 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 79.0 MB/s, 765.6 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 81.1 MB/s, 769.1 MB/s
compile zstd with gcc-9 │compile zstd with gcc-9
fe7174be9085dd2de14f2313f53f38eb zstd │7552b0103fd0ac632503026e2f350928 zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 742.9 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 658.3 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 860.9 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 918.3 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 376.7 MB/s, 1156.8 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 374.6 MB/s, 1172.4 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 277.2 MB/s, 1010.8 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 277.6 MB/s, 1011.6 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 208.4 MB/s, 978.3 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 211.9 MB/s, 1021.1 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 149.3 MB/s, 773.6 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 151.4 MB/s, 814.5 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 115.5 MB/s, 955.4 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 113.8 MB/s, 1003.3 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 90.7 MB/s, 748.7 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 88.9 MB/s, 789.8 MB/s
compile zstd with gcc-10 │compile zstd with gcc-10
e05d07e00b848ce7b2cfbabf915c3a1d zstd │adce803b576bc6a9b0f695ed8826889a zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 728.1 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 722.7 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 887.3 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 910.8 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 376.6 MB/s, 1121.4 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 375.8 MB/s, 1136.2 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 277.6 MB/s, 981.9 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 278.6 MB/s, 994.9 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 209.9 MB/s, 976.9 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 217.5 MB/s, 1002.5 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 149.7 MB/s, 788.9 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 155.6 MB/s, 821.3 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 113.4 MB/s, 963.3 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 113.9 MB/s, 987.0 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 89.1 MB/s, 768.8 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 88.8 MB/s, 795.8 MB/s
compile zstd with gcc-11 │compile zstd with gcc-11
e4f2395a06eb34943d9842230b1487bd zstd │d6315f4985d2cedeba8aae3c50001723 zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 746.1 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 713.6 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 900.6 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 906.6 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 377.1 MB/s, 1165.0 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 381.6 MB/s 1144.1 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 277.4 MB/s, 1028.4 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 282.8 MB/s, 993.1 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 211.7 MB/s, 1004.8 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 211.9 MB/s, 1001.5 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 152.1 MB/s, 813.7 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 152.7 MB/s, 814.7 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 112.1 MB/s, 994.8 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 112.7 MB/s, 994.2 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 87.6 MB/s, 793.9 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 87.9 MB/s, 798.4 MB/s
compile zstd with clang-6.0 │compile zstd with clang-6.0
1326dd6d47f0cf1038405509e4dcc8a2 zstd │b9ae98f025959f945f496a6c2399fc44 zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 766.2 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 742.0 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 912.0 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 908.0 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 388.7 MB/s 1156.9 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 387.4 MB/s 1172.2 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 285.7 MB/s, 1021.7 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 286.2 MB/s, 1025.7 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 210.6 MB/s, 1021.5 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 205.6 MB/s, 1017.3 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 151.5 MB/s, 843.1 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 148.9 MB/s, 819.9 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 119.7 MB/s, 1004.4 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 117.2 MB/s, 993.4 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 93.8 MB/s, 816.8 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 91.2 MB/s, 785.9 MB/s
compile zstd with clang-7 │compile zstd with clang-7
ba63e5b28b30123584877e6e54ec4a30 zstd │83696df6ca4a6135c084ae318ba83f46 zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 753.8 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 732.8 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 939.1 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 935.2 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 383.2 MB/s 1196.9 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 375.9 MB/s, 1227.0 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 282.1 MB/s, 1061.3 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 277.6 MB/s, 1075.5 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 215.1 MB/s, 1054.6 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 206.7 MB/s, 1057.7 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 153.3 MB/s, 873.9 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 148.0 MB/s, 845.2 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 117.0 MB/s, 1033.6 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 117.0 MB/s, 1045.3 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 91.0 MB/s, 840.5 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 91.5 MB/s, 829.1 MB/s
compile zstd with clang-8 │compile zstd with clang-8
871bd27ac7ee239cf9eacb8f246a2978 zstd │dd0ce6aa5779a3f6d7fdaedf981ba91c zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 781.8 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 742.0 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 939.7 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 957.0 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 395.4 MB/s 1196.2 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 391.6 MB/s 1215.5 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 291.8 MB/s, 1058.9 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 288.4 MB/s, 1076.2 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 212.8 MB/s, 1056.7 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 207.0 MB/s, 1061.3 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 153.4 MB/s, 872.9 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 149.2 MB/s, 863.1 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 116.7 MB/s, 1021.9 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 120.6 MB/s, 1052.2 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 89.6 MB/s, 823.9 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 93.9 MB/s, 847.0 MB/s
compile zstd with clang-9 │compile zstd with clang-9
c18dfa5ea7e05ea520ace70cb9943911 zstd │351e250243d8a66260a9d863737a8ba2 zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 713.5 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 757.6 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 886.0 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 929.9 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 385.8 MB/s 1137.6 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 380.1 MB/s, 1179.1 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 283.1 MB/s, 985.8 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 281.1 MB/s, 1028.8 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 207.2 MB/s, 988.0 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 211.0 MB/s, 1032.1 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 149.3 MB/s, 795.3 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 151.6 MB/s, 836.2 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 117.3 MB/s, 975.7 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 114.7 MB/s, 1015.5 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 92.7 MB/s, 774.3 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 91.4 MB/s, 810.8 MB/s
compile zstd with clang-10 │compile zstd with clang-10
93d31c296b04e79ab3f727ced8e0c316 zstd │31d95bf85a5c4c80762b2d65c114b21e zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 712.9 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 732.4 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 916.4 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 936.1 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 396.3 MB/s 1168.8 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 391.4 MB/s 1193.7 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 291.9 MB/s, 1016.3 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 289.9 MB/s, 1050.0 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 209.7 MB/s, 1022.1 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 210.7 MB/s, 1049.8 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 151.6 MB/s, 824.2 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 152.5 MB/s, 857.4 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 119.7 MB/s, 1003.5 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 116.5 MB/s, 1016.8 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 93.1 MB/s, 798.0 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 90.8 MB/s, 813.3 MB/s
compile zstd with clang-11 │compile zstd with clang-11
bf96e2cd714cae9fc6b803952a44499b zstd │aab7a34b2c2d36d77b3255b8779cde5c zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 746.3 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 767.6 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 937.4 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 959.0 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 382.5 MB/s 1205.4 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 381.8 MB/s 1240.4 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 283.9 MB/s, 1056.9 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 283.1 MB/s, 1080.7 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 203.7 MB/s, 1055.9 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 205.6 MB/s, 1080.3 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 147.8 MB/s, 859.0 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 148.7 MB/s, 864.2 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 116.2 MB/s, 1033.9 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 119.3 MB/s, 1068.0 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 90.0 MB/s, 828.0 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 92.5 MB/s, 844.6 MB/s
compile zstd with clang-12 │compile zstd with clang-12
211d0fcffd09f0b1203b34b48b14c3bc zstd │1ceac2fcef50a68677e0f0484df1f84c zstd
3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 764.5 MB/s │ 3#enwik9.L22.zst :1000000000 -> 215031773 (x4.650), 0.00 MB/s, 744.2 MB/s
3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 932.1 MB/s │ 3#lesia.tar.L19.zst : 211957760 -> 52990423 (x4.000), 0.00 MB/s, 909.5 MB/s
1#silesia.tar : 211957760 -> 73422067 (x2.887), 397.6 MB/s 1192.5 MB/s │ 1#silesia.tar : 211957760 -> 73422067 (x2.887), 399.4 MB/s 1162.0 MB/s
1#enwik8 : 100000000 -> 40667563 (x2.459), 291.7 MB/s, 1045.6 MB/s │ 1#enwik8 : 100000000 -> 40667563 (x2.459), 295.5 MB/s, 1003.1 MB/s
3#silesia.tar : 211957760 -> 66523984 (x3.186), 217.8 MB/s, 1046.5 MB/s │ 3#silesia.tar : 211957760 -> 66523984 (x3.186), 207.7 MB/s, 1007.7 MB/s
3#enwik8 : 100000000 -> 35461800 (x2.820), 157.0 MB/s, 847.7 MB/s │ 3#enwik8 : 100000000 -> 35461800 (x2.820), 148.5 MB/s, 803.4 MB/s
5#silesia.tar : 211957760 -> 63040521 (x3.362), 118.3 MB/s, 1026.7 MB/s │ 5#silesia.tar : 211957760 -> 63040521 (x3.362), 116.0 MB/s, 988.5 MB/s
5#enwik8 : 100000000 -> 33702880 (x2.967), 92.3 MB/s, 819.5 MB/s │ 5#enwik8 : 100000000 -> 33702880 (x2.967), 90.7 MB/s, 777.6 MB/s
As usual, it is pretty difficult to make sense of, due to the sheer quantity of data.
What's clear is that it's not always positive. But this is probably due to reasons outside the responsibility of this PR, typically random instruction alignment differences that result in measurable speed differences.
So let's summarize:
compilers | PR decompression speed impact |
---|---|
`gcc-9`, `clang-9`, `clang-10`, `clang-11` | worse, all levels |
`gcc-7`, `gcc-8`, `gcc-10` | worse at level 22, better at lower levels |
`clang-6`, `clang-7`, `clang-8` | better at level 22, worse at lower levels |
`gcc-11`, `clang-12` | better, all levels |
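To put a number on "better" or "worse" for any single row of the measurements above, the relative change of the PR against `dev` can be computed as follows. This is only a minimal sketch: the helper names `relative_change_percent` and `report` are hypothetical and are not part of zstd or of its benchmark tooling.

```c
#include <stdio.h>

/* Relative change of the PR against dev, in percent.
 * Positive means the PR decompresses faster than dev. */
double relative_change_percent(double pr_mbps, double dev_mbps)
{
    return (pr_mbps - dev_mbps) / dev_mbps * 100.0;
}

/* Prints one comparison line, e.g. for a single row of the listing. */
void report(const char* label, double pr_mbps, double dev_mbps)
{
    printf("%-24s %+.2f%%\n", label,
           relative_change_percent(pr_mbps, dev_mbps));
}
```

For instance, `report("gcc-7, enwik9 L22", 710.2, 723.9)` uses the decompression speeds of the first `gcc-7` row and yields a negative value, while `report("gcc-7, silesia.tar L1", 1172.8, 1130.9)` yields a positive one, in line with the `gcc-7` entry of the summary.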
There is an interesting inversion between `gcc` and `clang` when it comes to the impact at level 22 vs other levels.
Even then, I'm not sure that it's really related to this PR: level 22 uses a different decompression function, because potential long-distance matches trigger prefetching, which is absent from lower levels. Since it's a different function, instruction alignment is different, which may explain the performance differences. It's just weird that, for each compiler, the direction is always the same across multiple versions.
So sure, some compiler versions are clearly better, but others aren't, hence it's not a clear win.
The best argument in favor of this PR so far is the godbolt trace, which shows a neat reduction in assembly instruction count for `BIT_readBits()`. Sure, instruction count is not everything, but in this case there is no hidden branch, additional fetch, or long instruction, so the reduction in the number of instructions is expected to be beneficial for performance.
By the way, I also noticed that, while the new formulation of `BIT_readBits()` is more concise with `BMI2`, it's not the case when `BMI2` is not available:
// new formulation, with BMI2
BIT_readBits(BIT_DStream_t*, unsigned int): # @BIT_readBits(BIT_DStream_t*, unsigned int)
movl 8(%rdi), %ecx
subl %esi, %ecx
shrxq %rcx, (%rdi), %rax
bzhiq %rsi, %rax, %rax
movl %ecx, 8(%rdi)
retq
// old formulation, with BMI2
BIT_readBits(BIT_DStream_t*, unsigned int): # @BIT_readBits(BIT_DStream_t*, unsigned int)
movl 8(%rdi), %ecx
addl %esi, %ecx
movl %ecx, %eax
negb %al
shrxq %rax, (%rdi), %rax
bzhiq %rsi, %rax, %rax
movl %ecx, 8(%rdi)
retq
// Old formulation, no BMI support
BIT_readBits(BIT_DStream_t*, unsigned int): # @BIT_readBits(BIT_DStream_t*, unsigned int)
movq (%rdi), %rdx
movl 8(%rdi), %r8d
addl %esi, %r8d
movl %r8d, %ecx
negb %cl
shrq %cl, %rdx
movq $-1, %rax
movl %esi, %ecx
shlq %cl, %rax
movl %r8d, 8(%rdi)
notq %rax
andq %rdx, %rax
retq
BIT_readBitsFast(BIT_DStream_t*, unsigned int): # @BIT_readBitsFast(BIT_DStream_t*, unsigned int)
movq (%rdi), %rax
movl 8(%rdi), %edx
movl %edx, %ecx
shlq %cl, %rax
movl %esi, %ecx
negb %cl
shrq %cl, %rax
addl %esi, %edx
movl %edx, 8(%rdi)
retq
// New formulation, no BMI support
BIT_readBits(BIT_DStream_t*, unsigned int): # @BIT_readBits(BIT_DStream_t*, unsigned int)
movq (%rdi), %r8
movl 8(%rdi), %edx
subl %esi, %edx
movl %edx, %ecx
shrq %cl, %r8
movq $-1, %rax
movl %esi, %ecx
shlq %cl, %rax
notq %rax
andq %r8, %rax
movl %edx, 8(%rdi)
retq
It's about on par with the old formulation of `BIT_readBits()`, but heavier than `BIT_readBitsFast()`.
I guess that's the scenario for which `BIT_readBitsFast()` matters, and I would expect the old formulation to win in this case when `BIT_readBitsFast()` is employed.
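To make the comparison concrete, here is a C reading of the two old, non-BMI2 listings above. It is a sketch reconstructed from the assembly only, not the actual zstd source: `SketchStream` and the `sketch_*` names are hypothetical, and the struct keeps just the two fields the listings touch. The `& 63` on shift counts mirrors what `shrq %cl` / `shlq %cl` do in hardware and avoids undefined behavior in C.

```c
#include <stdint.h>

/* Hypothetical stand-in for BIT_DStream_t (only the fields used above). */
typedef struct {
    uint64_t bitContainer;   /* loaded from (%rdi)  */
    unsigned bitsConsumed;   /* loaded from 8(%rdi) */
} SketchStream;

/* Old BIT_readBits(), no BMI2: shift down, then mask with ~(-1 << nbBits).
 * Valid for nbBits in [0, 63]. */
uint64_t sketch_readBits(SketchStream* s, unsigned nbBits)
{
    unsigned const total = s->bitsConsumed + nbBits;
    uint64_t const shifted = s->bitContainer >> ((0u - total) & 63);
    uint64_t const mask = ~(~0ULL << (nbBits & 63));
    s->bitsConsumed = total;
    return shifted & mask;
}

/* Old BIT_readBitsFast(): shift consumed bits out the top, then shift
 * down; no mask needed. Only correct when nbBits >= 1. */
uint64_t sketch_readBitsFast(SketchStream* s, unsigned nbBits)
{
    uint64_t const top = s->bitContainer << (s->bitsConsumed & 63);
    s->bitsConsumed += nbBits;
    return top >> ((0u - nbBits) & 63);
}

/* Returns 1 if both variants agree for every valid (consumed, nbBits). */
int sketch_fast_matches_regular(uint64_t container)
{
    for (unsigned consumed = 0; consumed < 64; consumed++) {
        for (unsigned nbBits = 1; nbBits < 64 && consumed + nbBits <= 64; nbBits++) {
            SketchStream a = { container, consumed };
            SketchStream b = { container, consumed };
            if (sketch_readBits(&a, nbBits) != sketch_readBitsFast(&b, nbBits)) return 0;
            if (a.bitsConsumed != b.bitsConsumed) return 0;
        }
    }
    return 1;
}
```

The fast variant replaces the mask construction (`movq $-1`, `shlq`, `notq`, `andq`) with a second shift, which is where its shorter listing comes from; the price is that it cannot handle `nbBits == 0`.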
Anyway, the question is: what happens when `BMI2` is not available?
Does the code revert back to the old formulation? Or does only the new formulation remain available?
I guess the topic might be similar for non-x64 platforms, without an equivalent of `BMI2` available.
Edit: As a verification test, I compared `dev` and `PR4047` with `-mno-bmi2` and `DYNAMIC_BMI2` disabled, on the same i7-9700k platform, and indeed, the comparison is largely favorable to the old formulation in this case.
I've looked into this as well, and I think something else that's worth discussing is the usage pattern of bitstreams and reloads. Specifically (for example in FSE / Huffman) we'd make ~4 reads from the stream and then reload it (when decoding we sometimes make fewer reads per reload).
Here (godbolt) we can see an example that reads 4 elements from the bitstream and reloads it, and the base version is significantly shorter, mainly due to the reload. Interestingly, the reading operation actually takes the same number of opcodes in both versions, as the base version can add two registers into a third one using `lea`, while the PR version requires both `mov` and `sub`.
Ultimately, it's not clear to me that one is better than the other, and my guess would be that it really depends on context and pattern of usage.
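The read-then-reload pattern can be sketched in C to show where the two bookkeeping schemes differ. Everything here is an assumption made for illustration: the struct layouts, the function names, and in particular `pr_reload()`, which is only one plausible way to reload when tracking `bitsLeft` and is not taken from the PR. The base reload follows the usual "step back by whole consumed bytes, keep the leftover bits" scheme for a backward bitstream, and the sketch ignores all end-of-stream handling.

```c
#include <stdint.h>
#include <string.h>

/* 64-bit load; byte order follows the host (little-endian on x86-64). */
static uint64_t load64(const unsigned char* p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

typedef struct { uint64_t container; unsigned bitsConsumed; const unsigned char* ptr; } BaseStream;
typedef struct { uint64_t container; unsigned bitsLeft;     const unsigned char* ptr; } PrStream;

/* Base: count consumed bits; the shift is (64 - bitsConsumed) mod 64. */
uint64_t base_read(BaseStream* s, unsigned n)   /* n in [1, 63] */
{
    s->bitsConsumed += n;
    return (s->container >> ((0u - s->bitsConsumed) & 63)) & ((1ULL << n) - 1);
}

void base_reload(BaseStream* s)
{
    s->ptr -= s->bitsConsumed >> 3;   /* step back by whole bytes */
    s->bitsConsumed &= 7;             /* keep the partial byte    */
    s->container = load64(s->ptr);
}

/* PR-style: count remaining bits; the shift is bitsLeft itself. */
uint64_t pr_read(PrStream* s, unsigned n)       /* n in [1, 63] */
{
    s->bitsLeft -= n;
    return (s->container >> (s->bitsLeft & 63)) & ((1ULL << n) - 1);
}

/* Assumed reload: converts back to "consumed" first, hence extra work. */
void pr_reload(PrStream* s)
{
    unsigned const consumed = 64 - s->bitsLeft;
    s->ptr -= consumed >> 3;
    s->bitsLeft = 64 - (consumed & 7);
    s->container = load64(s->ptr);
}

/* 4 reads of 5 bits, then a reload, repeated; returns 1 if both agree. */
int reload_sketches_agree(void)
{
    unsigned char buf[32];
    for (unsigned i = 0; i < sizeof buf; i++) buf[i] = (unsigned char)(i * 37u + 11u);
    BaseStream b = { load64(buf + 24), 0,  buf + 24 };
    PrStream   p = { load64(buf + 24), 64, buf + 24 };
    for (int round = 0; round < 4; round++) {
        for (int r = 0; r < 4; r++)
            if (base_read(&b, 5) != pr_read(&p, 5)) return 0;
        base_reload(&b);
        pr_reload(&p);
        if (b.ptr != p.ptr || b.container != p.container) return 0;
    }
    return 1;
}
```

In this sketch the read gets simpler with `bitsLeft`, but the reload has to recover the consumed count before it can step the pointer, which is consistent with the observation that the base version ends up shorter once a reload is included.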
This change improves performance on x86 with BMI2: previously, each read had to subtract from 64 to compute the shift, but if we track `bitsLeft` instead, we only subtract once.
https://gcc.godbolt.org/z/dMY3j6rEa
Before:
After:
Results: on Arm processors I got a regression of up to 1%, but on Intel Xeon I got really nice uplifts. AMD was less sensitive but also gained 1-2%. The effect is much more visible on well-compressed data, where we change FSE states a lot. clang is clang 16, gcc is gcc 13.2.0.
Intel(R) Xeon(R) CPU @ 2.00GHz (Skylake)
Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
AMD EPYC 7B13 Zen3
In https://github.com/google/fleetbench where we hand out our production corpora, compression ratios, levels, statistics for our top 10 biggest workloads, we have the following benchmarks (CPU per byte):
Intel Skylake:
AMD Zen3:
Hope you can benchmark on your own and validate that it's better :)