Sanmayce opened this issue 8 years ago
Thanks for the thorough analysis! gcc does seem to have a problem generating tight code for LZSSE, although I'm surprised Intel is that far ahead! I'll have to have a more thorough look at the assembly you generated on the weekend to see what is causing the slowdown (although I suspect you are right and the Intel code is doing a much better job with the registers).
I'd say that the slow result with the small file alice29 could be due to timer resolution or it could be due to some fixed overhead/system overhead. One thing I've found is that if the OS hasn't allocated the memory pages for the buffer yet (just mapped them), then the initial page fault to allocate memory can be quite a large amount of overhead on those small files.
>Thanks for the thorough analysis!
My pleasure. A thorough analysis is yet to come; I have only two old laptops at my disposal, so it will take some 2-3 months to give a much better picture of textual [de]compression. I intend to test ~400 files, ranging from 25 bytes up to 1400MB...
>gcc does seem to have a problem generating tight code for LZSSE, although I'm surprised Intel is that far ahead!
GCC is good, but in my experience Intel distributes registers more effectively. When even a single register is short and values spill, stack accesses hurt speed, and since you utilize nearly the whole squad of XMM registers at once it is even more so - the accumulated penalties matter! I think your etude should not be presented with just any compiler/options; its essence gets lost ... in the translation.
>I'll have to have a more thorough look at the assembly you generated on the weekend to see what is causing the slowdown (although I suspect you are right and the Intel code is doing a much better job with the registers).
My wish is for textual-compression fans to have one fully operational console tool (even in its simplest form, i.e. file-to-file), just to have one PARAGON performer and feel what the speed religion is all about.
>I'd say that the slow result with the small file alice29 could be due to timer resolution or it could be due to some fixed overhead/system overhead. One thing I've found is that if the OS hasn't allocated the memory pages for the buffer yet (just mapped them), then the initial page fault to allocate memory can be quite a large amount of overhead on those small files.
Here I admit that, despite the "triviality" of such measuring, I cannot sense what is going on; maybe I will ask Hamid or some guys at Intel's forum for help. Anyone?!
Hi again. I'm so sorry for my first draft - a stupid bug in the stats is now fixed. All in all, I was swayed by the rush to upload it; now I have had time to finish the draft. The package has been re-uploaded, with a working Nakamichi & LZSSE2 executable pair.
Also, I believe you were right about that 'malloc' whose pages were not in reality fully committed; I moved it well before the benchmarking, and lo, it reports okay. Tonight I will run some 10 more files, gradually rising up to ~50MB, just to make the quick view a bit deeper.
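In case it helps others, here is a minimal sketch (my illustration, not the actual Nakamichi source) of paying the first-touch page faults Conor describes outside the timed region, by touching every page of the buffer before starting the clock:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t size = (size_t)32 << 20;      /* e.g. a 32 MB target buffer */
    char *buf = malloc(size);
    if (buf == NULL) return 1;

    /* Touch every page once: the OS commits the physical pages here,
       not inside the timed region (4096 is the usual x86 page size).
       A memset over the whole buffer works too. */
    for (size_t i = 0; i < size; i += 4096)
        buf[i] = 0;

    clock_t start = clock();
    /* ... the timed decompression into buf would go here ... */
    clock_t end = clock();
    printf("%.3f s\n", (double)(end - start) / CLOCKS_PER_SEC);

    free(buf);
    return 0;
}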
Another dumb mistake of mine was the statement about 100x faster compression; in fact it is 1000x, if not more.
On my laptop with a Core 2 Q9550s @2.83GHz, alice29.txt is decompressed much faster; I believe the report is okay (given that I had many tasks in the tray):
LZSSE2: RAM-to-RAM performance: 640 MB/s.
Also, I added a few more strong compressors to the 'bundle'; the full log for 'alice29.txt':
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>dir alice29.txt/b>DIRLIST
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>type COMPRESS_all.bat
rem dir *.*/b/a-d>DIRLIST
FOR /F %%G IN (DIRLIST) DO CALL Bundle_of_15_compressors.bat %%G
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>COMPRESS_all.bat
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>rem dir *.*/b/a-d>DIRLIST
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>FOR /F %G IN (DIRLIST) DO CALL Bundle_of_15_compressors.bat %G
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>CALL Bundle_of_15_compressors.bat alice29.txt
Performers:
- LZ4 for Windows 32-bits v1.4, by Yann Collet (Sep 17 2013).
- 7-Zip (A) 9.20, Copyright (c) 1999-2010 Igor Pavlov, 2010-11-18.
- bsc, Block Sorting Compressor, Version 3.1.0. Copyright (c) 2009-2012 Ilya Grebnov, 8 July 2012.
- lzturbo 1.2 Copyright (c) 2007-2014 Hamid Buzidi, Aug 11 2014.
- zpaq v7.05 journaling archiver, compiled Apr 17 2015, http://mattmahoney.net/zpaq
- CABARC, Microsoft (R) Cabinet Tool - Version 5.1.2600.0, Copyright (c) Microsoft Corporation.
- Compress, version: (N)compress 4.2.4.4, compiled: Fri, Aug 23, 2013 11:56:09. Authors: Peter Jannesen, Dave Mack, Spencer W. Thomas, Jim McKie, Steve Davies, Ken Turkowski, James A. Woods, Joe Orost.
- zstd command line interface 64-bits v0.5.1, by Yann Collet
- xz (XZ Utils) 5.2.1, liblzma 5.2.1, XZ Utils home page: http://tukaani.org/xz/
- brotli, Feb-10-2016 source
- Nakamichi 'Tengu-Tsuyo', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
- LZSSE2(FASTEST TEXTUAL decompressor), Copyright (c) 2016, Conor Stokes
- GLZA v0.4.1, Copyright 2014-2016 Kennon Conrad
- LZ5 command line interface 64-bit v1.4 by Y.Collet and P.Skibinski (Feb 11 2016)
- PPMd (Fast PPMII compressor for textual data) var.I rev.2, Dmitry Shkarin
...
Directory of D:\TEXTUAL_MADNESS\_The_Usual_Suspects
02/23/2016 11:25 PM 152,089 alice29.txt
04/03/2016 04:27 PM 48,575 alice29.txt.128MB.7z
04/03/2016 04:27 PM 63,013 alice29.txt.256MB.lzturbo12-19.lzt
04/03/2016 04:27 PM 62,436 alice29.txt.256MB.lzturbo12-29.lzt
04/03/2016 04:27 PM 50,668 alice29.txt.256MB.lzturbo12-39.lzt
04/03/2016 04:28 PM 42,210 alice29.txt.glze
04/03/2016 04:27 PM 46,685 alice29.txt.L11_W24.brotli
04/03/2016 04:28 PM 61,012 alice29.txt.l15_256MB.lz5
04/03/2016 04:28 PM 56,526 alice29.txt.L17.LZSSE2
04/03/2016 04:27 PM 49,710 alice29.txt.L21.zst
04/03/2016 04:27 PM 63,705 alice29.txt.L9.lz4
04/03/2016 04:27 PM 51,707 alice29.txt.L9.zip
04/03/2016 04:27 PM 49,538 alice29.txt.LZX21.cab
04/03/2016 04:28 PM 38,682 alice29.txt.m256.o16.ppmd
04/03/2016 04:27 PM 152,918 alice29.txt.method08.zpaq
04/03/2016 04:27 PM 59,915 alice29.txt.method28.zpaq
04/03/2016 04:27 PM 37,602 alice29.txt.method58.zpaq
04/03/2016 04:27 PM 54,278 alice29.txt.MSZIP.cab
04/03/2016 04:27 PM 40,998 alice29.txt.ST6Block256.bsc
04/03/2016 04:28 PM 73,235 alice29.txt.Tengu-Tsuyo.Nakamichi
04/03/2016 04:27 PM 48,528 alice29.txt.xz
04/03/2016 04:27 PM 62,247 alice29.txt.Z
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>
I had time to run only 4 testfiles on an i5-2430M @3GHz, DDR3 @666MHz:
D:\test>dir
Volume in drive D is COMETA_V1
Volume Serial Number is 2E2F-7737
Directory of D:\test
04/03/2016 02:51 PM <DIR> .
04/03/2016 02:51 PM <DIR> ..
02/23/2016 12:25 PM 152,089 alice29.txt
04/03/2016 02:51 PM 56,526 alice29.txt.L17.LZSSE2
04/03/2016 02:51 PM 73,235 alice29.txt.Nakamichi
03/27/2016 05:24 AM 6,225,580 Complete_Works_of_Seneca_-_Lucius_Annaeus_Seneca.txt
02/23/2016 12:25 PM 10,192,446 dickens
04/03/2016 04:56 AM 146,432 Nakamichi_Tengu-Tsuyo_XMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe
02/23/2016 12:25 PM 12,030,464 New_York_Times_Bestsellers_-_August_2015_-_20_ebooks.tar
02/23/2016 12:25 PM 3,265,536 University_of_Canterbury_The_Calgary_Corpus.tar
8 File(s) 32,142,308 bytes
2 Dir(s) 13,557,866,496 bytes free
D:\test>Nakamichi_Tengu-Tsuyo_XMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe alice29.txt
Nakamichi 'Tengu-Tsuyo', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: Conor Stokes' LZSSE2(FASTEST Textual Decompressor) is embedded, all credits along with many thanks go to him.
Limitation: Uncompressed 8192 MB of filesize.
Current priority class is HIGH_PRIORITY_CLASS.
Allocating Source-Buffer 0 MB ...
Allocating Target-Buffer 32 MB ...
Allocating Verification-Buffer 0 MB ...
Compressing 152,089 bytes ...
-; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 31
NumberOf(Tiny)Matches[Tiny]Window (4): 161
NumberOf(Short)Matches[Tiny]Window (8): 98
NumberOf(Medium)Matches[Tiny]Window (12): 21
RAM-to-RAM performance: 23 KB/s.
Compressed to 73,235 bytes.
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x1366,78ee
Target-file-Hash(FNV1A_YoshimitsuTRIAD) = 0xa374,22ff
Decompressing 73,235 (being the compressed stream) bytes ...
RAM-to-RAM performance: 1152 MB/s.
Verification (input and output sizes match) OK.
Verification (input and output blocks match) OK.
LZSSE2: Compressing with LZSSE2 (level 17) 152,089 bytes ...
LZSSE2: Compressed to 56,526 bytes.
LZSSE2: RAM-to-RAM performance: 8736 KB/s.
LZSSE2: Decompressing 56,526 bytes (being the compressed stream) ...
LZSSE2: RAM-to-RAM performance: 18560 MB/s.
LZSSE2: Verification (input and output sizes match) OK.
LZSSE2: Verification (input and output blocks match) OK.
D:\test>Nakamichi_Tengu-Tsuyo_XMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe University_of_Canterbury_The_Calgary_Corpus.tar
Nakamichi 'Tengu-Tsuyo', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: Conor Stokes' LZSSE2(FASTEST Textual Decompressor) is embedded, all credits along with many thanks go to him.
Limitation: Uncompressed 8192 MB of filesize.
Current priority class is HIGH_PRIORITY_CLASS.
Allocating Source-Buffer 3 MB ...
Allocating Target-Buffer 35 MB ...
Allocating Verification-Buffer 3 MB ...
Compressing 3,265,536 bytes ...
/; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 2966
NumberOf(Tiny)Matches[Tiny]Window (4): 5396
NumberOf(Short)Matches[Tiny]Window (8): 3633
NumberOf(Medium)Matches[Tiny]Window (12): 27429
RAM-to-RAM performance: 5 KB/s.
Compressed to 1,333,349 bytes.
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0xe6d1,9b78
Target-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x2a73,4318
Decompressing 1,333,349 (being the compressed stream) bytes ...
RAM-to-RAM performance: 1024 MB/s.
Verification (input and output sizes match) OK.
Verification (input and output blocks match) OK.
LZSSE2: Compressing with LZSSE2 (level 17) 3,265,536 bytes ...
LZSSE2: Compressed to 1,142,536 bytes.
LZSSE2: RAM-to-RAM performance: 463 KB/s.
LZSSE2: Decompressing 1,142,536 bytes (being the compressed stream) ...
LZSSE2: RAM-to-RAM performance: 2048 MB/s.
LZSSE2: Verification (input and output sizes match) OK.
LZSSE2: Verification (input and output blocks match) OK.
D:\test>Nakamichi_Tengu-Tsuyo_XMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe Complete_Works_of_Seneca_-_Lucius_Annaeus_Seneca.txt
Nakamichi 'Tengu-Tsuyo', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: Conor Stokes' LZSSE2(FASTEST Textual Decompressor) is embedded, all credits along with many thanks go to him.
Limitation: Uncompressed 8192 MB of filesize.
Current priority class is HIGH_PRIORITY_CLASS.
Allocating Source-Buffer 5 MB ...
Allocating Target-Buffer 37 MB ...
Allocating Verification-Buffer 5 MB ...
Compressing 6,225,580 bytes ...
-; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 57
NumberOf(Tiny)Matches[Tiny]Window (4): 5321
NumberOf(Short)Matches[Tiny]Window (8): 586
NumberOf(Medium)Matches[Tiny]Window (12): 18
RAM-to-RAM performance: 5 KB/s.
Compressed to 2,577,921 bytes.
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x0736,474e
Target-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x4bb3,2a5b
Decompressing 2,577,921 (being the compressed stream) bytes ...
RAM-to-RAM performance: 1024 MB/s.
Verification (input and output sizes match) OK.
Verification (input and output blocks match) OK.
LZSSE2: Compressing with LZSSE2 (level 17) 6,225,580 bytes ...
LZSSE2: Compressed to 2,388,411 bytes.
LZSSE2: RAM-to-RAM performance: 1504 KB/s.
LZSSE2: Decompressing 2,388,411 bytes (being the compressed stream) ...
LZSSE2: RAM-to-RAM performance: 1920 MB/s.
LZSSE2: Verification (input and output sizes match) OK.
LZSSE2: Verification (input and output blocks match) OK.
D:\test>Nakamichi_Tengu-Tsuyo_XMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe dickens
Nakamichi 'Tengu-Tsuyo', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: Conor Stokes' LZSSE2(FASTEST Textual Decompressor) is embedded, all credits along with many thanks go to him.
Limitation: Uncompressed 8192 MB of filesize.
Current priority class is HIGH_PRIORITY_CLASS.
Allocating Source-Buffer 9 MB ...
Allocating Target-Buffer 41 MB ...
Allocating Verification-Buffer 9 MB ...
Compressing 10,192,446 bytes ...
\; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 128
NumberOf(Tiny)Matches[Tiny]Window (4): 6259
NumberOf(Short)Matches[Tiny]Window (8): 712
NumberOf(Medium)Matches[Tiny]Window (12): 89
RAM-to-RAM performance: 5 KB/s.
Compressed to 3,964,930 bytes.
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x056f,7c86
Target-file-Hash(FNV1A_YoshimitsuTRIAD) = 0xb7f0,b59f
Decompressing 3,964,930 (being the compressed stream) bytes ...
RAM-to-RAM performance: 1152 MB/s.
Verification (input and output sizes match) OK.
Verification (input and output blocks match) OK.
LZSSE2: Compressing with LZSSE2 (level 17) 10,192,446 bytes ...
LZSSE2: Compressed to 3,872,373 bytes.
LZSSE2: RAM-to-RAM performance: 6017 KB/s.
LZSSE2: Decompressing 3,872,373 bytes (being the compressed stream) ...
LZSSE2: RAM-to-RAM performance: 1920 MB/s.
LZSSE2: Verification (input and output sizes match) OK.
LZSSE2: Verification (input and output blocks match) OK.
D:\test>
Comparing with TurboBench I see no discrepancy:
1142536 2048 MB/s lzsse2 17 !outside TurboBench!
1143886 35.0 6.18 1894.16 lzsse2 16 University_of_Canterbury_The_Calgary_Corpus.tar.tbb
2388411 1920 MB/s lzsse2 17 !outside TurboBench!
2391442 38.4 5.45 1961.43 lzsse2 16 Complete_Works_of_Seneca_-_Lucius_Annaeus_Seneca.txt.tbb
3872373 1920 MB/s lzsse2 17 !outside TurboBench!
3872652 38.0 5.43 1799.20 lzsse2 16 dickens.tbb
Except for the 'small file':
56526 18560 MB/s lzsse2 17 !outside TurboBench!
56530 37.2 5.64 1810.58 lzsse2 16 alice29.txt.tbb
Food for thought: 640 MB/s vs 18560 MB/s (the latter on a clean Windows 7 with no tasks in the tray). The measurements differ, yet I think they are correct now - or not?
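For scale (my arithmetic, not from the logs above): one pass over alice29's 152,089 bytes at 18,560 MB/s takes 152,089 / (18,560 x 1,048,576) ≈ 8 microseconds, far below the millisecond-or-worse tick of clock() on Windows, so a single-shot measurement of such a small file is dominated by timer granularity and fixed overhead, exactly as Conor suggested.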
Hi, once more I wanted to share what awesomeness in action looks like.
Benchmarking 'TDELCC' a.k.a. The-Definitive-English-Language-Compression-Corpus, a smashdown, https://github.com/Sanmayce/Nakamichi
Another iteration of Sanmayce's decompression showdown 'FULG', revision 4, all performers are included in the package:
128t_opaque_GS.png: https://drive.google.com/file/d/1wPVbSSArPFd7_JyoOMS52Sx-HHsYQK9k/view?usp=sharing
Fulg-Textual_[De]Compression_Showdown_v4.tar.gz: https://drive.google.com/file/d/1F3yxgHrLNlrAM5Uc3pgFhtyTzkvS6sSR/view?usp=sharing
Satanichi_smashdown.pdf: https://drive.google.com/file/d/1uqGoWbn0WYM1l__wnGqmO61RegU521_v/view?usp=sharing
It is always good to get the picture of how the latest compressors fare in the TEXTUAL realm. The name of the game is: applying maximum compression strength while aiming at maximum decompression ... speed, heh-heh.
Included compressors:
RAR v.7.00beta3 by Alexander Roshal, Russia;
BR, Brotli v.1.1.0 by Jyrki Alakuijala, Finland;
ZPAQ v.7.15 by Matt Mahoney, America;
GZ, 7zip's GZ v.23.01 by Igor Pavlov, Russia;
BZ2, 7zip's BZ2 v.23.01 by Igor Pavlov, Russia;
7Z, 7zip's 7Z v.23.01 by Igor Pavlov, Russia;
ZSTD v.1.5.5 by Yann Collet aka Cyan, France;
BSC v.3.3.3 by Ilya Grebnov aka Gribok, Russia;
LZSSE by Conor Stokes, Australia;
Satanichi, Sanmayce's texttoy, Bulgaria;
BriefLZ v.1.3.0 by Joergen Ibsen, Denmark.
@IlyaGrebnov @jibsen
Compression command lines:
/bin/time -v ./brotli_1.1.0 -q 11 --large_window=30 "$1"
/bin/time -v ./rarlinux-x64-700b3 a -m5 -md2g "$1".rar "$1"
/bin/time -v ./7zzs a "$1".7z -mx9 -myx9 -m0=LZMA2:d1536m "$1"
/bin/time -v ./7zzs a -tbzip2 -mx=9 "$1.bz2" "$1"
/bin/time -v ./7zzs a -tgzip -mx=9 "$1.gz" "$1"
/bin/time -v ./BSC_3.3.3_AVX2_CLANG_17.0.4_dynamic.elf e "$1" "$1.bsc" -p -b2047 -m0 -e2
/bin/time -v ./zstd-v1.5.5 --ultra -22 --long=31 --zstd=wlog=31,clog=30,hlog=30,slog=26 "$1" -o "$1.zst"
/bin/time -v ./LZSSE_avx2_CLANG.elf -2 -l17 "$1" "$1.lzsse2"
/bin/time -v ./BriefLZ_1.3.0_CLANG_17.0.4_64bit.elf --optimal -b3g "$1" "$1.blz"
/bin/time -v ./zpaq715_sse4.1.elf add "$1.zpaq" "$1" -method 511 -threads 4
/bin/time -v ./"Satanichi_Nakamichi_Vanilla_LITE_DD-128AES_CLANG(17.0.6)_64bit.elf" "$1" "$1.Nakamichi" 20 111000 i
Decompression command lines:
perf stat -d ./brotli_1.1.0 -d -k "$1".br
perf stat -d ./rarlinux-x64-700b3 x "$1".rar
perf stat -d ./7zzs e "$1.7z"
perf stat -d ./7zzs e "$1.bz2"
perf stat -d ./7zzs e "$1.gz"
perf stat -d ./BSC_3.3.3_AVX2_CLANG_17.0.4_dynamic.elf d "$1.bsc" "$1"
perf stat -d ./zstd-v1.5.5 -f --priority=rt -d --long=31 "$1.zst"
perf stat -d ./LZSSE_avx2_CLANG.elf -d "$1.lzsse2" "$1"
perf stat -d ./BriefLZ_1.3.0_CLANG_17.0.4_64bit.elf -d -b3g "$1.blz" "$1"
perf stat -d ./zpaq715_sse4.1.elf x "$1.zpaq" -threads 4
perf stat -d ./"Satanichi_Nakamichi_Vanilla_LITE_DD-128AES_CLANG(17.0.6)_64bit.elf" $1.Nakamichi>$1.NKMCH
Testmachinette: Laptop 'Dzvertcheto' Thinkpad L490, Intel i7-8565U (8) @ 4.600GHz, 64GB, Linux Fedora 39
Testdatafile: SUPRAPIG_Delphi_Classics_Complete_Works_of_128_authors.tar (1,576,788,480 bytes), sha1: 8326b48e3a315f4f656013629226c319fefd483e
+-----------------------------+-----------------+-----------[ Sorted by Walltime ]-+------------------+---------------------+
| Compressor | Compressed size | Walltime / Usertime / Systemtime | Memory footprint | CPU utilization |
+---------------[ SuperFAST ]-+-----------------+----------------------------------+------------------+---------------------+
| BSC_3.3.3_AVX2_CLANG_17.0.4 | 304,827,632 | 1:10.1 / 514.9 / 6.6 | 7,710,336 KB | 743% |
+--------------------[ FAST ]-+-----------------+----------------------------------+------------------+---------------------+
| LZSSE_avx2_CLANG | 572,282,023 | 3:04.8 / 183.3 / 1.1 | 3,331,200 KB | 99% |
| rarlinux-x64-700b3 | 399,313,787 | 3:24.9 / 1388.2 / 3.3 | 7,658,240 KB | 678% |
+------------------[ Normal ]-+-----------------+----------------------------------+------------------+---------------------+
| 7zzs_23.01's bz2 | 414,301,737 | 8:03.3 / 3766.3 / 0.9 | 77,824 KB | 779% |
| 7zzs_23.01's gz | 544,531,970 | 20:09.7 / 1207.4 / 0.3 | 5,376 KB | 99% |
| 7zzs_23.01's 7z | 366,878,089 | 24:45.1 / 1810.0 / 7.2 | 15,963,904 KB | 122% |
| zstd-v1.5.5 | 374,058,071 | 31:29.5 / 1883.6 / 3.3 | 10,314,220 KB | 99% |
+--------------------[ SLOW ]-+-----------------+----------------------------------+------------------+---------------------+
| BriefLZ_1.3.0_CLANG_17.0.4 | 476,307,190 | 49:20.2 / 2945.3 / 10.0 | 32,803,328 KB | 99% |
| zpaq715_sse4.1 | 289,466,679 | 1:04:35 / 3860.0 / 9.0 | 19,023,136 KB | 99% |
| brotli_1.1.0 | 370,294,709 | 1:07:48 / 4057.3 / 4.02 | 9,974,512 KB | 99% |
+---------------[ UltraSLOW ]-+-----------------+----------------------------------+------------------+---------------------+
| Satanichi_CLANG_17.0.6 | 474,713,658 | 209,045 / 54,455 / 19,461 | 64+GB | 0.359 CPUs utilized |
+-----------------------------+-----------------+----------------------------------+------------------+---------------------+
Note01a: Nakamichi thrashes the virtual RAM - the ~61GB of usable RAM minus the Source-Buffer + Target-Buffer (2 + 3 GB) minus the ~67GB of index structures leaves it ~11 gigabytes short of 64GB - as seen by the 6h systemtime.
Note01b: Satanichi monstrously devours physical RAM, like 3TB, in order to flex its muscles.
! RAM needed to house the B-trees (relative to the file being ripped): 44N = 66,224MB; RAM needed to build the B-trees IN ONE PASS: (Target-Buffer = 2,503 MB) x 64 passes = 160,192MB !
So, compression time would be drastically reduced if 230 GB were available. In case all indexes fit in RAM, the encoding speed is 100 KB/s.
Testmachinette: Laptop 'Dzvertcheto' Thinkpad L490, Intel i7-8565U (8) @ 4.600GHz, 64GB, Linux Fedora 39
Testdatafile: SUPRAPIG_Delphi_Classics_Complete_Works_of_128_authors.tar (1,576,788,480 bytes), sha1: 8326b48e3a315f4f656013629226c319fefd483e
+-----------------------------+-----------------+-----------[ Sorted by Walltime ]-+-----------------------+--------------------+----------------------------------+
| Decompressor | Compressed size | Walltime / Usertime / Systemtime | CPU utilization | Instructions | LLC-loads / LLC-load-misses |
+---------------[ UltraFAST ]-+-----------------+----------------------------------+-----------------------+--------------------+----------------------------------+
| LZSSE_avx2_CLANG | 572,282,023 | 0.8 / 0.3 / 0.4 | 1.000 CPUs utilized | 5,276,911,595 | 563,939 / 151,203 |
| LZSSE_avx2_GCC | 572,282,023 | 0.8 / 0.3 / 0.4 | 0.999 CPUs utilized | 5,316,121,126 | 545,723 / 140,118 |
+---------------[ SuperFAST ]-+-----------------+----------------------------------+-----------------------+--------------------+----------------------------------+
| Satanichi_GCC_13.2.1 | 474,713,658 | 2.7 / 1.9 / 0.8 | 1.000 CPUs utilized | 4,211,049,650 | 177,744,185 / 57,211,272 |
| Satanichi_CLANG_17.0.6 | 474,713,658 | 2.7 / 1.9 / 0.8 | 1.000 CPUs utilized | 4,243,632,674 | 179,001,727 / 57,137,600 |
| zstd-v1.5.5 | 374,058,071 | 2.9 / 2.5 / 0.8 | 1.175 CPUs utilized | 19,913,312,819 | 49,593,392 / 6,507,280 |
+--------------------[ FAST ]-+-----------------+----------------------------------+-----------------------+--------------------+----------------------------------+
| brotli_1.1.0 | 370,294,709 | 6.1 / 5.2 / 0.8 | 1.000 CPUs utilized | 17,065,934,394 | 171,518,735 / 82,851,994 |
| rarlinux-x64-700b3 | 399,313,787 | 6.7 / 9.3 / 0.9 | 1.531 CPUs utilized | 37,354,158,966 | 166,230,623 / 85,640,325 |
| BriefLZ_1.3.0_CLANG_17.0.4 | 476,307,190 | 6.9 / 6.0 / 0.8 | 1.000 CPUs utilized | 27,125,792,763 | 88,295,646 / 31,016,221 |
| BriefLZ_1.3.0_GCC_13.2.1 | 476,307,190 | 8.1 / 7.2 / 0.8 | 1.000 CPUs utilized | 31,513,004,141 | 90,967,111 / 32,762,390 |
| 7zzs_23.01's gz | 544,531,970 | 8.8 / 8.4 / 0.3 | 1.000 CPUs utilized | 60,531,034,012 | 1,131,330 / 129,222 |
| 7zzs_23.01's 7z | 366,878,089 | 14.5 / 13.5 / 0.8 | 1.000 CPUs utilized | 76,506,480,464 | 143,437,881 / 68,732,482 |
| 7zzs_23.01's bz2 | 414,301,737 | 19.2 / 28.3 / 0.4 | 1.509 CPUs utilized | 132,876,974,414 | 1,340,889,710 / 11,315,495 |
| BSC_3.3.3_AVX2_CLANG_17.0.4 | 304,827,632 | 29.8 / 213.1 / 4.1 | 7.347 CPUs utilized | 604,969,912,535 | 2,348,629,362 / 1,233,981,644 |
+--------------------[ SLOW ]-+-----------------+----------------------------------+-----------------------+--------------------+----------------------------------+
| zpaq715_sse4.1 | 289,466,679 | 4031.6 / 4000.1 / 9.6 | 1.000 CPUs utilized | 24,939,199,778,486 | 136,354,757,447 / 28,877,270,011 |
+-----------------------------+-----------------+----------------------------------+-----------------------+--------------------+----------------------------------+
Note01: The Walltime includes LOAD-DECOMPRESS-DUMP times, that is, external-RAM -> internal-RAM -> external-RAM.
Note02: The decompression is done on a RamDisk of size 32GB; both the compressed and the decompressed files reside on it.
Note03: A comparison was made: each decompressed file was compared with the original.
Note04a: The last column is quite informative latencywise; the Last-Level-Cache miss count indicates how much physical RAM (and the cache hierarchy) stalls the CPU.
Note04b: For instance, 177,744,185 / 57,211,272 = 3.1, i.e. roughly every third attempt to load from the Last-Level-Cache misses; it says that with a bigger L3 (the i7-8565U has 8 MB), Nakamichi's main bottleneck would have less impact.
Note05: Decompression times are the fastest of three runs, with an enforced 7-second sleep in between in order to cool off.
Note06: Another useful measure is DIPB, which stands for Decompression-Instructions-Per-Byte; since Nakamichi is simplistic and uses no entropy stage, it has the lowest DIPB: 4,211,049,650 / 1,576,788,480 = 2.67.
Note07: The whole Read-Decompress-Write trio is done on a RAM disk, created as follows:
sudo mkdir /tmp/ramdisk
sudo chmod 777 /tmp/ramdisk
sudo mount -t tmpfs -o size=32G myramdisk /tmp/ramdisk
#sudo umount /tmp/ramdisk/
Note08: Joergen's BriefLZ was compiled with these lines:
gcc -O3 -fomit-frame-pointer -fstrict-aliasing -o BriefLZ_1.3.0_GCC_13.2.1_64bit.elf -I../include blzpack.c parg.c ../src/brieflz.c ../src/depack.c ../src/depacks.c -D_N_HIGH_PRIORITY
clang -O3 -fomit-frame-pointer -fstrict-aliasing -o BriefLZ_1.3.0_CLANG_17.0.4_64bit.elf -I../include blzpack.c parg.c ../src/brieflz.c ../src/depack.c ../src/depacks.c -D_N_HIGH_PRIORITY
Bottomlines:
Obviously, Whiskey Lake rocks, being only a 25W part.
Oh, I wanted to include Fabrice Bellard's superthrasher NNCP ... some night.
2023-Dec-30, Kaze (sanmayce@sanmayce.com)
Hi Conor, many thanks for the undreamt-of performance of your LZSSE2, simply the FASTEST decompressor!
This is not an issue but feedback; I wish this site had a [Feedback] section as well. My wish was to have LZSSE2 in the form of a console tool too, so I attempted to make one, though not as it ought to be done (your etude - your tool, that's the right combination). However, I wanted level 17 in my textual comparisons, so I just embedded LZSSE2 into my fastest (old, 1MB sliding window) Nakamichi; the result:
LZSSE2 excels at:
Overall, significantly better everywhere - LZSSE2 is superior to Tengu, hands down. In my benchmarks with Hamid's TurboBench (from Feb 21), LZSSE2 level 16 decompresses 2x faster than Nakamichi 'Goldenboy'! With Haswell and above I expect 3x, even 4x. For more tests (console dumps), you may see my compression logs/notes (far from finished) at: www.sanmayce.com/Downloads/The-Last-Stand_booklet.pdf
Also, in the www.sanmayce.com/Downloads/TEXTUAL_MADNESS.zip package I made one .bat file running 12 compressors against a given file, thus giving a quick look at where one ranks:
Performers:
Level 17 gives excellent tightness and incredible decompression speed (i5-2430M @3GHz, DDR3 @666MHz):
The SSE4.1 and AVX .cod files are included (Assembly, that is) - do you see the register utilization/distribution you intended? In the AVX code I see 4466 lines for the LZSSE2_Decompress procedure, while the SSE4.1 one amounts to 4819; how does this translate into speed, say, on Haswell? On the i5-2430M @3GHz, DDR3 @666MHz I see no speed difference at all:
And to mix the Level 17 results with TurboBench's:
Very strange: decompression speed differs a lot between Hamid's bench and mine (my trial count is 64) - with 'dickens', Intel 15.0 is 2x faster than GCC 5.3.0, or am I wrong?! Also, I have no clue why with 'alice29' my bench gives a miserable 53 MB/s whereas TurboBench reports 1810.58 MB/s?! That's why I told you that my knowledge is inferior; I failed to offer a reliable bench. Maybe I will replace clock() with:
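Something along these lines - the exact snippet I had in mind is not reproduced here, but the usual higher-resolution replacement on Windows is QueryPerformanceCounter, with many repetitions so that even alice29-sized inputs accumulate measurable work. A minimal sketch:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);    /* counter ticks per second */

    QueryPerformanceCounter(&t0);
    /* ... repeat the decompression call many times here, then divide
       the elapsed time by the number of repetitions ... */
    QueryPerformanceCounter(&t1);

    printf("%.6f s\n",
           (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart);
    return 0;
}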
I still don't understand the decompression code, even partially; yet at first glance the code generated by Intel 15.0 is tight and makes full use of the registers, no?!
I cannot say "keep up the fantastic work" since you have already made good.
Best, Sanmayce